Dependent and Independent Variables in Machine Learning

In today's internet-powered world, there is no dearth of data. A vast majority of businesses are data-driven, and a central task for them is to extract the information hidden in this data.

Manually finding relevant information in the petabytes of data generated every day is like searching for a needle in a haystack. This is why statistical techniques like regression play a significant role: they can help uncover the relationships between different features of the data automatically.

When I say automatically, I do not mean that you will not have to do anything. I mean only that if you supply the right data, regression can give you the right relationship between the variables.

The data consists of many features (or variables). For example, data on houses in a city might contain information such as the area of the house, its location, the number of rooms, the number of floors, the type of property, the financial status of the residents, the price of the house, etc. The first question we need to tackle is which variables are independent and which depend on other variables. In other words, which variables can we predict using regression analysis, and which can we not?

Let us first try to give a formal definition of dependent and independent variables.

Independent Variables: Variables that are not affected by the other variables are called independent variables. For example, the age of a person is an independent variable: two people born on the same date will have the same age, irrespective of how they have lived. We presume that independent variables are stable and cannot be manipulated by other variables, but that they might cause a change in other variables; thus they are the presumed cause.

Dependent Variables: Variables that depend on other variables or factors are called dependent variables. We expect them to change when the independent variables on which they depend change. They are the presumed effect. For example, say you have a test tomorrow: your test score depends on the amount of time you studied, so the test score is the dependent variable and the study time is the independent variable.
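To make the test-score example concrete, here is a minimal sketch in plain Python that fits a straight line predicting the dependent variable (score) from the independent variable (hours studied). The numbers are made up for illustration, and the helper name fit_line is an assumption of this sketch, not something from a library.

```python
# Hypothetical data: hours studied (independent) and test score (dependent).
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
test_scores = [52.0, 58.0, 61.0, 70.0, 74.0]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(hours_studied, test_scores)

# Use the fitted line to predict the dependent variable for a new input.
predicted_score = slope * 6.0 + intercept
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score after 6 hours: {predicted_score:.1f}")
```

Note the direction of the fit: we predict the score from the hours, not the hours from the score, because the score is the presumed effect.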

Dependent or Independent?

How do we know which variables are dependent and which are independent? There are many ways, and we will explore them using some of the standard machine learning datasets.

Boston House Price Dataset

The dataset was originally part of the UCI Machine Learning Repository, from which it has since been removed. It has shipped with several machine (deep) learning frameworks, such as Scikit-Learn, Keras, and TensorFlow. It consists of 506 samples, each with 13 feature variables. Let us use Scikit-Learn here.

In [2]:
# Load the dataset (load_boston was removed in scikit-learn 1.2, so this needs an older version)
from sklearn.datasets import load_boston
boston_dataset = load_boston()

Let us observe some details about the dataset.

In [3]:
print(boston_dataset.DESCR)
.. _boston_dataset:

Boston house prices dataset

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Now let us consider some of these features. Which of them do you think depend upon others, and which may be completely independent?

For example, take CRIM, the per capita crime rate by town. Is it controlled by any other variable in the dataset? It may be correlated with some of them, but common sense suggests that it does not depend on the other variables in this dataset. However, it might influence the price of the house, 'MEDV', since people may not want to live in high-crime localities.

If you reason about the remaining variables in a similar manner, you will find that in all probability the price is the dependent variable and the rest are independent variables.

Let us visualize the data for confirmation. For easy visualization and processing, we use a pandas DataFrame.

In [4]:
import pandas as pd
# Build a DataFrame from the feature matrix, one column per feature
boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

You can see that MEDV is missing from the above DataFrame. This is because it is a well-known dataset, and scikit-learn stores the dependent variable ('MEDV') separately from the independent variables.

Let us combine them in the same DataFrame, because our aim here is not to do the actual prediction but to understand the data.

In [5]:
boston['MEDV'] = boston_dataset.target
In [6]:
boston.head()
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MEDV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2

Notice the addition of the 'MEDV' column.

One common way of finding out whether variables are interrelated is to compute the correlation between them.

In [7]:
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
correlation_matrix = boston.corr().round(2)
# annot=True prints the correlation value inside each square
sns.heatmap(data=correlation_matrix, annot=True)
[Figure: annotated heatmap of the correlation matrix]

OK, so what you see here is the correlation between the different features. But what the heck does it mean?


Correlation measures the degree of relationship between two variables. For example, the presence of bees and the presence of flowers are correlated: if there are flowers, bees can be seen around them, and if there are bees, flowers will be nearby.

The correlation may be positive (like in case of flowers and bees) or negative. It lies in the range [-1, 1].

When two variables have zero correlation, there is no linear relationship between them. Below you can see the three possible cases.
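As a rough illustration of where these numbers come from, the sketch below computes the Pearson correlation coefficient by hand for three tiny made-up series: one that rises with x, one that falls with x, and one with no linear trend. The function name pearson_r and the data are assumptions of this sketch. The pandas corr() call used earlier computes the same Pearson coefficient by default.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_rising = [2.0, 4.0, 6.0, 8.0, 10.0]   # grows with x
y_falling = [10.0, 8.0, 6.0, 4.0, 2.0]  # shrinks as x grows
y_no_trend = [3.0, 1.0, 4.0, 1.0, 3.0]  # symmetric around the middle of x

r_positive = pearson_r(x, y_rising)
r_negative = pearson_r(x, y_falling)
r_zero = pearson_r(x, y_no_trend)

print(f"positive: {r_positive:+.2f}")
print(f"negative: {r_negative:+.2f}")
print(f"zero:     {r_zero:+.2f}")
```

Dividing by the two standard deviations is what keeps the result inside [-1, 1], whatever the units of the two variables.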


Going back to the Boston house dataset, consider the correlation between 'LSTAT' and 'MEDV': the value is -0.74, so they are negatively correlated. 'RM' and 'MEDV' have a correlation of 0.7, so they are positively correlated. The correlation between 'DIS' and 'MEDV' is 0.21, which is weak and comparatively close to zero. Let us plot them:

In [28]:
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 10), sharey=True)
# Left panel: LSTAT against MEDV
ax[0].scatter(boston['LSTAT'], boston['MEDV'])
ax[0].set_xlabel('LSTAT', fontsize=20.0)
ax[0].set_ylabel('MEDV', fontsize=20.0)
ax[0].set_title("Negative Correlation", fontsize=25.0, fontweight="bold")

# Middle panel: RM against MEDV
ax[1].scatter(boston['RM'], boston['MEDV'])
ax[1].set_xlabel('RM', fontsize=20.0)
ax[1].set_title("Positive Correlation", fontsize=25.0, fontweight="bold")

# Right panel: DIS against MEDV
ax[2].scatter(boston['DIS'], boston['MEDV'])
ax[2].set_xlabel('DIS', fontsize=20.0)
ax[2].set_title("Zero Correlation", fontsize=25.0, fontweight="bold")
Text(0.5, 1.0, 'Zero Correlation')

Causation and correlation

Two variables are related by causation if they have a cause-and-effect relationship. For example, if I increase the pressure on the accelerator, the speed of the car will increase: the pressure on the accelerator is the cause, and the speed is the effect.

Observe that it is not the other way round: I am not increasing the pressure on the accelerator because the speed is increasing.

A common mistake is to assume that if two variables are correlated, then one is the cause of the other. This may be true; for example, flowers are the cause of the bees' presence (but think: can it be the other way round?). It is not, however, a necessity.
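One way to see how correlation can arise without causation is a small simulation, sketched here under made-up assumptions. A hidden variable z drives both x and y. Neither x nor y appears in the other's formula, so neither causes the other, yet they come out strongly correlated. The variable names, noise levels, and seed are all hypothetical choices for this sketch.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 5000

# Hidden common cause: it drives both observed variables.
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [zi + rng.gauss(0.0, 0.5) for zi in z]
y = [zi + rng.gauss(0.0, 0.5) for zi in z]

# x and y are strongly correlated although neither influences the other.
r_raw = pearson_r(x, y)

# Subtract the common cause; only independent noise is left in each series.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_controlled = pearson_r(x_resid, y_resid)

print(f"correlation of x and y:            {r_raw:.2f}")
print(f"after removing the common cause z: {r_controlled:.2f}")
```

Once the shared cause is accounted for, the apparent relationship between x and y largely disappears. A correlation matrix alone cannot reveal this.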

To prove my point, see these funny correlations.

Finally, I would like to end this article with one of my favorite dialogues from MIB-2, a classic example of causation and correlation:


Image: Flower and Bees. Source: Pxhere