# How to Find Correlation in Python (With Examples)

Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. It measures the strength of linear association between two variables.

There are several types of correlation coefficients, but the most popular is Pearson’s correlation coefficient.

The sign and absolute value of Pearson’s correlation coefficient describe the direction and the magnitude of the relationship between two variables.

It always ranges from -1 to +1. The greater the absolute value of the coefficient, the stronger the linear relationship; a value near 0 indicates little or no linear relationship.
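To make the definition concrete, here is a minimal sketch that computes Pearson’s coefficient by hand in plain Python, as the covariance of the two variables divided by the product of their spreads. The helper name `pearson_r` and the small input lists are made up for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: sum of co-deviations divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # perfectly increasing line -> 1.0
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # perfectly decreasing line -> -1.0
print(r_up, r_down)
```

The two extreme cases land exactly on the ends of the range: +1 for a perfectly increasing line and -1 for a perfectly decreasing one.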

In this tutorial, we will discuss how to calculate correlation in Python.

## pip install numpy

If you don’t have the `numpy` package installed on your system, install it by running the command below in a terminal (Command Prompt on Windows).

```
pip install numpy
```

## Example – Positive Correlation in Python

In Python, the `numpy` library provides the `corrcoef()` function to calculate the correlation between two variables.

`corrcoef()` returns a matrix of the correlations of x with x, x with y, y with x, and y with y. We’re interested in the correlation of x with y, which sits at position (0, 1) or, equivalently, (1, 0).
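As a quick illustration of that layout, the sketch below runs `corrcoef()` on two tiny hand-written arrays (the values are arbitrary, chosen only so the result is easy to check) and inspects the shape, the diagonal, and the two off-diagonal entries.

```python
import numpy as np

# Small, arbitrary arrays used only to show the layout of the result
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

matrix = np.corrcoef(x, y)
print(matrix.shape)   # two variables give a 2 x 2 matrix
print(matrix)

# The diagonal holds x-with-x and y-with-y, the off-diagonal holds x-with-y
r_xy = matrix[0, 1]
r_yx = matrix[1, 0]
print(r_xy, r_yx)
```

The diagonal entries are always 1 because every variable is perfectly correlated with itself, and the matrix is symmetric, so either off-diagonal position can be used.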

Let’s see how to calculate the correlation between two variables with the Python code below.

```python
# Import modules
import numpy as np

# Seed the random number generator so the same numbers are produced on every run
np.random.seed(4)

# Create a random array of 500 integers from 0 to 49 (the upper bound 50 is exclusive)
x = np.random.randint(0, 50, 500)

# Create the second array by adding normally distributed noise to the first
y = x + np.random.normal(0, 10, 500)

correlation = np.corrcoef(x, y)
# Print the result
print("The correlation between x and y is : \n ", correlation)
```

In the above example, we created two arrays, x and y, using the random functions of the `numpy` library.

The `corrcoef()` function accepts the x and y arrays as input parameters and returns the correlation matrix of x and y.

The above code returns the following output:

```
The correlation between x and y is :
[[1.         0.82477049]
 [0.82477049 1.        ]]
```

The correlation coefficient between these two variables is 0.82477, which is a `strong positive correlation`.

By default, this function returns a matrix of correlation coefficients. If we only want the correlation coefficient between the two variables, we can index the matrix as follows.

```python
print("The correlation coefficient between x and y is :", np.corrcoef(x, y)[0, 1])
```

The above code returns the following output:

```
The correlation coefficient between x and y is : 0.82477049
```

Now, let’s take a look at a scatter plot of the above arrays using the following code.

```python
import matplotlib.pyplot as plt

# In a Jupyter notebook, run the magic command "%matplotlib inline" first;
# it is not valid syntax in a plain Python script.
plt.style.use('ggplot')

plt.scatter(x, y)
plt.show()
```

The plot also shows a strong positive correlation between the variables: as x increases, y tends to increase.

## Example – Negative Correlation in Python

Let’s look at another example: what happens to the correlation if we invert the relationship so that an increase in `x` results in a decrease in `y`?

```python
# Import modules
import numpy as np

# Seed the random number generator so the same numbers are produced on every run
np.random.seed(5)

# Create a random array of 500 integers from 0 to 49 (the upper bound 50 is exclusive)
x = np.random.randint(0, 50, 500)

# Create the second array by subtracting the first from 100 and adding some noise
y = 100 - x + np.random.normal(0, 5, 500)

correlation = np.corrcoef(x, y)[0, 1]
# Print the result
print("The correlation between x and y is : ", correlation)
```

The above code returns the following output:

```
The correlation between x and y is :  -0.9483070198223033
```

The correlation coefficient between these two variables is -0.948307, which is a `strong negative correlation`.

Now, let’s take a look at a scatter plot of the above arrays using the following code.

```python
import matplotlib.pyplot as plt

# In a Jupyter notebook, run the magic command "%matplotlib inline" first;
# it is not valid syntax in a plain Python script.
plt.style.use('ggplot')

plt.scatter(x, y)
plt.show()
```

The plot also shows the `strong negative correlation` between the variables: as x increases, y tends to decrease.
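The two examples so far use different noise levels (a standard deviation of 10 in the positive case and 5 in the negative case), which is part of why their coefficients differ in magnitude. The following sketch, which assumes NumPy's newer `default_rng` generator, an arbitrary seed, and arbitrary noise levels, shows how the coefficient moves toward zero as the noise grows.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
x = rng.integers(0, 50, 5000)    # integers from 0 to 49

results = {}
for noise_sd in (1, 10, 50):     # illustrative noise levels
    # Same inverted relationship as above, with a varying amount of noise
    y = 100 - x + rng.normal(0, noise_sd, x.size)
    results[noise_sd] = np.corrcoef(x, y)[0, 1]
    print(f"noise sd = {noise_sd:>2}: r = {results[noise_sd]:.3f}")
```

The sign stays negative in every case because the direction of the relationship is unchanged; only the strength weakens as the noise drowns out the linear trend.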

## Example – No Correlation in Python

Let’s look at another example: what if there is no correlation between `x` and `y`?

```python
# Import modules
import numpy as np

# Seed the random number generator so the same numbers are produced on every run
np.random.seed(1)

# Create a random array of 1000 integers from 0 to 49 (the upper bound 50 is exclusive)
x = np.random.randint(0, 50, 1000)

# Create another, independent random array of 1000 integers from 0 to 49
y = np.random.randint(0, 50, 1000)

correlation = np.corrcoef(x, y)[0, 1]
# Print the result
print("The correlation between x and y is : \n ", correlation)
```

The above code returns the following output:

```
The correlation between x and y is :
0.004047024772834938
```

The correlation coefficient between these two variables is 0.00404, which is a very small value, indicating `no correlation between these two variables`.

Now, let’s take a look at a scatter plot of the above arrays using the following code.

```python
import matplotlib.pyplot as plt

# In a Jupyter notebook, run the magic command "%matplotlib inline" first;
# it is not valid syntax in a plain Python script.
plt.style.use('ggplot')

plt.scatter(x, y)
plt.show()
```

The plot also shows there is `no correlation between the variables`.
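Because Pearson’s coefficient measures only linear association, a value near zero does not by itself prove that two variables are unrelated. The sketch below uses a deliberately constructed (not random) example in which `y` is completely determined by `x`, yet the coefficient is zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np

# y depends entirely on x, but through a symmetric curve, not a straight line
x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
```

This is one reason the scatter plot is worth checking alongside the number: a curved pattern would be obvious in the plot even though the coefficient reports no linear trend.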

## Example – Find Correlation in Python Pandas

Let’s understand another example where we will calculate the correlation between several variables in a Pandas DataFrame.

For DataFrames in Python, you can simply use the `corr()` function to calculate the correlation.

```python
# Import modules
import numpy as np
import pandas as pd

# Seed the random number generator so the same numbers are produced on every run
np.random.seed(1)

# Create a random DataFrame with 3 columns (X, Y, Z) and 5 rows
data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['X', 'Y', 'Z'])

# Print the data
print("The Dataframe is as follows:\n", data)

# Calculate correlation coefficients for all pairwise combinations
correlation = data.corr()

print("The Calculated Correlation matrix is as follows:\n", correlation)
```

The pandas `corr()` method returns the correlation matrix for every pair of columns in the DataFrame (here X, Y, and Z).

The above code returns the following output:

```
The Dataframe is as follows:
   X  Y  Z
0  5  8  9
1  5  0  0
2  1  7  6
3  9  2  4
4  5  2  4

The Calculated Correlation matrix is as follows:
          X         Y         Z
X  1.000000 -0.506110 -0.215166
Y -0.506110  1.000000  0.927807
Z -0.215166  0.927807  1.000000
```
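As a cross-check, the sketch below rebuilds the same DataFrame and compares `data.corr()` with NumPy's `corrcoef()`. It assumes `rowvar=False` is passed so that NumPy treats each column, rather than each row, as a variable.

```python
import numpy as np
import pandas as pd

# Rebuild the same DataFrame as in the example above
np.random.seed(1)
data = pd.DataFrame(np.random.randint(0, 10, size=(5, 3)), columns=['X', 'Y', 'Z'])

pandas_matrix = data.corr()
# rowvar=False tells NumPy that variables are stored in columns
numpy_matrix = np.corrcoef(data.to_numpy(), rowvar=False)

same = np.allclose(pandas_matrix.to_numpy(), numpy_matrix)
print(same)
```

Both routes give the same 3 x 3 matrix; the pandas version has the advantage of keeping the column labels on the rows and columns of the result.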

If you want to calculate the correlation between two specific variables in the DataFrame, you can specify the columns as shown below:

```python
# Correlation between the X and Y columns
correlation_XY = data['X'].corr(data['Y'])

# Print the result
print("The Correlation between X and Y column is : ", correlation_XY)
```

In the above code, we calculate the correlation between the X and Y columns only. The `correlation coefficient` between these two columns is -0.506110, which is a moderate `negative correlation`. The code returns the result below.

```
The Correlation between X and Y column is :  -0.5061102063618225
```
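The same value can also be read out of the full correlation matrix by label. The sketch below hard-codes the five rows shown earlier, so it runs on its own, and compares the two approaches; the variable names `from_matrix` and `from_series` are illustrative.

```python
import pandas as pd

# The same five rows as the DataFrame printed earlier, written out by hand
data = pd.DataFrame({
    'X': [5, 5, 1, 9, 5],
    'Y': [8, 0, 7, 2, 2],
    'Z': [9, 0, 6, 4, 4],
})

# Look up one pair in the full matrix by row and column label
from_matrix = data.corr().loc['X', 'Y']
# Compute the same pair directly from the two columns
from_series = data['X'].corr(data['Y'])

print(from_matrix, from_series)
```

Computing the pair directly avoids building the whole matrix, which may matter for wide DataFrames; looking it up in the matrix is convenient when several pairs are needed.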

## Conclusion

I hope you find the above article on how to calculate correlation in Python useful and educational.