# How to Perform a Shapiro-Wilk Test in Python

The Shapiro-Wilk test is used to check whether a random sample of data comes from a normal distribution. Normality is a common assumption in many statistical procedures, including regression, ANOVA, and the t-test.

The Shapiro-Wilk test was proposed in 1965 by Samuel Sanford Shapiro and Martin Wilk.

It is widely regarded as a reliable statistical test of normality.

In this article, we will discuss how to perform a Shapiro-Wilk test in Python with different examples.

## SciPy for the Shapiro-Wilk Test

We will use the SciPy library to perform the Shapiro-Wilk test.

If you don't have the `scipy` package installed, install it by running the following command in a terminal or command prompt:

`pip install scipy`

## How to Perform a Shapiro-Wilk Test in Python?

The `scipy` library provides the `scipy.stats.shapiro()` function to perform the Shapiro-Wilk test.

`scipy.stats.shapiro(x)`

where:

`x`: an array-like object (such as a list or NumPy array) containing the sample data.
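As a quick illustration of the call, here is a minimal sketch that passes a plain Python list to `shapiro()`. The sample values are hypothetical and chosen only to show the call; the sketch also assumes a reasonably recent SciPy version, where the returned object exposes `statistic` and `pvalue` attributes in addition to supporting tuple unpacking.

```python
from scipy.stats import shapiro

# Hypothetical sample values, used only to illustrate the call
sample = [2.1, 2.5, 1.9, 2.3, 2.8, 2.0, 2.4, 2.6, 2.2, 2.7]

result = shapiro(sample)

# The result can be read by attribute name...
print(result.statistic, result.pvalue)

# ...or unpacked like a plain tuple
stat, p = result
```

Tuple unpacking is the form used in the rest of this article.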

This function returns a test statistic and the corresponding p-value. The null hypothesis of the test is that the sample comes from a normal distribution. We interpret the p-value using the decision rule below.

Decision rule:

• If the p-value ≤ α, we reject the null hypothesis, i.e. we conclude that the distribution of our variable is not normal/Gaussian.
• If the p-value > α, we fail to reject the null hypothesis, i.e. we have no evidence against normality and treat the distribution of our variable as normal/Gaussian.

where α is the chosen significance level.
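The decision rule can be sketched as a small helper function. The function name `interpret_p_value` and the default `alpha=0.05` are assumptions made for illustration; they are not part of SciPy.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Apply the decision rule to a p-value at significance level alpha.

    Hypothetical helper for illustration; not part of scipy.
    """
    if p_value <= alpha:
        return "reject H0: the sample does not look normally distributed"
    return "fail to reject H0: no evidence against normality"


# Illustrative p-values only
print(interpret_p_value(0.20))
print(interpret_p_value(0.01))
```

Note that a p-value exactly equal to α falls on the "reject" side, matching the ≤ in the rule.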

Let's look at some examples of performing the Shapiro-Wilk test in Python.

## Example 1: Shapiro-Wilk Test on Normal Data

In this example, we will generate a random, normally distributed dataset and run the test on it using the Python code below.

```
# import modules
import numpy as np
from scipy.stats import shapiro

# seed the random number generator so the same sample is generated on every run
np.random.seed(0)

# generate a sample of 150 values from a normal distribution with mean = 0 and standard deviation = 1
mean1 = 0
sd1 = 1

data = np.random.normal(mean1, sd1, 150)

# perform the Shapiro-Wilk test
stat, p = shapiro(data)

print("The Test-Statistic and p-value are as follows:\nTest-Statistic = %.3f , p-value = %.3f" % (stat, p))
```

In the above code, we first import the `numpy` package and use its `random.normal()` function to generate a normally distributed array.

The `shapiro()` function from the `scipy` library performs the Shapiro-Wilk test on the data. It returns the test statistic and the corresponding p-value.

Here we assume a significance level of 0.05 (i.e. a 95% confidence level).

The output of the above code is shown below:

```
The Test-Statistic and p-value are as follows:
Test-Statistic = 0.990 , p-value = 0.345
```

Since the p-value of 0.345 is greater than 0.05, we fail to reject the null hypothesis, i.e. we do not have sufficient evidence to say that the sample does not come from a normal distribution.

This is what we expect, since we generated the sample with the `normal()` function from the `numpy` library.
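The result above comes from a single seeded sample. As a rough sketch of what the significance level means in practice, the code below repeats the test on many freshly generated normal samples and counts how often normality is rejected. The number of repetitions (`n_repeats`) and the use of `np.random.default_rng` are choices made for this illustration, not part of the example above.

```python
import numpy as np
from scipy.stats import shapiro

alpha = 0.05
n_repeats = 200  # hypothetical number of repetitions, chosen for illustration
rng = np.random.default_rng(0)

rejections = 0
for _ in range(n_repeats):
    # each sample is drawn from a standard normal distribution
    sample = rng.normal(loc=0, scale=1, size=150)
    _, p = shapiro(sample)
    if p <= alpha:
        rejections += 1

print(f"Rejected normality in {rejections} of {n_repeats} normal samples")
```

Because the data really are normal, any rejection here is a false positive, and with α = 0.05 we would expect roughly 5% of the samples to be rejected by chance.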

Now, let's take a look at a visual representation of the above dataset using the following code.

```
# import modules
import numpy as np
import matplotlib.pyplot as plt

# seed the random number generator so the same sample is generated on every run
np.random.seed(0)

# generate a sample of 150 values from a normal distribution with mean = 0 and standard deviation = 1
mean1 = 0
sd1 = 1
data = np.random.normal(mean1, sd1, 150)

# plot the histogram with 10 bins
count, bins, ignored = plt.hist(data, 10)
plt.show()
```

We use the `matplotlib.pyplot` module to visualize the generated values. Its `hist()` function draws a histogram of the sample, here with 10 bins.

The histogram also shows that the distribution is fairly bell-shaped with one peak in the center of the distribution, which is typical of data that is normally distributed.

## Example 2: Shapiro-Wilk Test on Non-Normal Data

In this example, we will generate a random sample from the Poisson distribution and run the test on it using the Python code below.

```
# import modules
import numpy as np
from scipy.stats import shapiro

# seed the random number generator so the same sample is generated on every run
np.random.seed(1)

# generate a sample of 100 values from a Poisson distribution with mean = 6
mean1 = 6

data = np.random.poisson(mean1, 100)

# perform the Shapiro-Wilk test
stat, p = shapiro(data)

print("The Test-Statistic and p-value are as follows:\nTest-Statistic = %.3f , p-value = %.3f" % (stat, p))
```

In the above code, we import the `numpy` package and use its `random.poisson()` function to generate a Poisson-distributed dataset.

The `shapiro()` function from the `scipy` library performs the Shapiro-Wilk test on the data. It returns the test statistic and the corresponding p-value.

Here we assume a significance level of 0.05 (i.e. a 95% confidence level).

The output of the above code is shown below:

```
The Test-Statistic and p-value are as follows:
Test-Statistic = 0.971 , p-value = 0.026
```

Since the p-value of 0.026 is less than 0.05, we reject the null hypothesis, i.e. we have sufficient evidence to say that the sample does not come from a normal distribution.

This is what we expect, since we generated the sample from a Poisson distribution with the `poisson()` function from the `numpy` library.

Now, let's take a look at a visual representation of the above dataset using the following code.

```
# import modules
import numpy as np
import matplotlib.pyplot as plt

# seed the random number generator so the same sample is generated on every run
np.random.seed(1)

# generate a sample of 100 values from a Poisson distribution with mean = 6
mean1 = 6

data = np.random.poisson(mean1, 100)

# plot the histogram with 30 bins
count, bins, ignored = plt.hist(data, 30)
plt.show()
```

We use the `matplotlib.pyplot` module to visualize the generated values. Its `hist()` function draws a histogram of the sample, here with 30 bins.