Chi-Square Test

In [3]:
from IPython.display import Image
Image(r'C:\thisAKcode.github.io\Pelican\content\images\chi_health0.png',  width = 600)
Out[3]:

Chi-Square Test

Sal Khan’s tutorial on YouTube demonstrates how to check whether two herbs really help a group of people avoid the flu, and it helped me understand the chi-squared test.

In the illustration above, 1 stands for herb 1, 2 for herb 2, and _ for the placebo. Red/green colors represent healthy/sick individuals.

Use the chi-squared test when your observations fall into discrete classes that can be arranged in a contingency table. At this point I only have a bunch of individuals, so I create the contingency table by grouping the data by herb and sickness status and counting the number of patients in each group.

| Outcome \ Treatment | Herb1 | Herb2 | Placebo | Total |
|---------------------|-------|-------|---------|-------|
| sick                | 20    | 30    | 30      | 80    |
| not sick            | 100   | 110   | 90      | 300   |
| Total               | 120   | 140   | 120     | 380   |
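The grouping and counting step can be sketched with `pandas.crosstab`. This is only an illustration: the per-patient records below are rebuilt from the counts in the table, and the names `counts`, `patients`, and `table` are hypothetical, not taken from the original dataset.

```python
import pandas as pd

# Hypothetical per-patient records, rebuilt from the counts in the table above
counts = {
    ("Herb1", "sick"): 20, ("Herb1", "not sick"): 100,
    ("Herb2", "sick"): 30, ("Herb2", "not sick"): 110,
    ("Placebo", "sick"): 30, ("Placebo", "not sick"): 90,
}
records = [
    {"treatment": treatment, "outcome": outcome}
    for (treatment, outcome), n in counts.items()
    for _ in range(n)
]
patients = pd.DataFrame(records)  # one row per patient

# Group by outcome and treatment, count patients, and add the margins
table = pd.crosstab(patients["outcome"], patients["treatment"],
                    margins=True, margins_name="Total")
print(table)
```

Note that `crosstab` sorts the row and column labels alphabetically, so the row order may differ from the table above.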

Can you see any difference in the effect of herbs on protection against the flu?

Under the null hypothesis, denoted H0, we assume that the herbs do nothing. In our example, H0 states that the proportion of people who get sick is the same whether they take herb 1, herb 2, or the placebo.

If H0 holds, the differences between the groups in the table arose by chance: the study just happened to include more sick people in some groups than in others.
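To make H0 concrete, here is a small sketch that compares the observed sick rate in each group with the pooled rate that all groups would share under H0. The variable names are illustrative, and the counts are those of the contingency table.

```python
import numpy as np

# Observed counts: rows are sick / not sick, columns are Herb1, Herb2, Placebo
observed = np.array([[20, 30, 30],
                     [100, 110, 90]])

group_totals = observed.sum(axis=0)               # patients per treatment
sick_rate = observed[0] / group_totals            # observed sick rate per group
pooled_rate = observed[0].sum() / observed.sum()  # common rate assumed under H0

for name, rate in zip(["Herb1", "Herb2", "Placebo"], sick_rate):
    print(f"{name}: {rate:.3f}")
print(f"Pooled: {pooled_rate:.3f}")
```

The observed rates are 20/120, 30/140, and 30/120, against a pooled rate of 80/380. The test asks whether gaps of that size are plausible under chance alone.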

In contrast, the hypothesis under which the herbs do have an effect is known as the alternative hypothesis, denoted H1. Here H1 states that taking the herbs changes (increases or decreases) the likelihood of getting the flu.

Significance level

The significance level is the upper bound on the probability of rejecting H0 when it is actually true. We use a level of 5%: if the p-value is 5% or above, we do not reject H0, because we do not want to reject it unless we have to.
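One way to see what the 5% level means is a small simulation, sketched here under stated assumptions: every group shares the pooled sick rate of 80/380 (so H0 is true by construction), the group sizes match the table, and the seed and number of trials are arbitrary choices. The share of simulated studies that reject H0 should then land near 5%.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)           # fixed seed, an arbitrary choice
group_sizes = np.array([120, 140, 120])  # column totals from the table
pooled_sick_rate = 80 / 380              # H0: every group shares this rate
alpha = 0.05
n_trials = 2000

rejections = 0
for _ in range(n_trials):
    # Draw the number of sick patients in each group under H0
    sick = rng.binomial(group_sizes, pooled_sick_rate)
    table = np.vstack([sick, group_sizes - sick])
    p_value = chi2_contingency(table)[1]
    rejections += p_value < alpha

false_positive_rate = rejections / n_trials
print(f"Share of simulated studies that reject H0: {false_positive_rate:.3f}")
```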

Hands-on: chi-square statistic for the contingency table

Assuming the null hypothesis, we can work out what counts we would expect in each cell. To illustrate the chi-squared statistic on the herb dataset, we first import the necessary libraries and load the data.

In [4]:
import scipy
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
data = r'C:\thisAKcode.github.io\Pelican\content\other\herbs.csv'

df = pd.read_csv(data, encoding='utf-8')
df
Out[4]:
treatement Herb1 Herb2 Placebo Total
0 outcome NaN NaN NaN NaN
1 sick 20.0 30.0 30.0 80.0
2 not sick 100.0 110.0 90.0 300.0
3 Total 120.0 140.0 120.0 380.0

You see the table with the counts of sick and healthy outcomes for each herb, as well as the margins. We can pass these counts to the chi2_contingency function to perform the chi-squared test, but first let's see how the expected counts can be computed from scratch. The code is taken almost unchanged from Introduction to Correspondence Analysis.

In [17]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Slice the DataFrame to keep only the counts, without the margins
subdf = df.iloc[1:3, 1:4]
np_sub = np.array(subdf)
gt = np.sum(np_sub)  # grand total
corresp_mtx = np.divide(np_sub, gt)  # cell proportions
row_tot = np.sum(corresp_mtx, axis=1)  # row totals (right marginal)
column_tot = np.sum(corresp_mtx, axis=0)  # column totals (bottom marginal)
# Independence model: out[i, j] = row_tot[i] * column_tot[j]
independence_model = np.outer(row_tot, column_tot)
# Scale the proportions back to expected counts
inde_dot = (gt * independence_model).round(decimals=1)
treatments = ['herb1', 'herb2', 'placebo']
outcomes = ['sick', 'not sick']
df = pd.DataFrame(data=np_sub,
                  columns=treatments,
                  index=outcomes)
In [31]:
np_sub, inde_dot
Out[31]:
(array([[ 20.,  30.,  30.],
        [100., 110.,  90.]]), array([[ 25.3,  29.5,  25.3],
        [ 94.7, 110.5,  94.7]]))
In [32]:
# Let scipy.stats do the same computation in a single call
from scipy.stats import chi2_contingency

statistic, prob, dof, ex = chi2_contingency(np_sub)
statistic, prob, dof, ex
Out[32]:
(2.5257936507936507,
 0.2828335193186947,
 2,
 array([[ 25.26315789,  29.47368421,  25.26315789],
        [ 94.73684211, 110.52631579,  94.73684211]]))

Chi-Squared Formula

The chi-square test of independence checks whether the distribution across the categories of one variable V1 (health status) differs from one category to another of a second variable V2 (herbs).

$$\chi^2 = \sum \frac {(O - E)^2}{E}$$

Here O and E stand for the observed and expected counts, where the expected counts are those predicted by the independence model. The sum runs over all cells of the contingency table.
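As a check, the formula can be applied directly with NumPy. This is a minimal sketch using the counts of the contingency table; the variable names are illustrative. The result should agree with the statistic returned by chi2_contingency in Out[32].

```python
import numpy as np

# Observed counts O: rows are sick / not sick, columns are Herb1, Herb2, Placebo
observed = np.array([[20., 30., 30.],
                     [100., 110., 90.]])

row_tot = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_tot = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# Expected counts E under independence: row total * column total / grand total
expected = row_tot * col_tot / grand_total

# Chi-squared statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(chi2_stat)
```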

chi2_contingency returns the chi-squared statistic, the p-value, the degrees of freedom, and the expected counts. If the p-value is below a chosen threshold (typically 0.05, and also here), we conclude that the observed counts differ significantly from the expected ones, indicating that the treatment is associated with sickness status. Here the p-value is about 0.28, well above 0.05, so we cannot reject H0: the data do not show that either herb affects the chance of getting the flu.
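The decision rule can be sketched with scipy.stats.chi2, starting from the statistic and degrees of freedom shown in Out[32]. The names `alpha`, `critical_value`, and `reject_h0` are illustrative. Comparing the p-value with the threshold is equivalent to comparing the statistic with the critical value of the chi-squared distribution.

```python
from scipy.stats import chi2

statistic = 2.5257936507936507  # chi-squared statistic from Out[32]
dof = 2                         # (rows - 1) * (columns - 1) = (2 - 1) * (3 - 1)
alpha = 0.05                    # significance level

p_value = chi2.sf(statistic, dof)          # P(chi-squared >= statistic) under H0
critical_value = chi2.ppf(1 - alpha, dof)  # statistic needed to reject at level alpha
reject_h0 = p_value < alpha

print(f"p-value: {p_value:.4f}")
print(f"critical value: {critical_value:.3f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```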
