Correspondence Analysis¶
Introduction¶
An introduction to correspondence analysis can be found in the book "Exploratory Multivariate Analysis by Example Using R" by François Husson. It is worth reading and, together with the accompanying tutorial series on YouTube, it is a great source for making the basics of correspondence analysis clear.
If you prefer Python, two good starting points for Correspondence Analysis (CA) are Introduction to Correspondence Analysis and Prince - Python factor analysis library.
Main Idea¶
When working with a dataset of $n$ items described by two categorical variables $V_1$ and $V_2$, you need to find their interconnectedness. Technically, you need to calculate how far the relationship between $V_1$ and $V_2$ is from independence. The data for this analysis is prepared as a "contingency table", including the sum of the terms in each row (the margin column, placed at the right) and the sum of the terms in each column (the margin row, at the bottom).
Main Table: Nobel Prize winners by country and category

| Country | Chemistry | Economic sciences | Literature | Medicine | Peace | Physics | Total |
|---|---|---|---|---|---|---|---|
| Germany | 24 | 1 | 8 | 18 | 5 | 24 | 80 |
| Canada | 4 | 3 | 2 | 4 | 1 | 4 | 18 |
| France | 8 | 3 | 11 | 12 | 10 | 9 | 53 |
| UK | 23 | 6 | 7 | 26 | 11 | 20 | 93 |
| Italy | 1 | 1 | 6 | 5 | 1 | 5 | 19 |
| Japan | 6 | 0 | 2 | 3 | 1 | 11 | 23 |
| Russia | 4 | 3 | 5 | 2 | 3 | 10 | 27 |
| US | 51 | 43 | 8 | 70 | 19 | 66 | 257 |
| Total | 121 | 60 | 49 | 140 | 51 | 149 | 570 |
The data is pulled from François Husson's GitHub and covered in the tutorial series on YouTube. It looks at the distribution of Nobel Prize wins by category and country. Each cell contains the frequency of one particular combination of values of the two variables $V_1$ (Country) and $V_2$ (Prize category), and each combination is mutually exclusive.
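As a minimal sketch of how such a table can be assembled in pandas (the variable name counts is my own), the margins are just row and column sums appended to the raw counts:

import pandas as pd

# raw counts from the table above
counts = pd.DataFrame(
    [[24, 1, 8, 18, 5, 24],
     [4, 3, 2, 4, 1, 4],
     [8, 3, 11, 12, 10, 9],
     [23, 6, 7, 26, 11, 20],
     [1, 1, 6, 5, 1, 5],
     [6, 0, 2, 3, 1, 11],
     [4, 3, 5, 2, 3, 10],
     [51, 43, 8, 70, 19, 66]],
    index=['Germany', 'Canada', 'France', 'UK', 'Italy', 'Japan', 'Russia', 'US'],
    columns=['Chemistry', 'Economic sciences', 'Literature', 'Medicine', 'Peace', 'Physics'])
counts['Total'] = counts.sum(axis=1)      # margin column: row sums x_i.
counts.loc['Total'] = counts.sum(axis=0)  # margin row: column sums x_.j (corner is n = 570)
counts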
Why Correspondence¶
"Correspondence analysis" (CA) result from the fact that tables are analysed by linking two corresponding sets: those by rows and by columns. There is no better illustration that the one that F. Husson give us by comparing ther row profiles with the column profiles as percentages: Introduction to Correspondence Analysis Video 1:
from IPython.display import Image, display
# display() shows both images; with two bare Image() calls only the last one renders
display(Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_row_profile.PNG', width=500),
        Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_col_profile.PNG', width=600))
In a contingency table each cell holds the number of individuals possessing category $i$ of $V_1$ and category $j$ of $V_2$ (see image below). The sums of the table are the margins. As you might have noticed, it is hard and sometimes impossible to distinguish differences in the frequencies of combinations just by looking at the data. For that, correspondence analysis uses the $\chi^2$ statistic. We continue with the Nobel Prize example; examining and visualizing the relationship (or its absence) between the variables in this dataset is, as usual, not easy.
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_contingency.png', width = 600)
Recipe: take a two-way cross-table, ask a question, compute conditional probabilities, apply the $\chi^2$ test, etc.¶
For the Nobel Prize data, $\chi^2$ helps to find differences in the winnings by country in each Nobel Prize category. The $\chi^2$ test is about saying how significantly the distribution of categories differs from one country to another.
Question to answer:
- is there a relationship between country and type of Nobel Prize: do some countries tend to win prizes in certain disciplines / are prize categories more likely to go to one country than another?
The image above shows how a contingency table is built. The sums of the table along rows and down columns (the margins) are denoted by $x_{i•}$, $x_{•j}$ or $x_{••}$, depending on the dimension(s) over which the sum is carried out: $ x_{i•} = \sum \limits _{j=1} ^{J} x_{ij} $, $ x_{•j} = \sum \limits _{i=1} ^{I} x_{ij} $, $ n = x_{••} = \sum \limits _{i,j} x_{ij} $.
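For example, the margin for the first row of the Nobel Prize table (Germany) is $ x_{1•} = 24 + 1 + 8 + 18 + 5 + 24 = 80 $, and the grand total is $ n = x_{••} = 570 $.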
Conditional Probability. Probability Tables¶
In correspondence analysis, we calculate a probability table. In CA the studied sample is considered not inferentially but as a population in its own right, so the term probability is justified even though it refers to quantities determined from a sample. The table of probabilities is calculated by dividing the value in each cell by $n$; its general term is denoted $f_{ij}$:
$f_{ij} = x_{ij}/n $
The general term $f_{ij}$ is the probability of being both in category $i$ of $V_1$ and in category $j$ of $V_2$. The marginal probabilities (the margin column and the margin row) of this table are computed as follows:
$ f_{i•} = \sum \limits _{j=1} ^{J} f_{ij} $, $ f_{•j} = \sum \limits _{i=1} ^{I} f_{ij} $, $ f_{••} = \sum \limits _{i,j} f_{ij} $.
$ f_{••} $ always equals 1, as it must for a true probability distribution.
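For example, for the Germany/Chemistry cell of the Nobel Prize table $ f_{11} = x_{11}/n = 24/570 \approx 0.042 $, and the marginal probability for Germany is $ f_{1•} = 80/570 \approx 0.140 $.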
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_probabilities.PNG', width = 600)
Independence Model¶
For most tables analysed with CA, we are fairly sure that a relationship between the variables exists. Conducting the $ \chi^2 $ test aims to evaluate the strength of that relationship. If we instead start from the absence of a relationship, we get the criterion for the independence model. The independence of two events looks as follows:
$P[A \text{ and } B]=P[A]P[B] $ or the same in another notation: $P(A, B) = P(A)P(B) $.
The independence model is applicable to the categorical variables in a contingency table. Two events are independent if the probability that they both happen is the product of the probabilities that each one happens.
Two categorical variables are considered independent if they verify:
$ \forall i, \forall j: f_{ij} = f_{i•}f_{•j} $
According to the independence model, the joint probability $f_{ij}$ depends on the marginal probabilities $f_{i•}$ and $f_{•j}$ alone:
$\frac{f_{ij}}{f_{i•}} = f_{•j} $
$\frac{f_{ij}}{f_{•j}} = f_{i•} $
To illustrate this, we compare the actual sample sizes $(x_{ij} = nf_{ij})$ with the theoretical sample sizes characterised by the independence model $(nf_{i•}f_{•j})$.
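For the Germany/Chemistry cell, the theoretical count under independence is $ nf_{1•}f_{•1} = 570 \times \frac{80}{570} \times \frac{121}{570} \approx 17 $, noticeably below the 24 prizes actually observed.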
Python Code For CA instead of math¶
We just spent some time on the math notation; here is the Python code that expresses the same thing.
The probability matrix as well as the marginal probabilities for the Nobel Prize data are computed in Python as follows. The code is pulled from the article Introduction to Correspondence Analysis:
import pandas as pd
import numpy as np
data_ca = r'C:\thisAKcode.github.io\Pelican\content\other\CSV.csv'
df = pd.read_csv(data_ca, encoding='utf-8')
crosstab_orig = df.iloc[1:9, 1:-1].to_numpy()
countries = df.iloc[1:-1, 0].tolist()
prize_categories = df.columns[1:-1].tolist()
# xi. : sums across each row (row totals)
row_totals = np.sum(crosstab_orig, axis=1)
# x.j : sums down each column (column totals)
column_totals = np.sum(crosstab_orig, axis=0)
# x..
grand_total = np.sum(crosstab_orig)
proba_matrix = np.divide(crosstab_orig, grand_total)
proba_matrixX = pd.DataFrame(
    data=proba_matrix,
    columns=pd.Series(prize_categories),
    index=pd.Series(countries))
proba_grand_total = np.sum(proba_matrix) # equals one
# f.j margin row profile holds the sums at the bottom row
margin_row = np.sum(proba_matrix, axis=0)
# fi. margin column profile holds the sums placed at the right
margin_col = np.sum(proba_matrix, axis=1)
independence_model = np.outer(margin_col, margin_row) # fi.f.j
inde_dot = (grand_total * independence_model).round(decimals=0)  # nfi.f.j where grand_total is n
independenceX = pd.DataFrame(
data=independence_model,
columns=pd.Series(prize_categories),
index=pd.Series(countries)
)
# Here I would like to quickly show how to plot the cross table
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
crosstab_df = pd.DataFrame(crosstab_orig, index=countries, columns=prize_categories)
pd_ = crosstab_df.apply(lambda x: x / sum(x) * 100).round(2)  # column profiles in %
cumulative_percent = np.cumsum(pd_, axis=0)
ax = pd_.T.plot(kind='bar', stacked=True, figsize=(8, 5))
ax.set_ylabel('% of prizes within each category')
plt.show()
cumulative_percent
The same quantities come right out of the box if you use scipy. For instance, you can get the expected frequencies under the independence model:
from scipy.stats import chi2_contingency as chi2
statistic, prob, dof, indep_sample = chi2(crosstab_orig)
indep_sample
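As a quick sanity check (a sketch reusing the variables defined above), the expected frequencies returned by scipy should coincide with the $nf_{i•}f_{•j}$ table we computed by hand:

import numpy as np
# expected counts from scipy vs. n * fi. * f.j computed manually above
np.allclose(indep_sample, grand_total * independence_model)  # expected: True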
Use prince to do CA¶
Prince is a well-known library for factor analysis: https://pypi.org/project/prince/. Below you can see the Nobel Prize dataset analysed using prince.
import pandas as pd
import prince
pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))
df = pd.read_csv(data_ca, encoding='utf-8')
crosstab_orig = df.iloc[1:9, 1:-1].to_numpy()
countries = df.iloc[1:-1, 0].tolist()
prize_categories = df.columns[1:-1].tolist()
# build the contingency table as a DataFrame for prince, see https://pypi.org/project/prince/ (Correspondence analysis (CA))
X = pd.DataFrame(data=crosstab_orig,
columns=pd.Series(prize_categories),
index=pd.Series(countries))
Zoom in on the independence model¶
Below you can see what the input data looks like compared to the values calculated for the independence model. First, look at the actual input data.
X.astype(int)
Below, the hypothetical values under the independence model show a distribution of prizes that does not differ from one country to another. Here the probabilities were calculated as products of the marginal probabilities.
(independenceX*grand_total).astype(int)
We still face the problem of being unable to spot the difference between those two tables by eye: is there a meaningful difference between the observed values and the independence model? That is where CA resorts to the $\chi^2$ test.
Use Chi-square to detect a deviation from independence¶
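As a minimal sketch (reusing the variables computed earlier), the $\chi^2$ statistic sums the squared deviations between observed and expected counts, each scaled by the expected count; it should match the statistic returned by chi2_contingency above:

import numpy as np
expected = grand_total * independence_model  # n * fi. * f.j
chi2_by_hand = np.sum((crosstab_orig - expected) ** 2 / expected)
chi2_by_hand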
Prince comes in¶
ca = prince.CA(n_components=2,
n_iter=3,
copy=True,
check_input=True,
engine='auto',
random_state=42)
X.columns.rename('Category', inplace=True)
X.index.rename('Country', inplace=True)
ca = ca.fit(X)
ca.row_coordinates(X)
ca.column_coordinates(X)
ax = ca.plot_coordinates(
X=X,
ax=None,
figsize=(6, 6),
x_component=0,
y_component=1,
show_row_labels=True,
show_col_labels=True
)
ca.eigenvalues_
ca.total_inertia_
ca.explained_inertia_
len(ca.eigenvalues_)
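In CA the total inertia equals $\chi^2/n$, so (as a sketch, assuming prince's attributes behave as documented for version 0.7) we can cross-check prince against scipy:

import numpy as np
from scipy.stats import chi2_contingency
stat = chi2_contingency(X.to_numpy())[0]
np.isclose(ca.total_inertia_, stat / X.to_numpy().sum())  # expected: True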
Core of CA¶
Now that the independence model is computed, we are able to put it next to the observed data. Under the independence model, the values in each row are distributed proportionally to the column margins, and the values in each column proportionally to the row margins.
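One way to see this (a sketch using the X DataFrame defined above): under independence every row profile would equal the average profile given by the column margins, so the deviations below are exactly what CA decomposes.

# observed row profiles: each row divided by its row total
row_profiles = X.div(X.sum(axis=1), axis=0)
# average profile: the column margins as proportions
average_profile = X.sum(axis=0) / X.to_numpy().sum()
row_profiles - average_profile  # row-by-row deviations from independence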
A Few Other Examples of Input Data¶
A few examples of contingency tables that you can apply correspondence analysis to are given below.
Table 1: Hair color and eye color

| Eye Color \ Hair Color | Fair | Red | Medium | Dark | Black | Total |
|---|---|---|---|---|---|---|
| Light | 688 | 116 | 584 | 188 | 4 | 1580 |
| Blue | 326 | 38 | 241 | 110 | 3 | 718 |
| Medium | 343 | 84 | 909 | 412 | 26 | 1774 |
| Dark | 98 | 48 | 403 | 681 | 85 | 1315 |
| Total | 1455 | 286 | 2137 | 1391 | 118 | 5387 |
Here school children are classified according to two discrete variables, eye color and hair color. This dataset is from web.stanford.edu: Hair Color and Eye Color
Table 2: Fashion brand perception

| Brand | luxurious | traditional | intellectual | brilliant | calm | youthful | friendly | simple | energetic | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Chanel | 449 | 252 | 106 | 236 | 61 | 13 | 8 | 29 | 16 | 1170 |
| Louis Vuitton | 410 | 286 | 83 | 142 | 80 | 18 | 20 | 48 | 31 | 1118 |
| Christian Dior | 356 | 200 | 95 | 206 | 67 | 19 | 18 | 27 | 9 | 997 |
| Tiffany | 362 | 219 | 103 | 187 | 59 | 55 | 36 | 35 | 10 | 1066 |
| Rolex | 442 | 248 | 114 | 89 | 109 | 4 | 9 | 52 | 12 | 1079 |
| Burberry | 287 | 287 | 143 | 42 | 199 | 29 | 67 | 124 | 9 | 1187 |
| Ralph Lauren | 198 | 191 | 101 | 39 | 147 | 61 | 70 | 100 | 9 | 916 |
| Benetton | 86 | 62 | 31 | 88 | 35 | 216 | 97 | 65 | 21 | 701 |
| Uniqlo | 6 | 7 | 10 | 8 | 23 | 260 | 331 | 199 | 291 | 1135 |
| H&M | 8 | 5 | 10 | 2 | 10 | 272 | 132 | 91 | 223 | 753 |
| GAP | 10 | 10 | 10 | 9 | 24 | 275 | 203 | 137 | 84 | 762 |
| Total | 2614 | 1767 | 806 | 1048 | 814 | 1222 | 991 | 907 | 715 | 10884 |
Here brand perception is classified according to two qualitative variables, brand and associated impression. This dataset is from the private GitHub repo by okomestudio: Fashion brands
Sources¶
- Husson, François; Lê, Sébastien; Pagès, Jérôme. Exploratory Multivariate Analysis by Example Using R (Chapman & Hall/CRC Computer Science & Data Analysis). CRC Press. Kindle Edition.
- Izenman, Alan Julian. Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. source
- https://github.com/MaxHalford/prince
- https://statisticsbyjim.com/probability/contingency-tables-probabilities/