Correspondence Analysis

Introduction

An introduction to correspondence analysis can be found in the book "Exploratory Multivariate Analysis by Example Using R" by François Husson. It is worth reading and, along with the tutorial series on YouTube, it is a great source that can help make the basics of correspondence analysis clear to you.

If you prefer Python, two sources to start with for Correspondence Analysis (CA) are Introduction to Correspondence Analysis and Prince - Python factor analysis library.

Main Idea

When working with a dataset of n items described by two categorical variables V_1 and V_2, you need to find their interconnectedness. Technically, you need to calculate how far the relationship between V_1 and V_2 is from independence. The data for this analysis is prepared as a "contingency table", including the sum of terms for each row (the row margin) and the sum of terms for each column (the column margin).

Main Table: Nobel Prize winners by country and category

Chemistry Economic sciences Literature Medicin Peace Physics Total
Country
Germany 24 1 8 18 5 24 80
Canada 4 3 2 4 1 4 18
France 8 3 11 12 10 9 53
UK 23 6 7 26 11 20 93
Italy 1 1 6 5 1 5 19
Japan 6 0 2 3 1 11 23
Russia 4 3 5 2 3 10 27
US 51 43 8 70 19 66 257
Total 121 60 49 140 51 149 570

The data is pulled from François Husson's GitHub and covered in the tutorial series on YouTube. It looks at the distribution of Nobel Prize wins by category and country. Each cell contains the frequency of a particular combination of values of the two variables V_1 (Country) and V_2 (Prize category), and each combination is mutually exclusive.

Why Correspondence

"Correspondence analysis" (CA) result from the fact that tables are analysed by linking two corresponding sets: those by rows and by columns. There is no better illustration that the one that F. Husson give us by comparing ther row profiles with the column profiles as percentages: Introduction to Correspondence Analysis Video 1:

In [1]:
from IPython.display import Image
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_row_profile.PNG', width = 500)
Out[1]:
In [2]:
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_col_profile.PNG', width = 600)
Out[2]:
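
If you want to reproduce those profile tables yourself, here is a minimal sketch. It assumes a pandas DataFrame X holding the contingency table (X is built in the prince section further below):

row_profiles = X.div(X.sum(axis=1), axis=0) * 100  # each row sums to 100
col_profiles = X.div(X.sum(axis=0), axis=1) * 100  # each column sums to 100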

In a contingency table each cell holds the number of individuals possessing category i of V1 and category j of V2 (see image below). The sums of the table are the margins. As you might have noticed, it is hard and sometimes impossible to distinguish differences in the frequencies of combinations just by looking at the data. For that, correspondence analysis uses the $\chi^2$ statistic. We continue with the Nobel Prize data. Examining and visualizing the relationship between the variables in this dataset, or its absence, is as usual not easy.

In [3]:
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_contingency.png', width = 600)
Out[3]:

Recipe: take a two-way cross-table, ask a question, compute conditional probabilities, apply the $\chi^2$ test, and so on.

For the Nobel Prize data, $\chi^2$ helps to find differences in the wins by country in each Nobel Prize category. The $\chi^2$ test says how significantly the distribution of categories differs from one country to another.
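
Concretely, the statistic measures the gap between the observed counts and the counts expected under independence:

$ \chi^2 = \sum \limits _{i,j} \frac{\left(x_{ij} - \frac{x_{i•}x_{•j}}{n}\right)^2}{\frac{x_{i•}x_{•j}}{n}} $

where $ \frac{x_{i•}x_{•j}}{n} = nf_{i•}f_{•j} $ is the theoretical count in cell (i, j) under the independence model introduced below.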

Question to answer:

  • Is there a relationship between country and type of Nobel Prize: do some countries tend to get prizes in certain disciplines, or are prize categories more likely to go to one country than another?

The image above shows how a contingency table is built. The sums of the table along rows and down columns (the margins) are denoted by $x_{i•}$, $x_{•j}$ or $x_{••}$, depending on the dimension(s) over which the summation is carried out: $ x_{i•} = \sum \limits _{j=1} ^{J} x_{ij} $, $ x_{•j} = \sum \limits _{i=1} ^{I} x_{ij} $, $ n = x_{••} = \sum \limits _{i,j} x_{ij} $.

Conditional Probability. Probability Tables

In correspondence analysis, we calculate a probability table. In CA the studied sample is considered not for inference but as a population as a whole, so the term probability is justified regardless of the fact that it refers to an amount determined from a sample. The table of probabilities is calculated by dividing the value in each cell by n, and its general term is denoted $f_{ij}$:

$f_{ij} = x_{ij}/n $

The general term $f_{ij}$ is the joint probability of being in category i of V1 and in category j of V2. The marginal probabilities (marginal column and row) of this table are computed as follows:

$ f_{i•} = \sum \limits _{j=1} ^{J} f_{ij} $, $ f_{•j} = \sum \limits _{i=1} ^{I} f_{ij} $, $ f_{••} = \sum \limits _{i,j} f_{ij} $.

$ f_{••} $ always equals 1, as expected for a true probability distribution.

In [4]:
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_probabilities.PNG', width = 600)
Out[4]:

Independence Model

In most tables analysed with CA, we are fairly sure that a relationship between the variables exists. Conducting the $ \chi^2 $ test aims to evaluate the strength of that relationship. If we start from the absence of a relationship, we get the criterion for the independence model. The independence of two events looks as follows:

$P[A \text{ and } B]=P[A]P[B] $ or, the same in another notation: $P(A, B) = P(A)P(B) $.

The independence model is applicable to the categorical variables in a contingency table. Two events are independent if the probability that they both happen is the product of the probabilities that each one happens. Two categorical variables are considered independent if they verify:
$ ∀_{i}, ∀_{j}, f_{ij} = f_{i•}f_{•j} $

According to the independence model, the joint probability $f_{ij}$ depends on the marginal probabilities $ (f_{i•}$ and $f_{•j}) $ alone:

$\frac{f_{ij}}{f_{i•}} = f_{•j} $

$\frac{f_{ij}}{f_{•j}} = f_{i•} $

To illustrate this, we compare the actual sample sizes $(x_{ij} = nf_{ij})$ with the theoretical sample sizes characterised by the independence model $(nf_{i•}f_{•j})$.
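
As a minimal toy sketch (a made-up 2×2 table, not the Nobel data), the comparison looks like this:

import numpy as np

x = np.array([[20, 10],
              [10, 20]])            # observed counts x_ij
n = x.sum()                         # grand total x..
f = x / n                           # probability table f_ij
f_i = f.sum(axis=1)                 # row margins fi.
f_j = f.sum(axis=0)                 # column margins f.j
expected = n * np.outer(f_i, f_j)   # theoretical counts n * fi. * f.j
print(x - expected)                 # deviations from independence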

Python Code For CA instead of math

Above we spent some time looking at the math notation; here is the Python code for the same computations.

The probability matrix as well as the marginal probabilities for the Nobel Prize data are computed in Python as follows. The code is adapted from the article Introduction to Correspondence Analysis:

In [5]:
import pandas as pd
import numpy as np

data_ca = r'C:\thisAKcode.github.io\Pelican\content\other\CSV.csv'

df = pd.read_csv(data_ca, encoding='utf-8')
crosstab_orig = df.iloc[1:9, 1:-1].to_numpy()
countries = df.iloc[1:-1, 0].tolist()
prize_categories = df.columns[1:-1].tolist()

# xi. -- summing over the columns gives the row totals
row_totals = np.sum(crosstab_orig, axis=1)
# x.j -- summing over the rows gives the column totals
column_totals = np.sum(crosstab_orig, axis=0)
# x.. -- the grand total n
grand_total = np.sum(crosstab_orig)

# probability table f_ij = x_ij / n
proba_matrix = np.divide(crosstab_orig, grand_total)
proba_matrixX = pd.DataFrame(data=proba_matrix,
                             columns=pd.Series(prize_categories),
                             index=pd.Series(countries))
proba_grand_total = np.sum(proba_matrix)  # f.. equals one

# f.j -- the margin row holds the column sums at the bottom
margin_row = np.sum(proba_matrix, axis=0)
# fi. -- the margin column holds the row sums placed at the right
margin_col = np.sum(proba_matrix, axis=1)

# independence model fi.f.j as an outer product
independence_model = np.outer(margin_col, margin_row)
# theoretical counts n * fi. * f.j, where grand_total is n
inde_dot = (grand_total * independence_model).round(decimals=0)
independenceX = pd.DataFrame(data=independence_model,
                             columns=pd.Series(prize_categories),
                             index=pd.Series(countries))
In [48]:
# Here I would like to quickly show how to compute percentages for the cross table

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# column profiles: express each prize category column as percentages
crosstab_df = df.iloc[1:9, 1:-1]
pd_ = crosstab_df.apply(lambda x: x / sum(x) * 100)
# cumulative percentages down the rows, handy for a stacked bar plot
cumulative_percent = np.cumsum(pd_, axis=0)
pd_
Out[48]:
   Chemistry  Economic sciences  Literature    Medicin      Peace    Physics
1  19.834711           1.666667   16.326531  12.857143   9.803922  16.107383
2   3.305785           5.000000    4.081633   2.857143   1.960784   2.684564
3   6.611570           5.000000   22.448980   8.571429  19.607843   6.040268
4  19.008264          10.000000   14.285714  18.571429  21.568627  13.422819
5   0.826446           1.666667   12.244898   3.571429   1.960784   3.355705
6   4.958678           0.000000    4.081633   2.142857   1.960784   7.382550
7   3.305785           5.000000   10.204082   1.428571   5.882353   6.711409
8  42.148760          71.666667   16.326531  50.000000  37.254902  44.295302

The same quantities come right out of the box if you use scipy. For instance, you can get the expected frequencies under the independence model straight from the $\chi^2$ test.

In [26]:
from scipy.stats import chi2_contingency as chi2
statistic, prob, dof, indep_sample = chi2(crosstab_orig)

indep_sample
Out[26]:
array([[16.98245614,  8.42105263,  6.87719298, 19.64912281,  7.15789474,
        20.9122807 ],
       [ 3.82105263,  1.89473684,  1.54736842,  4.42105263,  1.61052632,
         4.70526316],
       [11.25087719,  5.57894737,  4.55614035, 13.01754386,  4.74210526,
        13.85438596],
       [19.74210526,  9.78947368,  7.99473684, 22.84210526,  8.32105263,
        24.31052632],
       [ 4.03333333,  2.        ,  1.63333333,  4.66666667,  1.7       ,
         4.96666667],
       [ 4.88245614,  2.42105263,  1.97719298,  5.64912281,  2.05789474,
         6.0122807 ],
       [ 5.73157895,  2.84210526,  2.32105263,  6.63157895,  2.41578947,
         7.05789474],
       [54.55614035, 27.05263158, 22.09298246, 63.12280702, 22.99473684,
        67.18070175]])
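
As a quick sanity check, the statistic returned above is just the sum of squared deviations between the observed counts and the independence model, scaled by the expected counts:

manual_statistic = np.sum((crosstab_orig - indep_sample) ** 2 / indep_sample)
# manual_statistic reproduces `statistic` from chi2_contingency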

Use prince to do CA

Prince is a well-known library for factor analysis: https://pypi.org/project/prince/. Below you can see the Nobel Prize dataset analysed using prince.

In [27]:
import pandas as pd
import prince

pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))


df = pd.read_csv(data_ca, encoding='utf-8')
crosstab_orig = df.iloc[1:9, 1:-1].to_numpy()
countries = df.iloc[1:-1, 0].tolist()
prize_categories = df.columns[1:-1].tolist()

# Correspondence analysis (CA), following https://pypi.org/project/prince/
X = pd.DataFrame(data=crosstab_orig,
                 columns=pd.Series(prize_categories),
                 index=pd.Series(countries))

Zoom in on the independence model

Below you can see how the input data looks compared to the values calculated for the independence model. First, look at the actual input data.

In [28]:
X.astype(int)
Out[28]:
Chemistry Economic sciences Literature Medicin Peace Physics
Germany 24 1 8 18 5 24
Canada 4 3 2 4 1 4
France 8 3 11 12 10 9
UK 23 6 7 26 11 20
Italy 1 1 6 5 1 5
Japan 6 0 2 3 1 11
Russia 4 3 5 2 3 10
US 51 43 8 70 19 66

Below, the hypothetical values under the independence model show a distribution of prizes that does not differ from one country to another. Here the probabilities were calculated as the product of the marginal probabilities.

In [7]:
(independenceX*grand_total).astype(int)
Out[7]:
Chemistry Economic sciences Literature Medicin Peace Physics
Germany 16 8 6 19 7 20
Canada 3 1 1 4 1 4
France 11 5 4 13 4 13
UK 19 9 7 22 8 24
Italy 4 2 1 4 1 4
Japan 4 2 1 5 2 6
Russia 5 2 2 6 2 7
US 54 27 22 63 22 67

We still face the problem of being unable to spot the difference between those two tables: is there a meaningful difference between the observed values and the independence model? That's where CA resorts to the $\chi^2$ test.

Use Chi-square to detect a deviation from independence

Prince comes in

In [8]:
ca = prince.CA(n_components=2,
     n_iter=3,
     copy=True,
     check_input=True,
     engine='auto',
     random_state=42)
X.columns.rename('Category', inplace=True)
X.index.rename('Country', inplace=True)
ca = ca.fit(X)
In [9]:
ca.row_coordinates(X)
Out[9]:
0 1
Germany -0.158403 0.319637
Canada 0.047235 -0.113463
France -0.498328 -0.273206
UK -0.035672 0.040835
Italy -0.712666 -0.265207
Japan -0.175512 0.515635
Russia -0.356149 -0.048568
US 0.267488 -0.071423
In [10]:
ca.column_coordinates(X)
Out[10]:
0 1
Chemistry 0.058167 0.211959
Economic sciences 0.461605 -0.351199
Literature -0.789820 -0.185986
Medicin 0.107692 -0.066187
Peace -0.203421 -0.207777
Physics -0.004938 0.163765
In [11]:
ax = ca.plot_coordinates(
    X=X,
     ax=None,
     figsize=(6, 6),
     x_component=0,
     y_component=1,
     show_row_labels=True,
     show_col_labels=True
 )
In [12]:
ca.eigenvalues_

ca.total_inertia_


ca.explained_inertia_
Out[12]:
[0.5474785470374058, 0.24599753837855615]
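
A handy cross-check: in CA the total inertia equals the $\chi^2$ statistic divided by n. A minimal sketch, assuming `statistic` and `grand_total` from the cells above:

phi2 = statistic / grand_total  # total inertia, often written phi squared
# phi2 should match ca.total_inertia_ reported by prince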
In [35]:
len(ca.eigenvalues_)
Out[35]:
2

Core of CA

Now that the independence model is computed, we are able to put it side by side with the observed data. In the independence model, the values in the rows are distributed proportionally to the totals, and so are the values in the columns.
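
You can verify this proportionality with the variables from the code cell above (a minimal check, assuming the names from In [5]):

# dividing each row of the independence model by its row margin fi.
profile = independence_model / margin_col[:, None]
# every row of `profile` is identical and equals margin_row (f.j)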

A Few Other Examples of Input Data

A few examples of contingency tables that you can use for applying correspondence analysis may be found here.

Table 1: Hair color and eye color

Hair Color Fair Red Medium Dark Black Total
Eye Color
Light 688 116 584 188 4 1580
Blue 326 38 241 110 3 718
Medium 343 84 909 412 26 1774
Dark 98 48 403 681 85 1315
Total 1455 286 2137 1391 118 5387

You see school children classified according to two discrete variables, eye color and hair color. This dataset is from web.stanford.edu: Hair Color and Eye Color.

Table 2: Fashion brand perception

brand luxurious traditional intellectual brilliant calm youthful friendly simple energetic Total
Chanel 449 252 106 236 61 13 8 29 16 1170
Louis Vuitton 410 286 83 142 80 18 20 48 31 1118
Christian Dior 356 200 95 206 67 19 18 27 9 997
Tiffany 362 219 103 187 59 55 36 35 10 1066
Rolex 442 248 114 89 109 4 9 52 12 1079
Burberry 287 287 143 42 199 29 67 124 9 1187
Ralph Lauren 198 191 101 39 147 61 70 100 9 916
Benetton 86 62 31 88 35 216 97 65 21 701
Uniqlo 6 7 10 8 23 260 331 199 291 1135
H&M 8 5 10 2 10 272 132 91 223 753
GAP 10 10 10 9 24 275 203 137 84 762
Total 2614 1767 806 1048 814 1222 991 907 715 10884

You see brand perception classified according to two qualitative variables, brand and associated impression. This dataset is from the GitHub repo by okomestudio: Fashion brands.
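
Either table can be run through the same prince pipeline. A minimal sketch with the hair and eye color counts typed in by hand (the Total row and column are left out):

import pandas as pd
import prince

hair_eye = pd.DataFrame(
    [[688, 116, 584, 188, 4],
     [326, 38, 241, 110, 3],
     [343, 84, 909, 412, 26],
     [98, 48, 403, 681, 85]],
    index=['Light', 'Blue', 'Medium', 'Dark'],
    columns=['Fair', 'Red', 'Medium', 'Dark', 'Black'])

ca_hair = prince.CA(n_components=2).fit(hair_eye)
ca_hair.row_coordinates(hair_eye)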
