Correspondence Analysis

Introduction

An introduction to correspondence analysis can be found in the book "Exploratory Multivariate Analysis by Example Using R" by François Husson. It is worth reading and, along with the tutorial series on YouTube, it is a great source that can help make the basics of correspondence analysis clear to you.

If you prefer Python, two sources to start with for Correspondence Analysis (CA) are Introduction to Correspondence Analysis and Prince - Python factor analysis library.

Main Idea

When working with a dataset of n items described by two categorical variables V_1 and V_2, you need to find their interconnectedness. Technically, you need to calculate how far the relationship between V_1 and V_2 is from independence. The data for this analysis is prepared as a "contingency table", including the sum of terms for each row (the row margin) and the sum of terms for each column (the column margin).

Main Table: Nobel Prize winners by country and category

Chemistry Economic sciences Literature Medicin Peace Physics Total
Country
Germany 24 1 8 18 5 24 80
Canada 4 3 2 4 1 4 18
France 8 3 11 12 10 9 53
UK 23 6 7 26 11 20 93
Italy 1 1 6 5 1 5 19
Japan 6 0 2 3 1 11 23
Russia 4 3 5 2 3 10 27
US 51 43 8 70 19 66 257
Total 121 60 49 140 51 149 570

The data is pulled from François Husson's GitHub and covered in the tutorial series on YouTube. It looks at the distribution of Nobel Prize wins by category and country. Each cell contains the frequency of a particular combination of values of the two variables V_1 (Country) and V_2 (Prize category), and each combination is mutually exclusive.

Why Correspondence

"Correspondence analysis" (CA) result from the fact that tables are analysed by linking two corresponding sets: those by rows and by columns. There is no better illustration that the one that F. Husson give us by comparing ther row profiles with the column profiles as percentages: Introduction to Correspondence Analysis Video 1:

In [1]:
from IPython.display import Image
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_row_profile.PNG', width = 500)
Out[1]:
In [2]:
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_col_profile.PNG', width = 600)
Out[2]:
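
If you want to reproduce those profile tables yourself, here is a minimal sketch. It assumes a pandas DataFrame X holding the contingency table (X is built in the prince section further below):

row_profiles = X.div(X.sum(axis=1), axis=0) * 100  # each row sums to 100
col_profiles = X.div(X.sum(axis=0), axis=1) * 100  # each column sums to 100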

In a contingency table each cell holds the number of individuals possessing category i of V1 and category j of V2 (see image below). The sums of the table are the margins. As you might have noticed, it is hard and sometimes impossible to distinguish differences in the frequencies of combinations just by looking at the data. For that, correspondence analysis uses the $\chi^2$ statistic. We continue with the Nobel Prize data. Examining and visualizing the relationship between the variables in this dataset, or its absence, is as usual not easy.

In [3]:
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_contingency.png', width = 600)
Out[3]:

Recipe: take a two-way cross-table, ask a question, compute conditional probabilities, apply the $\chi^2$ test, and so on.

For the Nobel Prize data, $\chi^2$ helps to find differences in the wins by country in each Nobel Prize category. The $\chi^2$ test says how significantly the distribution of categories differs from one country to another.
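
Concretely, the statistic measures the gap between the observed counts and the counts expected under independence:

$ \chi^2 = \sum \limits _{i,j} \frac{\left(x_{ij} - \frac{x_{i•}x_{•j}}{n}\right)^2}{\frac{x_{i•}x_{•j}}{n}} $

where $ \frac{x_{i•}x_{•j}}{n} = nf_{i•}f_{•j} $ is the theoretical count in cell (i, j) under the independence model introduced below.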

Question to answer:

  • Is there a relationship between country and type of Nobel Prize: do some countries tend to get prizes in certain disciplines, or are prize categories more likely to go to one country than another?

The image above shows how a contingency table is built. The sums of the table along rows and down columns (the margins) are denoted by $x_{i•}$, $x_{•j}$ or $x_{••}$, depending on the dimension(s) over which the summation is carried out: $ x_{i•} = \sum \limits _{j=1} ^{J} x_{ij} $, $ x_{•j} = \sum \limits _{i=1} ^{I} x_{ij} $, $ n = x_{••} = \sum \limits _{i,j} x_{ij} $.

Conditional Probability. Probability Tables

In correspondence analysis, we calculate a probability table. In CA the studied sample is considered not for inference but as a population as a whole, so the term probability is justified regardless of the fact that it refers to an amount determined from a sample. The table of probabilities is calculated by dividing the value in each cell by n, and its general term is denoted $f_{ij}$:

$f_{ij} = x_{ij}/n $

The general term $f_{ij}$ is the joint probability of being in category i of V1 and in category j of V2. The marginal probabilities (marginal column and row) of this table are computed as follows:

$ f_{i•} = \sum \limits _{j=1} ^{J} f_{ij} $, $ f_{•j} = \sum \limits _{i=1} ^{I} f_{ij} $, $ f_{••} = \sum \limits _{i,j} f_{ij} $.

$ f_{••} $ always equals 1, as expected for a true probability distribution.

In [4]:
Image(r'C:\thisAKcode.github.io\Pelican\content\images\ca_probabilities.PNG', width = 600)
Out[4]:

Independence Model

In most tables analysed with CA, we are fairly sure that a relationship between the variables exists. Conducting the $ \chi^2 $ test aims to evaluate the strength of that relationship. If we start from the absence of a relationship, we get the criterion for the independence model. The independence of two events looks as follows:

$P[A \text{ and } B]=P[A]P[B] $ or, the same in another notation: $P(A, B) = P(A)P(B) $.

The independence model is applicable to the categorical variables in a contingency table. Two events are independent if the probability that they both happen is the product of the probabilities that each one happens. Two categorical variables are considered independent if they verify:
$ ∀_{i}, ∀_{j}, f_{ij} = f_{i•}f_{•j} $

According to the independence model, the joint probability $f_{ij}$ depends on the marginal probabilities $ (f_{i•}$ and $f_{•j}) $ alone:

$\frac{f_{ij}}{f_{i•}} = f_{•j} $

$\frac{f_{ij}}{f_{•j}} = f_{i•} $

To illustrate this, we compare the actual sample sizes $(x_{ij} = nf_{ij})$ with the theoretical sample sizes characterised by the independence model $(nf_{i•}f_{•j})$.
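
As a minimal toy sketch (a made-up 2×2 table, not the Nobel data), the comparison looks like this:

import numpy as np

x = np.array([[20, 10],
              [10, 20]])            # observed counts x_ij
n = x.sum()                         # grand total x..
f = x / n                           # probability table f_ij
f_i = f.sum(axis=1)                 # row margins fi.
f_j = f.sum(axis=0)                 # column margins f.j
expected = n * np.outer(f_i, f_j)   # theoretical counts n * fi. * f.j
print(x - expected)                 # deviations from independence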

Python Code For CA instead of math

Above we spent some time looking at the math notation; here is the Python code for the same computations.

The probability matrix as well as the marginal probabilities for the Nobel Prize data are computed in Python as follows. The code is adapted from the article Introduction to Correspondence Analysis:

In [5]:
import pandas as pd
import numpy as np

data_ca = r'C:\thisAKcode.github.io\Pelican\content\other\CSV.csv'

df = pd.read_csv(data_ca, encoding='utf-8')
crosstab_orig = df.iloc[1:9, 1:-1].to_numpy()
countries = df.iloc[1:-1, 0].tolist()
prize_categories = df.columns[1:-1].tolist()

# xi. -- summing over the columns gives the row totals
row_totals = np.sum(crosstab_orig, axis=1)
# x.j -- summing over the rows gives the column totals
column_totals = np.sum(crosstab_orig, axis=0)
# x.. -- the grand total n
grand_total = np.sum(crosstab_orig)

# probability table f_ij = x_ij / n
proba_matrix = np.divide(crosstab_orig, grand_total)
proba_matrixX = pd.DataFrame(data=proba_matrix,
                             columns=pd.Series(prize_categories),
                             index=pd.Series(countries))
proba_grand_total = np.sum(proba_matrix)  # f.. equals one

# f.j -- the margin row holds the column sums at the bottom
margin_row = np.sum(proba_matrix, axis=0)
# fi. -- the margin column holds the row sums placed at the right
margin_col = np.sum(proba_matrix, axis=1)

# independence model fi.f.j as an outer product
independence_model = np.outer(margin_col, margin_row)
# theoretical counts n * fi. * f.j, where grand_total is n
inde_dot = (grand_total * independence_model).round(decimals=0)
independenceX = pd.DataFrame(data=independence_model,
                             columns=pd.Series(prize_categories),
                             index=pd.Series(countries))
In [48]:
# Here I would like to quickly show how to compute percentages for the cross table

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# column profiles: express each prize category column as percentages
crosstab_df = df.iloc[1:9, 1:-1]
pd_ = crosstab_df.apply(lambda x: x / sum(x) * 100)
# cumulative percentages down the rows, handy for a stacked bar plot
cumulative_percent = np.cumsum(pd_, axis=0)
pd_
Out[48]:
   Chemistry  Economic sciences  Literature    Medicin      Peace    Physics
1  19.834711           1.666667   16.326531  12.857143   9.803922  16.107383
2   3.305785           5.000000    4.081633   2.857143   1.960784   2.684564
3   6.611570           5.000000   22.448980   8.571429  19.607843   6.040268
4  19.008264          10.000000   14.285714  18.571429  21.568627  13.422819
5   0.826446           1.666667   12.244898   3.571429   1.960784   3.355705
6   4.958678           0.000000    4.081633   2.142857   1.960784   7.382550
7   3.305785           5.000000   10.204082   1.428571   5.882353   6.711409
8  42.148760          71.666667   16.326531  50.000000  37.254902  44.295302

The same quantities come right out of the box if you use scipy. For instance, you can get the expected frequencies under the independence model straight from the $\chi^2$ test.

In [26]:
from scipy.stats import chi2_contingency as chi2
statistic, prob, dof, indep_sample = chi2(crosstab_orig)

indep_sample
Out[26]:
array([[16.98245614,  8.42105263,  6.87719298, 19.64912281,  7.15789474,
        20.9122807 ],
       [ 3.82105263,  1.89473684,  1.54736842,  4.42105263,  1.61052632,
         4.70526316],
       [11.25087719,  5.57894737,  4.55614035, 13.01754386,  4.74210526,
        13.85438596],
       [19.74210526,  9.78947368,  7.99473684, 22.84210526,  8.32105263,
        24.31052632],
       [ 4.03333333,  2.        ,  1.63333333,  4.66666667,  1.7       ,
         4.96666667],
       [ 4.88245614,  2.42105263,  1.97719298,  5.64912281,  2.05789474,
         6.0122807 ],
       [ 5.73157895,  2.84210526,  2.32105263,  6.63157895,  2.41578947,
         7.05789474],
       [54.55614035, 27.05263158, 22.09298246, 63.12280702, 22.99473684,
        67.18070175]])
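
As a quick sanity check, the statistic returned above is just the sum of squared deviations between the observed counts and the independence model, scaled by the expected counts:

manual_statistic = np.sum((crosstab_orig - indep_sample) ** 2 / indep_sample)
# manual_statistic reproduces `statistic` from chi2_contingency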

Use prince to do CA

Prince is a well-known library for factor analysis: https://pypi.org/project/prince/. Below you can see the Nobel Prize dataset analysed using prince.

In [27]:
import pandas as pd
import prince

pd.set_option('display.float_format', lambda x: '{:.6f}'.format(x))


df = pd.read_csv(data_ca, encoding='utf-8')
crosstab_orig = df.iloc[1:9, 1:-1].to_numpy()
countries = df.iloc[1:-1, 0].tolist()
prize_categories = df.columns[1:-1].tolist()

# Correspondence analysis (CA), following https://pypi.org/project/prince/
X = pd.DataFrame(data=crosstab_orig,
                 columns=pd.Series(prize_categories),
                 index=pd.Series(countries))

Zoom in on the independence model

Below you can see how the input data looks compared to the values calculated for the independence model. First, look at the actual input data.

In [28]:
X.astype(int)
Out[28]:
Chemistry Economic sciences Literature Medicin Peace Physics
Germany 24 1 8 18 5 24
Canada 4 3 2 4 1 4
France 8 3 11 12 10 9
UK 23 6 7 26 11 20
Italy 1 1 6 5 1 5
Japan 6 0 2 3 1 11
Russia 4 3 5 2 3 10
US 51 43 8 70 19 66

Below, the hypothetical values under the independence model show a distribution of prizes that does not differ from one country to another. Here the probabilities were calculated as the product of the marginal probabilities.

In [7]:
(independenceX*grand_total).astype(int)
Out[7]:
Chemistry Economic sciences Literature Medicin Peace Physics
Germany 16 8 6 19 7 20
Canada 3 1 1 4 1 4
France 11 5 4 13 4 13
UK 19 9 7 22 8 24
Italy 4 2 1 4 1 4
Japan 4 2 1 5 2 6
Russia 5 2 2 6 2 7
US 54 27 22 63 22 67

We still face the problem of being unable to spot the difference between those two tables: is there a meaningful difference between the observed values and the independence model? That's where CA resorts to the $\chi^2$ test.

Use Chi-square to detect a deviation from independence

Prince comes in

In [8]:
ca = prince.CA(n_components=2,
     n_iter=3,
     copy=True,
     check_input=True,
     engine='auto',
     random_state=42)
X.columns.rename('Category', inplace=True)
X.index.rename('Country', inplace=True)
ca = ca.fit(X)
In [9]:
ca.row_coordinates(X)
Out[9]:
0 1
Germany -0.158403 0.319637
Canada 0.047235 -0.113463
France -0.498328 -0.273206
UK -0.035672 0.040835
Italy -0.712666 -0.265207
Japan -0.175512 0.515635
Russia -0.356149 -0.048568
US 0.267488 -0.071423
In [10]:
ca.column_coordinates(X)
Out[10]:
0 1
Chemistry 0.058167 0.211959
Economic sciences 0.461605 -0.351199
Literature -0.789820 -0.185986
Medicin 0.107692 -0.066187
Peace -0.203421 -0.207777
Physics -0.004938 0.163765
In [11]:
ax = ca.plot_coordinates(
    X=X,
     ax=None,
     figsize=(6, 6),
     x_component=0,
     y_component=1,
     show_row_labels=True,
     show_col_labels=True
 )
In [12]:
ca.eigenvalues_

ca.total_inertia_


ca.explained_inertia_
Out[12]:
[0.5474785470374058, 0.24599753837855615]
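
A handy cross-check: in CA the total inertia equals the $\chi^2$ statistic divided by n. A minimal sketch, assuming `statistic` and `grand_total` from the cells above:

phi2 = statistic / grand_total  # total inertia, often written phi squared
# phi2 should match ca.total_inertia_ reported by prince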
In [35]:
len(ca.eigenvalues_)
Out[35]:
2

Core of CA

Now that the independence model is computed, we are able to put it side by side with the observed data. In the independence model, the values in the rows are distributed proportionally to the totals, and so are the values in the columns.
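
You can verify this proportionality with the variables from the code cell above (a minimal check, assuming the names from In [5]):

# dividing each row of the independence model by its row margin fi.
profile = independence_model / margin_col[:, None]
# every row of `profile` is identical and equals margin_row (f.j)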

A Few Other Examples of Input Data

A few examples of contingency tables that you can use for applying correspondence analysis may be found here.

Table 1: Hair color and eye color

Hair Color Fair Red Medium Dark Black Total
Eye Color
Light 688 116 584 188 4 1580
Blue 326 38 241 110 3 718
Medium 343 84 909 412 26 1774
Dark 98 48 403 681 85 1315
Total 1455 286 2137 1391 118 5387

You see school children classified according to two discrete variables, eye color and hair color. This dataset is from web.stanford.edu: Hair Color and Eye Color.

Table 2: Fashion brand perception

brand luxurious traditional intellectual brilliant calm youthful friendly simple energetic Total
Chanel 449 252 106 236 61 13 8 29 16 1170
Louis Vuitton 410 286 83 142 80 18 20 48 31 1118
Christian Dior 356 200 95 206 67 19 18 27 9 997
Tiffany 362 219 103 187 59 55 36 35 10 1066
Rolex 442 248 114 89 109 4 9 52 12 1079
Burberry 287 287 143 42 199 29 67 124 9 1187
Ralph Lauren 198 191 101 39 147 61 70 100 9 916
Benetton 86 62 31 88 35 216 97 65 21 701
Uniqlo 6 7 10 8 23 260 331 199 291 1135
H&M 8 5 10 2 10 272 132 91 223 753
GAP 10 10 10 9 24 275 203 137 84 762
Total 2614 1767 806 1048 814 1222 991 907 715 10884

You see brand perception classified according to two qualitative variables, brand and associated impression. This dataset is from the GitHub repo by okomestudio: Fashion brands.
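
Either table can be run through the same prince pipeline. A minimal sketch with the hair and eye color counts typed in by hand (the Total row and column are left out):

import pandas as pd
import prince

hair_eye = pd.DataFrame(
    [[688, 116, 584, 188, 4],
     [326, 38, 241, 110, 3],
     [343, 84, 909, 412, 26],
     [98, 48, 403, 681, 85]],
    index=['Light', 'Blue', 'Medium', 'Dark'],
    columns=['Fair', 'Red', 'Medium', 'Dark', 'Black'])

ca_hair = prince.CA(n_components=2).fit(hair_eye)
ca_hair.row_coordinates(hair_eye)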
