Pandas Useful Tricks

Introduction

Here I want to write a little about pandas, which I play with from time to time, and share some useful pandas methods. Note that the input data comes from the pydataset module, which ships with a bunch of popular datasets.

In [30]:
import pandas as pd
from pydataset import data

type(data)  # data is a function that loads a dataset by name

_input = data('HairEyeColor')

Data Structures

There are two main structured ways to manipulate data in pandas: the Series and the DataFrame.

Pandas Series Object

A Series is data wrapped in a one-dimensional array with indexed items, the indices being of type int by default (but str also works).

Slicing works with square-bracket notation, and the .values attribute returns the underlying NumPy array.

data = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
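
A minimal sketch of both points, using the Series just defined:

data['a':'b']   # label-based slicing is inclusive of the stop label
data.values     # the underlying NumPy array: array([1, 2, 3])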

A curious application is viewing a Series as a specialization of a Python dictionary:

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

Pandas DataFrame Object

A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
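
For instance, building on the population Series above, a DataFrame can be constructed from a dictionary of Series (the area figures below are only illustrative):

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})  # illustrative areas in km^2
states = pd.DataFrame({'population': population, 'area': area})
states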

Loading the Data into df

In this example I will manipulate data that comes from an additional dependency: scikit-learn.

Inspecting the Data
In [31]:
from sklearn import datasets
iris = datasets.load_iris()

print(iris.data[:10])
print(iris.target)
print(iris.DESCR[:1000])
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ===========
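
The iris object above is a scikit-learn Bunch, not yet a DataFrame; a minimal sketch to get it into one (the column names come from iris.feature_names):

df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target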

There are more datasets to play with in scikit-learn and SciPy.

from sklearn.datasets import load_boston, make_classification, load_iris

data_boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2
data_iris = load_iris()

Otherwise, sklearn has a utility to generate synthetic classification datasets.

from sklearn.datasets import make_classification
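
A minimal usage sketch (the parameter values here are just an example):

X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_classes=2,
                           random_state=42)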

Common Operations on a DataFrame

Slice df

Given a df you can get a portion of it with .iloc like this:

sub_df = df.iloc[start_row:end_row, start_col:end_col]

You provide integer positions for the rows and columns.

When the input looks like this:

In [35]:
_input.head(7)
Out[35]:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11
6 Brown Blue Male 50
7 Red Blue Male 10

Getting a subpart of the df is as simple as:

In [36]:
df_sliced = _input.iloc[0:5, 0:2]  # row positions first, then column positions
df_sliced
Out[36]:
Hair Eye
1 Black Brown
2 Brown Brown
3 Red Brown
4 Blond Brown
5 Black Blue

Transpose df

Given a DataFrame df, you can transpose a portion of it like this:

df_transposed = sub_df.T
# or
df_transposed = sub_df.transpose()

This returns a new df. Note that .T is an attribute, not a method, and transpose() has no inplace parameter; to keep the transposed version, reassign it:

sub_df = sub_df.T

Applied to the sliced df from above, switching rows with columns:

In [37]:
df_T = df_sliced.T
df_T
Out[37]:
1 2 3 4 5
Hair Black Brown Red Blond Black
Eye Brown Brown Brown Brown Blue
Create Contingency Table

The columns representing Hair and Eye color can be used to make a cross-table. Going from the HairEyeColor table of individuals to group counts is done as follows.

In [38]:
_input= data('HairEyeColor')

contingency_ds = pd.crosstab(_input["Hair"], _input["Eye"])

contingency_ds
Out[38]:
Eye Blue Brown Green Hazel
Hair
Black 2 2 2 2
Blond 2 2 2 2
Brown 2 2 2 2
Red 2 2 2 2
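
Every cell equals 2 because crosstab counts rows, and each Hair/Eye pair appears exactly twice here (once per Sex). To sum the actual frequencies instead, pass values and aggfunc:

pd.crosstab(_input["Hair"], _input["Eye"],
            values=_input["Freq"], aggfunc="sum")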

Pivot Table

A pivot table is useful for making a report on relevant features: you aggregate the values of one column, grouped by selected index and column keys.

In [39]:
_input.head()
Out[39]:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11
In [40]:
pivot = _input.pivot_table(index='Hair', columns=['Sex','Eye'], values='Freq')

pivot
Out[40]:
Sex Female Male
Eye Blue Brown Green Hazel Blue Brown Green Hazel
Hair
Black 9 36 2 5 11 32 3 10
Blond 64 4 8 5 30 3 8 5
Brown 34 66 14 29 50 53 15 25
Red 7 16 7 7 10 10 7 7
In [10]:
pivot2 = _input.pivot_table(index='Hair', columns=['Sex'], values='Freq')

pivot2
Out[10]:
Sex Female Male
Hair
Black 13.00 14.00
Blond 20.25 11.50
Brown 35.75 35.75
Red 9.25 8.50
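
Note that pivot_table aggregates with the mean by default, which is why pivot2 shows Freq averaged across eye colors rather than totals. Pass aggfunc to get sums instead:

pivot_sum = _input.pivot_table(index='Hair', columns=['Sex'],
                               values='Freq', aggfunc='sum')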

apply() a Custom Function
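
apply() runs an arbitrary function over a df's columns (or rows, with axis=1). A minimal sketch on the pivot2 table from above, scaling each column to fractions of its total:

pivot2.apply(lambda col: col / col.sum())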

Groupby

To use groupby in pandas to get the count of each category, you can run the following code on dummy data:

In [13]:
import pandas as pd
import numpy as np

# categories
fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']

# total of 100 items
num_items = 100

# Randomly choose the number of items for each fruit type
counts = np.random.choice(range(1, num_items + 1), len(fruits), replace=False)
counts = counts / counts.sum() * num_items

# Create the data dictionary
data_dict = {'group': [], 'attribute': []}  # avoid the name `data`: it would shadow pydataset's data()

# Populate the data dictionary with random-sized groups
for count, fruit in zip(counts, fruits):
    data_dict['group'].extend(np.repeat(fruit, int(count)))
    data_dict['attribute'].extend(np.tile(fruit, int(count)))

df = pd.DataFrame(data_dict)
# shuffle the DataFrame
df = df.sample(frac=1).reset_index(drop=True)

# Apply groupby and get the count of each type of fruit
fruit_counts = df.groupby('attribute').size().reset_index(name='count')
print(fruit_counts)
    attribute  count
0       apple     35
1      banana     42
2      cherry     10
3        date     10
4  elderberry      1
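
For a plain count per category, value_counts() gives the same numbers with less ceremony:

df['attribute'].value_counts()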

Filter a DataFrame

Use boolean indexing to filter a df based on particular values in a specific column:

In [3]:
import pandas as pd

# 'df' is the fruit DataFrame built above; we filter its 'attribute'
# column on two desired values

desired_value, desired_value2 = 'apple', 'banana'  # replace with the values you want to filter on
filtered_df = df[(df['attribute'] == desired_value) | (df['attribute'] == desired_value2)]

# 'filtered_df' now contains only the rows where 'attribute' is 'apple' or 'banana'
filtered_df
Out[3]:
group attribute
1 apple apple
8 apple apple
10 banana banana
12 banana banana
16 apple apple
17 banana banana
22 apple apple
31 apple apple
38 banana banana
40 apple apple
43 apple apple
45 apple apple
46 apple apple
56 banana banana
58 apple apple
60 apple apple
62 apple apple
63 banana banana
65 apple apple
73 apple apple
75 apple apple
76 apple apple
79 apple apple
85 apple apple
91 banana banana
93 banana banana
95 banana banana
In [6]:
filtered_df.attribute.value_counts()
Out[6]:
apple     18
banana     9
Name: attribute, dtype: int64
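
An equivalent, more scalable spelling for this kind of filter uses .isin():

filtered_df = df[df['attribute'].isin(['apple', 'banana'])]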

pd.get_dummies

Of course I had to play with the Titanic dataset. Wanting to split the class attribute into separate columns for first, second, and third class, I used pandas' get_dummies() function to create binary columns for each class. Then I added these binary columns to the original df.
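
A minimal sketch, assuming a df_titanic with a 'class' column holding the passenger class (the names df_titanic and 'class' are assumptions; they vary between versions of the Titanic data):

# 'class' is assumed to hold values like 'First', 'Second', 'Third'
class_dummies = pd.get_dummies(df_titanic['class'], prefix='class')
df_titanic = pd.concat([df_titanic, class_dummies], axis=1)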
