Pandas Useful Tricks

Introduction

Here I want to write a little about pandas, which I play with from time to time, and share some useful pandas methods. Note that the input data comes from the pydataset module, which ships with a bunch of popular datasets.

In [30]:
import pandas as pd
from pydataset import data

type(data)  # data is a function that loads a dataset by name

_input = data('HairEyeColor')

Data Structures

There are two main structured ways to manipulate data in pandas: the Series and the DataFrame.

Pandas Series Object

A Series is data wrapped in a one-dimensional array with indexed items, the indices being of type int by default (but str also works).

Slicing works with square-bracket notation, and the .values attribute returns the underlying NumPy array.

data = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
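
A minimal sketch of both points, using the Series just defined:

data['a':'b']   # label-based slicing is inclusive of the stop label
data.values     # the underlying NumPy array: array([1, 2, 3])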

A curious application is viewing a Series as a specialization of a Python dictionary:

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

Pandas DataFrame Object

A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
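
For instance, building on the population Series above, a DataFrame can be constructed from a dictionary of Series (the area figures below are only illustrative):

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})  # illustrative areas in km^2
states = pd.DataFrame({'population': population, 'area': area})
states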

Loading the Data into df

In this example I will manipulate data that comes from an additional dependency: scikit-learn.

Inspecting the Data
In [31]:
from sklearn import datasets
iris = datasets.load_iris()

print(iris.data[:10])
print(iris.target)
print(iris.DESCR[:1000])
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ===========
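
The iris object above is a scikit-learn Bunch, not yet a DataFrame; a minimal sketch to get it into one (the column names come from iris.feature_names):

df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target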

There are more datasets to play with in scikit-learn and SciPy.

from sklearn.datasets import load_boston, make_classification, load_iris

data_boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2
data_iris = load_iris()

Otherwise, sklearn has a utility to generate synthetic classification datasets.

from sklearn.datasets import make_classification
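
A minimal usage sketch (the parameter values here are just an example):

X, y = make_classification(n_samples=100, n_features=4,
                           n_informative=2, n_classes=2,
                           random_state=42)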

Common Operations on a DataFrame

Slice df

Given a df you can get a portion of it with .iloc like this:

sub_df = df.iloc[start_row:end_row, start_col:end_col]

You provide integer positions for the rows and columns.

When the input looks like this:

In [35]:
_input.head(7)
Out[35]:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11
6 Brown Blue Male 50
7 Red Blue Male 10

Getting a subpart of the df is as simple as:

In [36]:
df_sliced = _input.iloc[0:5, 0:2]  # row positions first, then column positions
df_sliced
Out[36]:
Hair Eye
1 Black Brown
2 Brown Brown
3 Red Brown
4 Blond Brown
5 Black Blue

Transpose df

Given a DataFrame df, you can transpose a portion of it like this:

df_transposed = sub_df.T
# or
df_transposed = sub_df.transpose()

This returns a new df. Note that .T is an attribute, not a method, and transpose() has no inplace parameter; to keep the transposed version, reassign it:

sub_df = sub_df.T

Applied to the sliced df from above, switching rows with columns:

In [37]:
df_T = df_sliced.T
df_T
Out[37]:
1 2 3 4 5
Hair Black Brown Red Blond Black
Eye Brown Brown Brown Brown Blue
Create Contingency Table

The columns representing Hair and Eye color can be used to make a cross-table. Going from the HairEyeColor table of individuals to group counts is done as follows.

In [38]:
_input= data('HairEyeColor')

contingency_ds = pd.crosstab(_input["Hair"], _input["Eye"])

contingency_ds
Out[38]:
Eye Blue Brown Green Hazel
Hair
Black 2 2 2 2
Blond 2 2 2 2
Brown 2 2 2 2
Red 2 2 2 2
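
Every cell equals 2 because crosstab counts rows, and each Hair/Eye pair appears exactly twice here (once per Sex). To sum the actual frequencies instead, pass values and aggfunc:

pd.crosstab(_input["Hair"], _input["Eye"],
            values=_input["Freq"], aggfunc="sum")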

Pivot Table

A pivot table is useful for making a report on relevant features: you aggregate the values of one column, grouped by selected index and column keys.

In [39]:
_input.head()
Out[39]:
Hair Eye Sex Freq
1 Black Brown Male 32
2 Brown Brown Male 53
3 Red Brown Male 10
4 Blond Brown Male 3
5 Black Blue Male 11
In [40]:
pivot = _input.pivot_table(index='Hair', columns=['Sex','Eye'], values='Freq')

pivot
Out[40]:
Sex Female Male
Eye Blue Brown Green Hazel Blue Brown Green Hazel
Hair
Black 9 36 2 5 11 32 3 10
Blond 64 4 8 5 30 3 8 5
Brown 34 66 14 29 50 53 15 25
Red 7 16 7 7 10 10 7 7
In [10]:
pivot2 = _input.pivot_table(index='Hair', columns=['Sex'], values='Freq')

pivot2
Out[10]:
Sex Female Male
Hair
Black 13.00 14.00
Blond 20.25 11.50
Brown 35.75 35.75
Red 9.25 8.50
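
Note that pivot_table aggregates with the mean by default, which is why pivot2 shows Freq averaged across eye colors rather than totals. Pass aggfunc to get sums instead:

pivot_sum = _input.pivot_table(index='Hair', columns=['Sex'],
                               values='Freq', aggfunc='sum')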

apply() a Custom Function
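
apply() runs an arbitrary function over a df's columns (or rows, with axis=1). A minimal sketch on the pivot2 table from above, scaling each column to fractions of its total:

pivot2.apply(lambda col: col / col.sum())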

Groupby

To use groupby in pandas to get the count of each category, you can run the following code on dummy data:

In [13]:
import pandas as pd
import numpy as np

# categories
fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']

# total of 100 items
num_items = 100

# Randomly choose the number of items for each fruit type
counts = np.random.choice(range(1, num_items + 1), len(fruits), replace=False)
counts = counts / counts.sum() * num_items

# Create the data dictionary
data_dict = {'group': [], 'attribute': []}  # avoid the name `data`: it would shadow pydataset's data()

# Populate the data dictionary with random-sized groups
for count, fruit in zip(counts, fruits):
    data_dict['group'].extend(np.repeat(fruit, int(count)))
    data_dict['attribute'].extend(np.tile(fruit, int(count)))

df = pd.DataFrame(data_dict)
# shuffle the DataFrame
df = df.sample(frac=1).reset_index(drop=True)

# Apply groupby and get the count of each type of fruit
fruit_counts = df.groupby('attribute').size().reset_index(name='count')
print(fruit_counts)
    attribute  count
0       apple     35
1      banana     42
2      cherry     10
3        date     10
4  elderberry      1
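
For a plain count per category, value_counts() gives the same numbers with less ceremony:

df['attribute'].value_counts()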

Filter a DataFrame

Use boolean indexing to filter a df based on particular values in a specific column:

In [3]:
import pandas as pd

# 'df' is the fruit DataFrame built above; we filter its 'attribute'
# column on two desired values

desired_value, desired_value2 = 'apple', 'banana'  # replace with the values you want to filter on
filtered_df = df[(df['attribute'] == desired_value) | (df['attribute'] == desired_value2)]

# 'filtered_df' now contains only the rows where 'attribute' is 'apple' or 'banana'
filtered_df
Out[3]:
group attribute
1 apple apple
8 apple apple
10 banana banana
12 banana banana
16 apple apple
17 banana banana
22 apple apple
31 apple apple
38 banana banana
40 apple apple
43 apple apple
45 apple apple
46 apple apple
56 banana banana
58 apple apple
60 apple apple
62 apple apple
63 banana banana
65 apple apple
73 apple apple
75 apple apple
76 apple apple
79 apple apple
85 apple apple
91 banana banana
93 banana banana
95 banana banana
In [6]:
filtered_df.attribute.value_counts()
Out[6]:
apple     18
banana     9
Name: attribute, dtype: int64
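
An equivalent, more scalable spelling for this kind of filter uses .isin():

filtered_df = df[df['attribute'].isin(['apple', 'banana'])]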

pd.get_dummies

Of course I had to play with the Titanic dataset. Wanting to split the class attribute into separate columns for first, second, and third class, I used pandas' get_dummies() function to create binary columns for each class. Then I added these binary columns to the original df.
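
A minimal sketch, assuming a df_titanic with a 'class' column holding the passenger class (the names df_titanic and 'class' are assumptions; they vary between versions of the Titanic data):

# 'class' is assumed to hold values like 'First', 'Second', 'Third'
class_dummies = pd.get_dummies(df_titanic['class'], prefix='class')
df_titanic = pd.concat([df_titanic, class_dummies], axis=1)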
