Introduction¶
Here I want to write a little about pandas, which I play with from time to time, and to share some useful pandas methods. Note that the input data comes from the pydataset module, which ships with a bunch of popular datasets.
import pandas as pd
from pydataset import data
type(data)                       # data is the loader function provided by pydataset
_input = data('HairEyeColor')    # load the HairEyeColor dataset as a DataFrame
Data Structures¶
There are two core data structures for manipulating data in pandas: the Series and the DataFrame.
Pandas Series Object¶
A Series is data wrapped in a one-dimensional array with indexed items; the index defaults to integers, but string labels also work.
Slicing works with square-bracket notation, and the .values attribute returns the underlying NumPy array.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])   # named s so it does not shadow pydataset's data()
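A quick demonstration of the slicing and .values remarks, using the Series just created:
s['a':'c']    # label-based slicing is inclusive of both endpoints
s.iloc[0:2]   # positional slicing behaves like a NumPy array
s.values      # the underlying NumPy array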
A curious application is to view a Series as a specialization of a Python dictionary.
population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population
Pandas DataFrame Object¶
A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
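To make the idea of flexible row and column labels concrete, here is a small sketch that combines the population Series from above with a second Series; the area figures are approximate and only illustrative.
# Approximate land areas in square kilometres (illustrative values)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
states = pd.DataFrame({'population': population, 'area': pd.Series(area_dict)})
states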
from sklearn import datasets
iris = datasets.load_iris()
print(iris.data[:10])
print(iris.target)
print(iris.DESCR[:1000])
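The iris arrays above are plain NumPy objects; wrapping them in a DataFrame gives them proper column names (feature_names is part of the object returned by load_iris).
# Wrap the NumPy arrays in a DataFrame with named columns
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df.head()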
There are also ready-made datasets to play with in scikit-learn and SciPy.
from sklearn.datasets import load_iris, make_classification
# load_boston was deprecated and removed in scikit-learn 1.2, so it is no longer importable:
# from sklearn.datasets import load_boston
# data_boston = load_boston()
data_iris = load_iris()   # note the parentheses: the loader has to be called
Otherwise, scikit-learn has a utility to generate synthetic classification datasets.
from sklearn.datasets import make_classification
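A minimal sketch of generating a toy dataset with it; the parameter values below are only illustrative.
# Generate a small synthetic binary-classification dataset
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=2, random_state=0)
X.shape, y.shape   # ((200, 5), (200,))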
The input looks like this:
_input.head(7)
Getting a sub-part of the DataFrame is as simple as:
df_sliced = _input.iloc[0:5, 0:2] # rows then columns indices passed to the iloc
df_sliced
Transpose df¶
Given the sliced DataFrame df_sliced from above, you can transpose it like this:
df_transposed = df_sliced.T
# or equivalently
df_transposed = df_sliced.transpose()
Both return a new DataFrame; .T and .transpose() have no inplace parameter, so assign the result if you want to keep it.
df_T = df_sliced.T
df_T
Create Contingency Table¶
The columns representing Hair and Eye colour are used to make a cross-table. HairEyeColor is already aggregated, so going from the table to counts per Hair/Eye combination means summing Freq over Sex, which pd.crosstab can do directly:
_input = data('HairEyeColor')
contingency_ds = pd.crosstab(_input["Hair"], _input["Eye"],
                             values=_input["Freq"], aggfunc="sum")
contingency_ds
Pivot Table¶
pivot_table is useful for making a report on relevant features: it aggregates the values of a chosen column, grouped by the selected index and column labels.
_input.head()
pivot = _input.pivot_table(index='Hair', columns=['Sex', 'Eye'], values='Freq', aggfunc='sum')  # default aggfunc is 'mean'
pivot
pivot2 = _input.pivot_table(index='Hair', columns=['Sex'], values='Freq', aggfunc='sum')  # 'sum' gives total counts per Hair and Sex
pivot2
apply() a Custom Function¶
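The heading calls for a custom function, so here is a minimal sketch applied to the HairEyeColor frame loaded earlier; the Freq_pct column name is made up for the example.
# Express each group's Freq as a percentage of the total, via a custom lambda
total = _input['Freq'].sum()
_input['Freq_pct'] = _input['Freq'].apply(lambda f: round(100 * f / total, 1))
_input.head()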
Groupby¶
To apply groupby in pandas to get the count of each category, you can use the following code on dummy data:
import pandas as pd
import numpy as np
# categories
fruits = ['apple', 'banana', 'cherry', 'date', 'elderberry']
# total of 100 items
num_items = 100
# Randomly choose the number of items for each fruit type
counts = np.random.choice(range(1, num_items + 1), len(fruits), replace=False)
counts = counts / counts.sum() * num_items
# Create the data dictionary (named data_dict so it does not shadow pydataset's data())
data_dict = {'group': [], 'attribute': []}
# Populate the data dictionary with random-sized groups
for count, fruit in zip(counts, fruits):
    data_dict['group'].extend(np.repeat(fruit, int(count)))
    data_dict['attribute'].extend(np.tile(fruit, int(count)))
df = pd.DataFrame(data_dict)
# shuffle the DataFrame
df = df.sample(frac=1).reset_index(drop=True)
# Apply groupby and get the count of each type of fruit
fruit_counts = df.groupby('attribute').size().reset_index(name='count')
print(fruit_counts)
Filter Data Frame¶
Use boolean indexing to filter a DataFrame based on a particular value in a specific column:
import pandas as pd
# Assuming 'df' is the fruit DataFrame built above
# Filter the 'attribute' column on two specific values
desired_value, desired_value2 = 'apple', 'banana'  # replace with the values you want to keep
filtered_df = df[(df['attribute'] == desired_value) | (df['attribute'] == desired_value2)]
# 'filtered_df' now contains only the rows where 'attribute' equals one of the desired values
filtered_df
filtered_df.attribute.value_counts()
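The same filter can be written more compactly with .isin(), which is handy when there are more than two values to keep.
# Equivalent boolean mask built with isin()
filtered_df = df[df['attribute'].isin([desired_value, desired_value2])]
filtered_df.attribute.value_counts()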
pd.get_dummies¶
Of course I need to play with the Titanic dataset. To split the class attribute into separate columns for first, second, and third class, I use pandas' get_dummies() function to create binary columns for each class and then add these binary columns to the original DataFrame.
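A minimal sketch of that workflow; the Titanic data itself is not loaded here, so a tiny stand-in DataFrame with a 'class' column is used, and the column and value names are assumptions.
# Stand-in for the Titanic data: only the passenger class matters here
titanic = pd.DataFrame({'class': ['First', 'Third', 'Second', 'Third', 'First'],
                        'survived': [1, 0, 1, 0, 1]})
# One binary column per class value
class_dummies = pd.get_dummies(titanic['class'], prefix='class')
# Attach the binary columns to the original frame
titanic = pd.concat([titanic, class_dummies], axis=1)
titanic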