Wed. Oct 5th, 2022


Data Summarization Using Pandas In Python

Pandas, Pandas and Pandas. When it comes to data manipulation and analysis, few tools serve the purpose better than Pandas. In previous stories, we learned many data operations using pandas. Today, we are going to explore data summarization using pandas in Python. So, without spending much time on the intro, let’s roll!


Data Summarization

Data summarization means extracting the raw data and presenting it as a summary. Raw data on its own rarely makes sense to your audience. Breaking the data into subsets and then summarizing the insights from each one lets you craft a neat story.

Pandas offers many functions such as count, value_counts, crosstab, groupby, and more to present the raw data in an informative way.

Well, in this story, we are going to explore these data summarization techniques using pandas in Python.


Pandas Count

Pandas count is a very simple function that returns the number of non-null data points. Its applications are limited compared to crosstab and groupby, but it is a quick way to spot missing values.

Before we move forward, let’s import all the required libraries for data summarization in Python.

#Pandas
import pandas as pd

#Numpy
import numpy as np

#Matplotlib 
import matplotlib.pyplot as plt

#seaborn 
import seaborn as sns

Now, let’s load our Titanic data. I am using this dataset because its attributes make data summarization easy to follow, whether you are a beginner or a pro.

#titanic data

data = pd.read_csv('titanic.csv')

We can start with some basic information about the data: the column names and their data types.

#data columns 

data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

#data types

data.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Well, we have both numerical and categorical data types in our data, which will make the summaries more interesting.

Now, it’s time to count the non-null values along both columns and rows.

#count of values in columns 

data.count(axis = 0)
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

You can see that most of the columns have 891 values. But Age (714), Cabin (204), and Embarked (889) have fewer values, which indicates the presence of null values or missing data. Let’s look at the rows for the same.

#count of values in rows

data.count(axis = 1)
0      11
1      12
2      11
3      12
4      11
       ..
886    11
887    12
888    10
889    12
890    11
Length: 891, dtype: int64

You can observe that not all the rows have the same number of values. An ideal row of this data should have 12 values.
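To make the column-wise and row-wise counts concrete, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are hypothetical and only stand in for the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic file
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Non-null values per column
per_column = toy.count(axis=0)

# Non-null values per row
per_row = toy.count(axis=1)

# Missing values per column: total rows minus the non-null count
missing = len(toy) - per_column

print(per_column)
print(per_row)
print(missing)
```

The `missing` series matches what `toy.isna().sum()` returns, so either form can be used to find the columns that need cleaning.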

Index

You can inspect the data by index level as well. Let’s use the set_index function for the same.

#set index 

data = data.set_index(['Sex','Pclass'])
data.head(2)
[Output: the first two rows of the data, now indexed by Sex and Pclass]

That’s our data viewed at the index level!

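Now, we have 2 attributes as our data index. So, let’s count the values at the ‘Sex’ level. Recent pandas versions (2.0 and later) no longer accept the level argument in count, so we group by the index level instead.

#count level 

data.groupby(level = 'Sex').count()
[Output: non-null counts per column for each value of Sex]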

Similarly for ‘Pclass’:

#count level 

data.groupby(level = 'Pclass').count()
[Output: non-null counts per column for each value of Pclass]

That’s useful information to have before you start modeling the data.
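Counting per index level can be sketched with groupby on a small made-up table. The `toy` DataFrame below is hypothetical, and grouping on an index level is offered as one way to get per-level counts in pandas versions where count no longer takes a level argument.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same two-level index as in the text
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, np.nan, 54.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 51.86, 8.05],
}).set_index(["Sex", "Pclass"])

# Non-null counts per column for each value of the 'Sex' level
by_sex = toy.groupby(level="Sex").count()

# Non-null counts per column for each value of the 'Pclass' level
by_class = toy.groupby(level="Pclass").count()

print(by_sex)
print(by_class)
```

Because count skips nulls, the Age column can show a smaller number than Fare within the same group, which tells you where the missing values sit.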


Pandas Value_counts

The value_counts function offers more than count does, still in 1-2 lines of code. It counts how often each distinct value occurs, which covers the most common group by use case in a single call.

Since we moved ‘Sex’ and ‘Pclass’ into the index earlier, we first turn them back into regular columns.

#value counts

data = data.reset_index()
data.value_counts(['Pclass'])
Pclass
3         491
1         216
2         184
dtype: int64

That’s cool. We now know how many passengers belong to each of the three classes.

One of the best features of the value_counts function is that you can also normalize the counts into proportions.

#normalization 

data.value_counts(['Pclass'], normalize = True, sort = True, ascending = True)
Pclass
2         0.206510
1         0.242424
3         0.551066
dtype: float64

Here, we have not only normalized the values but also sorted them in ascending order, so the smallest class comes first.
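What normalize does can be checked by hand: each proportion is the count divided by the total. Here is a minimal sketch on a made-up column (the `pclass` values are hypothetical).

```python
import pandas as pd

# Hypothetical class labels for eight passengers
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 3], name="Pclass")

# Raw counts, most frequent value first
counts = pclass.value_counts()

# Proportions that sum to 1
shares = pclass.value_counts(normalize=True)

# The same proportions computed by hand from the counts
by_hand = counts / counts.sum()

# Proportions as percentages, rounded for presentation
percent = (shares * 100).round(1)

print(counts)
print(percent)
```

Multiplying by 100 and rounding, as in `percent`, is usually the friendlier form for a report.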

For a continuous attribute with no discrete levels, such as ‘Fare’, we can create bins. Let’s see how it works.

#bins

data['Fare'].value_counts(bins=5)
(-0.513, 102.466]     838
(102.466, 204.932]     33
(204.932, 307.398]     17
(409.863, 512.329]      3
(307.398, 409.863]      0
Name: Fare, dtype: int64

Well, we have created 5 equal-width bins for ‘Fare’. Most of the fares (838 of 891) fall in the lowest bin, roughly 0 to 102. Note that this output says nothing about which Pclass those tickets belong to.
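Equal-width bins put almost every fare into the first bin because a few very expensive tickets stretch the range. One alternative is to choose the bin edges yourself with pd.cut. The sketch below uses made-up fares, and the edges and labels are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical fares, including one very expensive ticket
fares = pd.Series(
    [7.25, 71.28, 7.92, 53.10, 8.05, 512.33, 26.00, 13.00], name="Fare"
)

# Hand-picked bin edges and labels (assumptions, not from the article)
edges = [0, 10, 50, 100, 600]
labels = ["0-10", "10-50", "50-100", "100-600"]

# Assign each fare to a band; include_lowest keeps a fare of exactly 0
fare_band = pd.cut(fares, bins=edges, labels=labels, include_lowest=True)

# Count fares per band, listed in band order rather than by frequency
band_counts = fare_band.value_counts().sort_index()

print(band_counts)
```

With uneven edges like these, the cheap tickets are split into finer bands and the single outlier no longer dictates the bin width.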


Pandas Crosstab

A crosstab is a simple function that builds a frequency table of two or more variables, showing how often each combination of values occurs. It is very handy for quickly analyzing the relationship between two variables.

Now, let’s see the relationship between Sex and the survival of the passengers in the data.

#crosstab

pd.crosstab(data['Sex'],data['Survived'])
Survived    0    1
Sex
female     81  233
male      468  109

You can see a clear relationship between Sex and survival: most women survived, while most men did not. We can plot this data for better visibility.

[Figure: plot of the Sex versus Survived crosstab]

That’s cool! I hope things are clearer now.
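Raw counts are easier to compare as rates. In the table above, 233 of 314 women survived (about 74%), against 109 of 577 men (about 19%). The crosstab function can compute such row-wise rates itself through its normalize argument, and add totals through margins. Here is a minimal sketch on made-up rows (the `toy` data is hypothetical).

```python
import pandas as pd

# Hypothetical passengers: 3 women and 5 men
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0],
})

# Counts with row and column totals in an extra 'All' row and column
counts = pd.crosstab(toy["Sex"], toy["Survived"], margins=True)

# Share of each outcome within every row, so each row sums to 1
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(rates.round(2))
```

Normalizing by "index" answers the question "of all women, what fraction survived?", while normalize="columns" would answer "of all survivors, what fraction were women?".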

We can do much more with crosstab. We can add multiple layers of data to the table and even visualize the result.

#multiple layers crosstab

pd.crosstab([data['Pclass'], data['Sex']], [data['Embarked'], data['Survived']],
           rownames = ['Pclass', 'gender'],
           colnames = ['Embarked', 'Survived'],
           dropna=False)
[Output: crosstab with Pclass and gender as row levels, Embarked and Survived as column levels]
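A layered crosstab has a multi-level index on both axes, so single cells and sub-tables are read with tuples. The sketch below shows this on a handful of made-up rows. The `toy` data is hypothetical, while the rownames and colnames follow the call above.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male", "male"],
    "Embarked": ["C", "S", "S", "S", "S", "Q"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# Two row layers (Pclass, gender) and two column layers (Embarked, Survived)
table = pd.crosstab(
    [toy["Pclass"], toy["Sex"]],
    [toy["Embarked"], toy["Survived"]],
    rownames=["Pclass", "gender"],
    colnames=["Embarked", "Survived"],
)

# One cell: third-class men who embarked at S and did not survive
cell = table.loc[(3, "male"), ("S", 0)]

# Sub-table for a single port, dropping the Embarked layer
port_s = table.xs("S", axis=1, level="Embarked")

print(table)
print(cell)
print(port_s)
```

Every passenger lands in exactly one cell, so the sum of all cells equals the number of rows in the data.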

There is a lot of information in just one table. That’s crosstab for you! Finally, let’s plot a heatmap of this table and see how it works.

#heatmap

import seaborn as sns
sns.heatmap(pd.crosstab([data['Pclass'],data['Sex']],[data['Embarked'],data['Survived']]),annot = True, fmt = 'd')
[Figure: annotated heatmap of the crosstab counts]

We have got a heatmap that shows at a glance where most of the passengers fall. Strictly speaking, it shows the counts from the crosstab rather than correlations.


Data Summarization – Conclusion

Data manipulation and analysis matter because they reveal key insights and hidden patterns in your data. In this regard, data summarization is one of the best techniques you can use to get a first understanding of your data.

That’s all for now and I hope this story helps you in your analysis. Happy Python!!!
