Tue. Oct 4th, 2022

Hello Android

All android in one place

Data Sampling Using Pandas In Python


Hello folks, today let’s shed some light on data sampling using Python pandas. Data sampling is a statistical technique that lets us learn about a large dataset by examining only a part of it. In other words, we take a sample out of the population.

But why do we need Data Sampling?

Many times, data can be huge, which is common in big data analytics. Millions of records can keep you from analyzing the data effectively. In these cases, you can go for sampling and examine a small chunk of the data to get some insights.

Suppose you conduct a large-scale survey.

You have to find the average height of adults in New York City. There are over 6.5 million adults in this city, so it is practically impossible to reach out to every individual and record their height. You also cannot just walk onto a basketball court and measure the people there, because they are generally taller than others and would bias the result.

So, we can neither reach everyone nor rely on one specific group of people. What’s next?

Here comes sampling. You take samples at random times and places and from random people, and then compute the average of those values to estimate the average height of adults in New York City.
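
To make the survey idea concrete, here is a minimal sketch using only the Python standard library. The population is simulated, and its mean of 170 cm and spread of 10 cm are made-up values for illustration, not real figures for New York City.

```python
import random
import statistics

# Hypothetical population: 100,000 simulated adult heights in cm.
# The mean (170) and spread (10) are invented for this illustration.
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Draw a simple random sample of 1,000 heights without replacement.
sample = rng.sample(population, k=1_000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f} cm")
print(f"Sample mean:     {sample_mean:.2f} cm")
```

The two printed values should be close to each other, even though the sample holds only 1% of the population.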


Types of Data Sampling

Yes, we do have multiple data sampling methods. In this story, we will be discussing the below three –

  • Random sampling
  • Condition-based sampling
  • Constant rate sampling

Random sampling: In this sampling technique, every record has an equal chance of being picked. Because it is unbiased, it is well suited to drawing conclusions about the whole dataset.

Condition-based sampling: This sampling technique selects samples only from the records that meet given conditions or criteria.

Constant rate sampling: Here, you specify the rate at which samples are selected. This keeps a constant distance between the selected samples.


Setting Up Data

We will be using the iris dataset for this purpose. But never assume that real-world data will be this small 😛

#import pandas
import pandas as pd

#load data
data = pd.read_csv('irisdata.csv')

#peek into the first rows
data.head()

To recap the steps:
  • Import the pandas module.
  • Call the read_csv function and load the data.
  • Use data.head() function to peek into the data.
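
If you do not have irisdata.csv at hand, the same steps can be tried on a tiny in-memory CSV. This is only a sketch: the column names and the four rows below are assumptions standing in for the real file.

```python
import io

import pandas as pd

# A tiny stand-in for irisdata.csv; column names and rows are assumptions.
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)   # (rows, columns)
print(data.head())  # first five rows by default
```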

1. Random Sampling

The idea of random sampling is that if we have N rows, we extract X of them at random (X ≤ N). Use the pandas sample() function for this.

#subset the data

subset_data = data.sample(n=100)

subset_data

Here, we passed the number of rows (n) to the sample() function to get this subset of the data. You can also specify the sample size as a fraction of the rows using frac. Let’s see how.

#sampling with percentage

subset_data_percentage = data.sample(frac=0.5)

subset_data_percentage

You can confirm the size of the sampled data using the shape attribute, as shown below.

#shape of the data

subset_data_percentage.shape
(75, 5)

Since we asked for 50% of the data, we get 75 rows here: half of the original 150 rows, picked at random.
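
One practical note: sample() returns different rows on every run unless you fix the seed. The sketch below uses the random_state parameter of sample() for that. The 150-row DataFrame is a hypothetical stand-in for the iris data, not the real file.

```python
import pandas as pd

# Hypothetical stand-in for the iris data: 150 rows, one numeric column.
data = pd.DataFrame({"value": range(150)})

# Same seed, same rows: random_state fixes the random draw.
first = data.sample(frac=0.5, random_state=1)
second = data.sample(frac=0.5, random_state=1)

print(first.shape)
print(first.index.equals(second.index))
```

Fixing the seed is handy when you want others to reproduce your analysis on exactly the same subset.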


2. Conditional Sampling

Depending on the use case, you can opt for condition-based sampling. Here, you specify a condition and sample only from the rows that satisfy it. Let’s see how it works.

#conditional sampling
our_condition = data['Species'] == 'Iris-setosa'

#Retrieve the index of the matching rows
index = our_condition[our_condition].index

#sample based on condition 
conditional_subset = data[our_condition].sample(n = 10)

#output 
conditional_subset

Check the shape of the sampled data.

#shape

conditional_subset.shape
(10, 5)

Here –

  • We have defined the condition.
  • Retrieved the indexes of the rows that satisfy it.
  • Sampled the data based on the condition.
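
The retrieved index can also drive the selection itself. The following is a minimal sketch on a made-up eight-row DataFrame (the values and the PetalLength column name are assumptions), using data.loc[index] in place of the boolean mask.

```python
import pandas as pd

# Hypothetical stand-in for the iris data; names and values are assumptions.
data = pd.DataFrame({
    "PetalLength": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1, 1.5, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica",
                "Iris-setosa", "Iris-versicolor"],
})

# Boolean mask: True where the row satisfies the condition.
our_condition = data["Species"] == "Iris-setosa"

# Index labels of the matching rows.
index = our_condition[our_condition].index

# Sample 2 rows from the matching rows only, selected by label.
conditional_subset = data.loc[index].sample(n=2, random_state=0)

print(list(index))
print(conditional_subset)
```

Both data[our_condition] and data.loc[index] select the same rows, so either can feed sample().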

3. Constant Rate Sampling

In this sampling method, we pick samples at a constant interval, or rate. In the example below, we take samples at a rate of 2, that is, every second row. Let’s see how it works.

#defining rate
our_rate = 2

#apply the rate
constant_subset = data[::our_rate]

#data
constant_subset

You can observe that every second data record, starting from the first one, is retrieved as a subset of the original data.
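
The slice can also start from a row other than the first. Here is a small sketch on a hypothetical ten-row DataFrame, using iloc so that the slicing is explicitly by position; the rate of 3 and the starting offset of 1 are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical ten-row DataFrame with the default 0..9 index.
data = pd.DataFrame({"value": range(10)})

our_rate = 3

# Every third row, starting from the first row.
every_third = data.iloc[::our_rate]

# Every third row, starting from the second row.
every_third_from_1 = data.iloc[1::our_rate]

print(list(every_third.index))
print(list(every_third_from_1.index))
```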

Now, we have sampled the data using multiple methods. But what if you want to retrieve the remaining data?

Read on to the next section…


Data Sampling – Data Retrieval

There are two ways to get the remaining data, that is, the data apart from the sampled data. Let’s see both of them.

The first method drops the sampled rows and returns what remains.

#First method

remaining_data = data.drop(labels=constant_subset.index)

remaining_data

Here, you can observe that the rows left out of the sample, that is, the remaining data, are produced as output.

In the second method, we select only those rows that were not part of the sample. In simple words, the first method drops data, while the second method selects it.

#second method

remaining_data_method2 = data[~data.index.isin(constant_subset.index)]

remaining_data_method2

Observe that we get the same output here. The method changes, but not the result.
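
You can check this equivalence in code rather than by eye. The sketch below runs both methods on a hypothetical ten-row DataFrame with a random sample. It assumes the index labels are unique; with duplicate labels, drop() would remove every row sharing a sampled label.

```python
import pandas as pd

# Hypothetical ten-row DataFrame with a unique default index.
data = pd.DataFrame({"value": range(10)})

# Take a random sample of 4 rows.
sampled = data.sample(n=4, random_state=0)

# Method 1: drop the sampled rows by index label.
remaining_drop = data.drop(labels=sampled.index)

# Method 2: keep only the rows whose label is not in the sample.
remaining_mask = data[~data.index.isin(sampled.index)]

print(remaining_drop.equals(remaining_mask))
print(len(sampled) + len(remaining_drop) == len(data))
```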


Data Sampling – Conclusion

Data sampling is one of the key aspects of statistical data analysis. It has many applications, and with it you can extract meaningful insights from big data. I hope you now have an idea of how to use data sampling in your data work, so that big data is not so big anymore…

That’s all for now. Happy Python!!!

More read: Sampling techniques

www.hello-android.com
