3 ways to filter Pandas DataFrame by column values

3 ways to filter Pandas DataFrame by column values

Some flexible approaches to combine multiple filters

Image for postPhoto by Nathan Dumlao on Unsplash

Looking for some good book recommendations in Goodreads ? I found this quote,

In the Information Age, the first step to sanity is FILTERING. Filter the information: extract for knowledge. ? Marc Stiegler

I wondered how it applies to Data Science! This quote fits perfectly to the most significant and the most underrated step in the entire Data Science process ? Data Preprocessing! Data Scientists enjoy building models so much that they overlook this process. In actual essence, this process can be intriguing. As mentioned in the quote above, filtering is knowledgable. Filtering data can really guarantee some sanity when you are stumbled upon which variables to fit on the model. In this article, I?ll share some quick ways of filtering data using Pandas.

To jump in, I have obtained this dataset from Kaggle. The dataset contains some sample sales data and looks like the one below.

Image for postSample Sales Data

A simple look into the data and what it has to say:

Image for postSales Data Shape and ColumnsImage for postSales Data Info

Looks good! Let?s dive in.

1) Filtering based on one condition:

There is a DEALSIZE column in this dataset which is either small or medium or large Let?s say we want to know the details of all the large deals. A simple way would be,

Image for postLarge Deals

Filtering is pretty candid here. You pick the column and match it with the value you want. A common confusion when it comes to filtering in Pandas is the use of conditional operators. Python syntax creates trouble for many.

Note: In Pandas, and is replaced with & , or is replaced with | and not is replaced with ~

I find out that Madrid is the top-ranking city in terms of revenue. I?d like to compare the sales details of Madrid against all the other cities. This can be achieved by assigning conditions to variables.

Image for postComparing Madrid against other cities

Assigning conditions to variables can be a good practice since it creates a mask that can be applied whenever we want to access data under such conditions.

2) Filtering based on multiple conditions:

Let?s see if we can find all the countries where the order is on hold in the year 2005

Image for postApplying Multiple filters

Inside .loc , the condition within the parentheses evaluates to a boolean value which is then applied upon the column specified.

The data returned from multiple filters depends on the operation performed. When & and | operations are performed without an assignment, a series is returned. When the | operation is performed with an assignment, it modifies the DataFrame.

The list of conditions to be performed upon the DataFrame can increase drastically. Let?s consider a use case. I find out that Spain ranks second in generating total revenue, see if there any orders in Spain where the Sales didn?t cross 5000 and the Quantityordered is less than 50. This can be done in two ways:

  • Either hard-code the list

Image for postHard Coding Multiple conditions

  • Or build a list that is dynamically evaluated based on the criteria

Image for postDynamically evaluated list

3) Implementing the If then Constructs:

When solving problems, a format such as this is most common:

Consider a case where a new column called Income Statement is created that contains three categories ? if sales is greater than 10000 then it?s considered gain, if the sales range is between 5000?10000 then it?s considered as no change and anything less than that is a loss Let?s look at a few ways of implementing this:

  • Define a function that executes this logic and apply that to all columns in a DataFrame

?if elif else? inside a function

  • Using a lambda function

using a lambda function

  • Implementing a loop can be faster than .apply

A loopy solution

These three are more pandas-y ways of arriving at the solution. There are many other alternatives to arrive at the solution. Feel free to check my GitHub repo here that contains all the code mentioned in the article!

Thank you for reading all the way down here. Let me know in the comment section if you have any concerns, feedback, or criticism. Have a good day!


No Responses

Write a response