Some flexible approaches to combine multiple filters
Photo by Nathan Dumlao on Unsplash
Looking for some good book recommendations in Goodreads ? I found this quote,
In the Information Age, the first step to sanity is FILTERING. Filter the information: extract for knowledge. ? Marc Stiegler
I wondered how it applies to Data Science! This quote fits perfectly to the most significant and the most underrated step in the entire Data Science process ? Data Preprocessing! Data Scientists enjoy building models so much that they overlook this process. In actual essence, this process can be intriguing. As mentioned in the quote above, filtering is knowledgable. Filtering data can really guarantee some sanity when you are stumbled upon which variables to fit on the model. In this article, I?ll share some quick ways of filtering data using Pandas.
To jump in, I have obtained this dataset from Kaggle. The dataset contains some sample sales data and looks like the one below.
Sample Sales Data
A simple look into the data and what it has to say:
Sales Data Shape and ColumnsSales Data Info
Looks good! Let?s dive in.
1) Filtering based on one condition:
There is a DEALSIZE column in this dataset which is either small or medium or large Let?s say we want to know the details of all the large deals. A simple way would be,
Large Deals
Filtering is pretty candid here. You pick the column and match it with the value you want. A common confusion when it comes to filtering in Pandas is the use of conditional operators. Python syntax creates trouble for many.
Note: In Pandas, and is replaced with & , or is replaced with | and not is replaced with ~
I find out that Madrid is the top-ranking city in terms of revenue. I?d like to compare the sales details of Madrid against all the other cities. This can be achieved by assigning conditions to variables.
Comparing Madrid against other cities
Assigning conditions to variables can be a good practice since it creates a mask that can be applied whenever we want to access data under such conditions.
2) Filtering based on multiple conditions:
Let?s see if we can find all the countries where the order is on hold in the year 2005
Applying Multiple filters
Inside .loc , the condition within the parentheses evaluates to a boolean value which is then applied upon the column specified.
The data returned from multiple filters depends on the operation performed. When & and | operations are performed without an assignment, a series is returned. When the | operation is performed with an assignment, it modifies the DataFrame.
The list of conditions to be performed upon the DataFrame can increase drastically. Let?s consider a use case. I find out that Spain ranks second in generating total revenue, see if there any orders in Spain where the Sales didn?t cross 5000 and the Quantityordered is less than 50. This can be done in two ways:
- Either hard-code the list
Hard Coding Multiple conditions
- Or build a list that is dynamically evaluated based on the criteria
Dynamically evaluated list
3) Implementing the If then Constructs:
When solving problems, a format such as this is most common:
Consider a case where a new column called Income Statement is created that contains three categories ? if sales is greater than 10000 then it?s considered gain, if the sales range is between 5000?10000 then it?s considered as no change and anything less than that is a loss Let?s look at a few ways of implementing this:
- Define a function that executes this logic and apply that to all columns in a DataFrame
?if elif else? inside a function
- Using a lambda function
using a lambda function
- Implementing a loop can be faster than .apply
A loopy solution
These three are more pandas-y ways of arriving at the solution. There are many other alternatives to arrive at the solution. Feel free to check my GitHub repo here that contains all the code mentioned in the article!
Thank you for reading all the way down here. Let me know in the comment section if you have any concerns, feedback, or criticism. Have a good day!