Filter a DataFrame Based on A Condition

Boris Paskhaver
A free video tutorial from Boris Paskhaver
Software Engineer | Consultant | Author
4.7 instructor rating • 6 courses • 312,758 students

Lecture description

In this lesson, we'll filter rows from the DataFrame based on a single condition. The logic involves creating a Boolean Series of True and False values, then passing it in square brackets after our DataFrame.

Learn more from the full course

Data Analysis with Pandas and Python

Analyze data quickly and easily with Python's powerful pandas library! All datasets included --- beginners welcome!

20:33:02 of on-demand video • Updated November 2021

  • Perform a multitude of data operations in Python's popular "pandas" library including grouping, pivoting, joining and more!
  • Learn hundreds of methods and attributes across numerous pandas objects
  • Possess a strong understanding of manipulating 1D, 2D, and 3D data sets
  • Resolve common issues in broken or incomplete data sets
English [Auto] All right, in this lesson, we'll finally dive into filtering our data frames, and when I say filter, I mean extracting those rows that meet a certain condition. So let's begin by executing all of our code from the previous lesson and taking a look at the first three rows of our data frame. So here we have our fake list of employees and let's say I want to pull all of the male employees in our company, and in order to do that, I have to pull those rows where the value in the gender column is equal to male. So to begin the filter process, I have to focus on a single series, and that's going to be the series where I'm going to be checking for a condition. So in this example, I have to begin by selecting our gender series so I can do that by writing D.F. and then using my square brackets to extract and then writing gender in there. And there we have our series now in order to compare all of the values in a series to a specific value. And this is going to be a broadcasting operation. I'm actually going to use the Python sign the language line for comparison, which is the double equals sign. Now, you have to be very careful here, because if you accidentally use the single equal sign, what it's going to do is assign the values on the right to the values in that series. So, for example, what I want to do here is compare to mail, compare the values in the gender column to male. If I accidentally forgot that second equals it's simply going to overwrite all of the values in the gender column with that string value of male. So just be careful and remember that you always want those double equals for comparison. And what this is going to return to us is a brand new series and it's going to be a Boolean series. It's going to be a series that only has trues and falses and those trues and falses will reflect the accuracy of the condition. So, for example, in the first two rows, we have males. And because it is equal to male, those two rows are going to return. True. However, when it starts comparing the value in row two or rather the row with the index position of two, we have female, which is not going to be equal to male. So the evaluation for that row is going to return false. Let's take a look. So when I execute this, I'm going to get a series of booleans and those booleans reflect whether the condition that I'm evaluating is valid or not true or false. Now, this is where it gets a little bit funky in order to extract these specific rose ones where the gender column is male or in other words, where this is a true in this new boyin series, I have to pass this entire code into the square brackets following my data frame. So what I'm going to do is write D.F. here, which is my data frame, and then I'm going to put those square brackets. We use those in the previous lesson to extract single columns. It's the exact same syntax. But now within those square brackets, I'm going to pass the code that I just wrote. Now, what it's doing is it's operating on that internal code first, and that's generating the Boolean series that we see below. Whenever we pass a Boolean series within those square brackets, Panas knows that we want it to only extract those rows that have a true value. And in this case, those rows with a true value will be those that have male. So now if I execute this code, we're going to get a new data frame where you can see that it's only taken the rows that have male. So you can see as I scroll down and there's no females present. Let's do another example just to get us familiar with the syntax. So I'm going to hide this right here by double clicking. Let's say I want to pull every person who's on the team of finance. I can see one example here. Again, I have to begin by focusing on the specific series that I want to perform a test on, let's say I want to extract the team series and now I want to compare every value within this series to the word finance. So I'm going to use the double equals sign. Again, that's the literal comparison. It's going to check whether every single value here is equal to finance. And if it is, as we see here in rows two and three, it's going to return true for everything else, including null values. It's going to return false because those values are not equal to finance. NULL is also not equal to finance. So now that we have this booleans series of trues and falses, if we pass this within square brackets right after our name of our data frame, which is D.F., we're going to get that data frame where the values in the team column are equal only to finance. Now, the reason the syntax is a little bit funky is because we're going to have the data frame name listed twice. We're going to have the outside data frame variable and then the square brackets and then inside we're using that data frame variable again to extract one of the series. So we're going to have D.F. Bracket D.F. that Syntex can be a little bit confusing. So there is is a slightly more elegant way to do this. If you don't like the syntax, what we can do is assign the resulting boolean series. That's the first thing we generate here to its own separate variable. So let me give you an example. I'm once again going to repeat my code. I'm going to begin by extracting the team series, which is where I want to check my conditions. I'm now going to make it equal to finance, or rather I'm going to check whether the values are equal to finance, which is my string value here. I'm going to get my boolean series and here I'm going to assign this to a variable and usually a very common variable name for this is mask, but you can use whatever you'd like. Sometimes I use extract, sometimes I do condition. So now we have mask, which is storing the Boolean series that's generated by this code. And now after D.F. and my square brackets, instead of copying all the code that we have here, which is what we did in the previous examples, I can just write in Mask, which represents that Boolean series, and I'm going to get the exact same result, which is a data frame where the values in the team column are all finance. Let's do another example, and this one is going to be a little bit trickier. But if you're smart, you can figure it out pretty quickly. Let's say I want to extract the rows from our senior management column where we have a true so wherever we do have a senior management employee. So things like this, I want to extract these rows. If I begin by pulling out my senior management column, we can see that it already consists of boolean values. It's already a Boolean series. So while we could do something like compare these values to true, we don't have to because if we do, this is going to return the exact same series. Wherever it's true, it's going to say, oh, true is equal to true. That must be true. As long as we have a boolean series of any type and we have one right here, we can pass that directly in our square brackets. And pandas will return only the rows where there is a true value in the senior management column, as we can see here. Let's do a couple more examples. Let's talk about filtering based on a condition not being met, so let's say I want to pull any person who is not on the team of marketing. Once again, I'm going to begin by pulling the team column. And this time, instead of using the double equals sign to compare, literally, I'm going to use the not equal comparison, which is an exclamation point followed by an equal sign. And I'm going to check whether it's not equal to marketing. So any time we run into marketing, it's going to return a false now because the value is not equal to marketing. It's false because the value, in fact is equal. So now you can see here where we have marketing on the first row, we're going to get a false and every other value below, including the other teams and the null values are going to return. True, because those values are not equal to marketing. Now, if I have my condition, I can either feed it into those square brackets right here. That's one way of doing it. Or I can simply assign this Boolean series to its own separate variable and pass that variable in between my square brackets. Let's do a couple more examples. Instead of using the comparison and non comparison or non quality operator. Let's use greater than and less than this time. Let's pull everybody whose salary is greater than one hundred and ten thousand. So you can see here we have a salary column. That's the one that we're going to apply our condition on. So we have to begin by extracting it with D.F. bracket salary. And now instead of using either the double equals to compare literally to an exact salary or not equals to compare not against the specific salary. I'm just going to use the greater than sine and give it a value of one hundred ten thousand. So now it's going to return another billion series of trues and falses. And wherever we see true, the salary is greater than one hundred ten thousand. And wherever we see false, we know that the salary is not greater than one hundred and ten thousand. Now that we have that condition we can wrap that inside our square brackets and feed that right after the name of our data frame, which is D.F.. And now I felt it for the rose, where the salary is greater than one hundred and ten thousand, let's do yet another example. Let's do an example where we do than let's pull any employee who has a bonus percentage, less than one point five. I'll begin by extracting my bonus percentage column. It's going to type it out and then press tab. So there's my bonus percentage column of floats. I'm going to use the less than sign in order to compare to those that are less than one point five. Wherever the value in the series is less than one point five, I'm going to get a true and a false otherwise. And now if I wrap that in square brackets and write the data frame in front of it, D.F. Now I have extracted those rows where the bonus, as we can see, is less than one point five. And the great thing about this as well is that this less than or greater than Syntex also works with dates. And this will be our last example. Let me collapse this. Let's say we want to pull all employees who started before or on January 1st. Nineteen eighty five. So we have a start date column in our data frame. Let's begin by extracting it. So I'm going to do D.F. square brackets start date. And there we have our series of dates and now I'm going to use the exact same syntax less then. But in order to make it less than or equal to, I can do this syntax less than or equal to with the less than sign immediately followed by the equals sign. And I can actually write in the date exactly as a string. So I'm going to mirror the format we see below and I'm just going to enter my date. So I want to compare it to January 1st, nineteen eighty five which is nineteen eighty five, dash zero one, dash zero one. That's going to return a Boolean series, which is going to be true if the date is prior to or on that date and they false if it does not fall before January 1st. Nineteen eighty five. Now that I have that boolean series I can assign it to a variable like mask. And finally after the name of my data frame, which is D.F., I can write those square brackets, feed that variable that represents my boolean series and take a look. And there, as you can see, the start date in every single column here is less than or equal to January 1st. Nineteen eighty five. So here we have all of our nineteen eighties employees. As we scroll down, we can see that all of these dates do fall before. So we do have the comparison operator with the double equals, the non equality operator, which is the exclamation point in the equals. We have less than less than or equal to greater than and greater than an equal to. And you can use all of these to set up your own boolean series. And as long as you feed in that boolean series into those square brackets following the data frame name, it's going to extract only those rows that fit that condition. Or in other words, only those rows where the Boolean series has a true for that specific row. And if that Syntex gets a little bit too inelegant or a little bit too ugly due to the multiple square brackets and the multiple names of the data frames, you can always store that Boolean series in its own separate variable and simply feed that variable into the square brackets after the data frame name. So that's how we filter a data frame based on a single condition. And in the next lesson will expand on this topic to talk about how we can filter with more than one condition.