Pandas Dataframe.filter() is an inbuilt function that is used to subset columns or rows of DataFrame according to labels in the particular index. The DataFrame filter() returns subset the DataFrame rows or columns according to the detailed index labels. One thing to note that this routine does not filter a DataFrame on its contents. The filter() function is applied to the labels of the index.
How To Filter Pandas Dataframe
Python Pandas allows us to slice and dice the data in multiple ways. Often, you may want to subset a pandas dataframe based on one or more values of a specific column. Necessarily, we would like to select rows based on one value or multiple values present in a column.
To filter data in Pandas, we have the following options.
- Use Pandas filter() method.
- Use Pandas query() function
- Use Pandas DataFrame indices.
Syntax
DataFrame.filter(self: ~FrameOrSeries, items=None, like: Union[str, NoneType] = None, regex: Union[str, NoneType] = None, axis=None)
Parameters
items: list-like
Keep labels from the axis which are in items.
like: str
Keep labels from the axis for which “like in label == True”.
regex: str (regular expression)
Keep labels from axis for which re.search(regex, label) == True.
axis: {0 or ‘index’, 1 or ‘columns’, None}, default value is None.
The axis to filter on, expressed either as the index (int) or axis name (str).
By default, this is an info axis, ‘index’ for Series, ‘columns’ for DataFrame.
Example
Let’s use an external CSV file for this example. The file I am using is called People.csv file, and we will import the data using pandas read_csv() function.
Then we will create a DataFrame from the CSV data.
For this example, we only select the first 10 rows, so I have used DataFrame.head() function to limit the rows to 10.
Then we will use the filter() function to select the data based on labels.
See the below code.
# app.py import pandas as pd dt = pd.read_csv('people.csv') df = pd.DataFrame(data=dt) df10 = df.head(10) print(df10)
Output
python3 app.py Name Sex Age Height Weight 0 Alex M 41 74 170 1 Bert M 42 68 166 2 Carl M 32 70 155 3 Dave M 39 72 167 4 Elly F 30 66 124 5 Fran F 33 66 115 6 Gwen F 26 64 121 7 Hank M 30 71 158 8 Ivan M 53 72 175 9 Jake M 32 69 143
You can see that we have a total of 5 columns and 10 rows.
Now, we will select only Name, Height, and Weight using the Pandas filter() method.
# app.py import pandas as pd dt = pd.read_csv('people.csv') df = pd.DataFrame(data=dt) df10 = df.head(10) print(df10) dFilter = df10.filter(['Name', 'Height', 'Weight']) print(dFilter)
Output
python3 app.py Name Height Weight 0 Alex 74 170 1 Bert 68 166 2 Carl 70 155 3 Dave 72 167 4 Elly 66 124 5 Fran 66 115 6 Gwen 64 121 7 Hank 71 158 8 Ivan 72 175 9 Jake 69 143
Pandas filter with Python regex
Let’s pass a regular expression parameter to the filter() function.
Python RegEx or Regular Expression is the sequence of characters that forms the search pattern.
Python RegEx can be used to check if the string contains the specified search pattern.
Let’s select columns by its name that contain ‘A’.
# app.py import pandas as pd dt = pd.read_csv('people.csv') df = pd.DataFrame(data=dt) df10 = df.head(10) dFilter = df10.filter(regex ='[A]') print(dFilter)
Output
python3 app.py Age 0 41 1 42 2 32 3 39 4 30 5 33 6 26 7 30 8 53 9 32
The regular expression ‘[A]‘ looks for all column names, which has an ‘A’.
We have only one column that contains A; that is why it returns the Age column.
Pandas select rows in DataFrame using filter()
Let’s select data based on the index of the DataFrame.
# app.py import pandas as pd dt = pd.read_csv('people.csv') df = pd.DataFrame(data=dt) df10 = df.head(10) dFilter = df10.filter(like='6', axis=0) print(dFilter)
In the above code, we are selecting that row whose index is 6.
In Pandas DataFrame, the index starts from 0. So 6 must be 7th index in the DataFrame.
python3 app.py Name Sex Age Height Weight 6 Gwen F 26 64 121
Pandas filter using df.query()
The filter() is not the only function we can use to filter the rows and columns.
Pandas DataFrame.query() is an inbuilt function that is useful to filter the rows.
See the following code.
# app.py import pandas as pd dt = pd.read_csv('people.csv') df = pd.DataFrame(data=dt) df10 = df.head(10) dFilter = df10.query('Age>40') print(dFilter)
Output
python3 app.py Name Sex Age Height Weight 0 Alex M 41 74 170 1 Bert M 42 68 166 8 Ivan M 53 72 175
In the above example, we are filtering the rows which have Age < 40. So we will get all the people who have Age > 40.
It is like SQL SELECT Query with WHERE Clause.
Pandas DataFrame filter multiple columns
We can filter multiple columns in Pandas DataFrame using & operator, don’t forget to wrap the sub-statements with().
See the following code.
# app.py import pandas as pd dt = pd.read_csv('people.csv') df = pd.DataFrame(data=dt) df10 = df.head(10) dFilter = df10[(df10.Age >= 40) & (df10.Sex == 'M')] print(dFilter)
Output
python3 app.py Name Sex Age Height Weight 0 Alex M 41 74 170 1 Bert M 42 68 166 8 Ivan M 53 72 175
In the above code, we are filtering the data based on two conditions.
- Age > 40
- Sex == M
If one then is False, then it filters out that data.
The returning data will satisfy our conditions.
Conclusion
We can filter Pandas DataFrame using df.filter(), df.query(), and df[] indices method.
We can also filter multiple columns by using & operator.