To find duplicate rows in a DataFrame based on all or selected columns, use the pandas DataFrame.duplicated() method. Real-world datasets are often messy. For example, they may contain duplicate rows, which can skew your analysis.
Pandas duplicate rows
To find duplicate rows in a Pandas DataFrame, call the DataFrame.duplicated() method on it. DataFrame.duplicated() is a pandas method that identifies duplicate rows based on all or specific columns. It returns a Boolean Series with True for each row flagged as a duplicate.
Syntax
The syntax of the DataFrame.duplicated() method is as follows.
DataFrame.duplicated(subset=None, keep='first')
Parameters
- subset:
  - A single column label or a sequence of labels to use for the duplication check. If you do not provide it, all columns are compared when looking for duplicate rows.
- keep:
  - It determines which occurrences are marked as duplicates. Its value can be one of {'first', 'last', False}, and the default is 'first'.
  - first: All duplicates except their first occurrence will be marked as True.
  - last: All duplicates except their last occurrence will be marked as True.
  - False: All duplicates, including every occurrence, will be marked as True.
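To make the three keep options concrete, here is a minimal sketch on a tiny, hypothetical one-column DataFrame (the column name key and its values are made up for illustration and are not part of the tutorial's dataset).

```python
import pandas as pd

# Hypothetical toy data: 'a' appears three times, 'b' appears once.
df = pd.DataFrame({"key": ["a", "b", "a", "a"]})

# keep='first': every repeat after the first occurrence is flagged.
first_mask = df.duplicated(keep="first")

# keep='last': every repeat before the last occurrence is flagged.
last_mask = df.duplicated(keep="last")

# keep=False: every row that has a duplicate anywhere is flagged.
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Note that the unique row ('b') is never flagged, whichever option you choose.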
Example
Let’s create a sample DataFrame that contains duplicate values.
# app.py
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print(dfObj)
Output
python3 app.py
               Name  Seasons        Actor
0   Stranger Things        3       Millie
1   Game of Thrones        8       Emilia
2  La Casa De Papel        4       Sergio
3         Westworld        3  Evan Rachel
4   Stranger Things        3       Millie
5  La Casa De Papel        4       Sergio
As you can see, the above dataframe contains duplicate rows.
Find Duplicate Rows based on all columns.
To find and select duplicate rows based on all columns, call DataFrame.duplicated() without a subset argument. It returns a Boolean Series with True at each duplicated row except its first occurrence (the default value of the keep argument is 'first'). Then pass this Boolean Series to the [] operator of the DataFrame to select the duplicate rows.
See the following code.
# app.py
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Find duplicate rows (all columns compared, first occurrence kept)
duplicateDFRow = dfObj[dfObj.duplicated()]
print(duplicateDFRow)
Output
python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio
Here, all the duplicate rows except their first occurrence are returned because the keep argument's default value is 'first'.
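It can also help to look at the Boolean Series itself before using it as a filter. The following is a small sketch that reuses the same sample data; the variable names mask and duplicate_count are only illustrative. Because True counts as 1, summing the mask gives the number of rows flagged as duplicates.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Boolean Series: True for each row that repeats an earlier row
mask = dfObj.duplicated()
print(mask.tolist())  # [False, False, False, False, True, True]

# True is counted as 1, so the sum is the number of flagged rows
duplicate_count = int(mask.sum())
print(duplicate_count)  # 2
```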
If we want to select all the duplicate rows except their last occurrence, we must pass keep='last'. See the following code.
# app.py
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Find duplicate rows, keeping the last occurrence unflagged
duplicateDFRow = dfObj[dfObj.duplicated(keep='last')]
print(duplicateDFRow)
Output
python3 app.py
               Name  Seasons   Actor
0   Stranger Things        3  Millie
2  La Casa De Papel        4  Sergio
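The third keep option, False, selects every occurrence of a duplicated row rather than leaving one out. Here is a brief sketch on the same sample data; the variable name allDuplicates is just an illustrative choice.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# keep=False flags every row that has at least one identical twin
allDuplicates = dfObj[dfObj.duplicated(keep=False)]
print(allDuplicates)

# Rows 0 and 4 (Stranger Things) and rows 2 and 5 (La Casa De Papel)
print(list(allDuplicates.index))  # [0, 2, 4, 5]
```

This form is handy when you want to inspect all copies side by side before deciding which one to keep.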
Find Duplicate Rows based on selected columns.
If we want to compare rows and find duplicates based on selected columns, we should pass a list of column names as the subset argument of the DataFrame.duplicated() function. It will then flag duplicate rows based on those columns only.
For example, let's find and select duplicate rows based on a single column.
# app.py
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Find duplicate rows based on the 'Name' column only
duplicateDFRow = dfObj[dfObj.duplicated(['Name'])]
print(duplicateDFRow)
Output
python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio
Here, rows whose 'Name' value already appeared in an earlier row are marked as duplicates and returned.
Let’s see another example.
Find and select duplicate rows based on two columns.
# app.py
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Find duplicate rows based on the 'Name' and 'Seasons' columns
duplicateDFRow = dfObj[dfObj.duplicated(['Name', 'Seasons'])]
print(duplicateDFRow)
Output
python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio
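In the examples so far, the subset happens to give the same result as comparing all columns. The choice of columns can change what counts as a duplicate, though. As a rough sketch on the same sample data, checking only the Seasons column (picked here purely for illustration) also flags Westworld, because its season count matches an earlier row even though the row as a whole is unique.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Compare rows on the 'Seasons' column only
bySeasons = dfObj[dfObj.duplicated(subset=['Seasons'])]

# Compare rows on every column (the default)
fullRow = dfObj[dfObj.duplicated()]

print(list(bySeasons.index))  # [3, 4, 5]
print(list(fullRow.index))    # [4, 5]
```

So it is worth choosing subset columns that really identify a record, otherwise unrelated rows may be reported as duplicates.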
Conclusion
If you want to find the duplicate rows in a Pandas DataFrame, you can use the DataFrame.duplicated() function, optionally with the subset and keep arguments.
That’s it for this tutorial.