How to Find Duplicate Rows in Pandas DataFrame

Python Pandas - Find Duplicate Rows In DataFrame Based On All Or Selected Columns

If you want to find duplicate rows in a DataFrame based on all or selected columns, use the pandas DataFrame.duplicated() method. Real-world datasets are often messy, and duplicate rows are a common problem because they can skew your analysis.

Pandas duplicate rows

To find duplicate rows in a Pandas DataFrame, call the duplicated() method on the DataFrame. DataFrame.duplicated() checks for duplicate rows based on all columns or on specific columns, and returns a Boolean Series with True for each row that is marked as a duplicate.

Syntax

The syntax of the DataFrame.duplicated() method is as follows.

DataFrame.duplicated(subset=None, keep='first')

Parameters

  • subset :
    • A single column label or a list of column labels to use when checking for duplicates. If you do not provide it, all columns are compared.
  • keep :
    • It determines which occurrence, if any, is left unmarked. Its value can be one of {'first', 'last', False}, and the default is 'first'.
      • 'first': All duplicates except the first occurrence are marked as True.
      • 'last': All duplicates except the last occurrence are marked as True.
      • False: All duplicates, including the first and last occurrences, are marked as True.
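The three keep options are easiest to compare side by side. The following is a minimal sketch on a hypothetical one-column DataFrame (the column name 'letter' and its values are made up for illustration), where the value 'a' appears at positions 0, 2, and 3.

```python
import pandas as pd

# Hypothetical toy data: 'a' appears at positions 0, 2 and 3.
df = pd.DataFrame({'letter': ['a', 'b', 'a', 'a']})

# Mark every repeat except the first occurrence
first_mask = df.duplicated(keep='first')

# Mark every repeat except the last occurrence
last_mask = df.duplicated(keep='last')

# Mark every row that has at least one identical twin
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Note that the row containing 'b' is never marked, because it has no duplicate under any setting.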

Example

Let’s create a sample DataFrame that contains duplicate values.

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print(dfObj)

Output

 python3 app.py
               Name  Seasons        Actor
0   Stranger Things        3       Millie
1   Game of Thrones        8       Emilia
2  La Casa De Papel        4       Sergio
3         Westworld        3  Evan Rachel
4   Stranger Things        3       Millie
5  La Casa De Papel        4       Sergio

As you can see, the above dataframe contains duplicate rows.

Find Duplicate Rows based on all columns.

If we want to find and select the rows that are duplicated across all columns, call DataFrame.duplicated() without a subset argument. It returns a Boolean Series with True at each duplicated row except the first occurrence (the default value of the keep argument is 'first'). Then pass this Boolean Series to the [] operator of the DataFrame to select the duplicate rows.

See the following code.

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select the duplicate rows (all columns compared, first occurrence excluded)
duplicateDFRow = dfObj[dfObj.duplicated()]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio

Here, all duplicate rows except their first occurrence are returned, because the default value of the keep argument is 'first'.
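It can help to look at the Boolean Series on its own before using it as a filter. The sketch below reuses the sample data from this tutorial; the variable names mask and duplicate_count are only illustrative. Because True counts as 1, summing the mask gives the number of rows marked as duplicates.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# The Boolean Series that duplicated() returns, one value per row
mask = dfObj.duplicated()
print(mask.tolist())  # [False, False, False, False, True, True]

# Summing the mask counts the rows marked as duplicates
duplicate_count = int(mask.sum())
print(duplicate_count)  # 2
```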

If we want to select all the duplicate rows except their last occurrence, we must pass keep='last'. See the following code.

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select the duplicate rows, excluding each last occurrence
duplicateDFRow = dfObj[dfObj.duplicated(keep='last')]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
0   Stranger Things        3  Millie
2  La Casa De Papel        4  Sergio
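The third option, keep=False, is described in the parameter list but is not demonstrated in the examples here. As a rough sketch on the same sample data, it selects every occurrence of each duplicated row. Sorting the result with sort_values() is an optional extra step, added here only so that matching rows sit next to each other.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# keep=False marks every occurrence of a duplicated row, first and last included
allOccurrences = dfObj[dfObj.duplicated(keep=False)]
print(allOccurrences.index.tolist())  # [0, 2, 4, 5]

# Optional: sort so that identical rows appear together
groupedOccurrences = allOccurrences.sort_values('Name', kind='mergesort')
print(groupedOccurrences)
```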

Find Duplicate Rows based on selected columns.

If we want to compare rows and find duplicates based on selected columns only, we should pass a list of column names as the subset argument of DataFrame.duplicated(). Only those columns are then compared when deciding whether a row is a duplicate.

For example, let’s find and select duplicate rows based on a single column.

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select rows whose 'Name' value already appeared in an earlier row
duplicateDFRow = dfObj[dfObj.duplicated(['Name'])]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio

Here, every row whose value in the 'Name' column has already appeared in an earlier row is marked as a duplicate and returned.

Let’s see another example.

Find and select duplicate rows based on two columns.

# app.py

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
         ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Select rows whose ('Name', 'Seasons') pair already appeared in an earlier row
duplicateDFRow = dfObj[dfObj.duplicated(['Name', 'Seasons'])]
print(duplicateDFRow)

Output

python3 app.py
               Name  Seasons   Actor
4   Stranger Things        3  Millie
5  La Casa De Papel        4  Sergio
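In the sample data, both subset examples return the same rows as the all-columns check, so the effect of subset is easy to miss. The sketch below picks the 'Seasons' column instead, a choice made here purely for illustration. 'Westworld' shares its season count with 'Stranger Things', so it is flagged even though the row as a whole is unique.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]

dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Compare rows on 'Seasons' only; the first row with each value is not marked
seasonsDuplicates = dfObj[dfObj.duplicated(subset=['Seasons'])]
print(seasonsDuplicates.index.tolist())  # [3, 4, 5]

# With keep=False, every row sharing a 'Seasons' value is marked
allSeasonsDuplicates = dfObj[dfObj.duplicated(subset=['Seasons'], keep=False)]
print(allSeasonsDuplicates.index.tolist())  # [0, 2, 3, 4, 5]
```

Only 'Game of Thrones' (8 seasons) is left out of the second result, because no other row has the same 'Seasons' value.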

Conclusion

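To find the duplicate rows in a Pandas DataFrame, use the DataFrame.duplicated() method, optionally with the subset argument to limit the comparison to selected columns and the keep argument to control which occurrences are marked.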

That’s it for this tutorial.

See also

Pandas set_index()

Pandas sort_values()

Pandas boolean_indexing()

Pandas value_counts()
