Pandas DataFrame drop_duplicates: The Complete Guide

The drop_duplicates() method is one of the commonly used methods in the Pandas library, and it is important when cleaning and analyzing datasets.

Pandas DataFrame drop_duplicates

The Pandas drop_duplicates() function finds duplicate rows in a DataFrame and removes them. It returns a DataFrame with the duplicate rows removed.

Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters

It has the following parameters:

  • subset: It takes a column label or a list of column labels. The default is None, which means all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
  • keep: It controls which of the duplicate rows to keep. It can take three values, and the default is 'first'. The meaning of the three values is:
    • 1 – 'first' – It keeps the first occurrence and drops the rest of the matching rows as duplicates.
    • 2 – 'last' – It keeps the last occurrence and drops the rest of the matching rows as duplicates.
    • 3 – False – It treats all of the matching rows as duplicates and drops every one of them.
  • inplace: It takes a boolean value. If True, the duplicate rows are removed from the original DataFrame itself instead of returning a new DataFrame.
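The three keep options can be compared side by side. The following is a minimal sketch on a small made-up DataFrame (the data and variable names are assumptions for illustration), in which the rows at index 1 and 3 are identical.

```python
import pandas as pd

# Hypothetical sample data: rows at index 1 and 3 are identical.
df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Karan"],
    "Age": [21, 23, 31, 23],
})

kept_first = df.drop_duplicates(keep="first")  # drops index 3
kept_last = df.drop_duplicates(keep="last")    # drops index 1
kept_none = df.drop_duplicates(keep=False)     # drops index 1 and 3

print(list(kept_first.index))  # [0, 1, 2]
print(list(kept_last.index))   # [0, 2, 3]
print(list(kept_none.index))   # [0, 2]
```

Only the choice of surviving row differs between 'first' and 'last'; with False, no row from a duplicated group survives.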

Return Value

The drop_duplicates() function returns a DataFrame with the duplicate rows removed, or None if inplace=True.
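The difference between the two return values is easy to miss, so here is a small sketch of it. The DataFrame contents and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with one repeated city.
df = pd.DataFrame({"City": ["Patna", "Kolkata", "Kolkata"]})

# Default (inplace=False): a new DataFrame is returned, df is unchanged.
new_df = df.drop_duplicates()
rows_before = len(df)

# inplace=True: df itself is modified and the call returns None.
result = df.drop_duplicates(inplace=True)

print(len(new_df))     # 2
print(rows_before)     # 3
print(result is None)  # True
print(len(df))         # 2
```

A common mistake is writing df = df.drop_duplicates(inplace=True), which leaves df set to None.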

Example program on drop_duplicates()

Write a program to show the working of drop_duplicates().

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Karan"],
             "Age": [21, 23, 31, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df2 = df.drop_duplicates()
print("\n After removal of duplicate rows:\n")
print(df2)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3   Karan   23  Kolkata

 After removal of duplicate rows:

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai

In the above example, the rows at index 1 and 3 are identical in every column (the student Karan). With the default keep='first', drop_duplicates() keeps the row at index 1 and removes the repeated row at index 3.
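Before dropping rows, it can help to check which rows would be treated as duplicates. One way to do this, sketched here as a suggestion rather than as part of the original example, is the related DataFrame.duplicated() method, which returns a boolean Series using the same keep logic.

```python
import pandas as pd

# Same data as the example above.
data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Karan"],
             "Age": [21, 23, 31, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)

# True marks a row that drop_duplicates() would remove (keep='first' by default).
flags = df.duplicated()
print(flags.tolist())  # [False, False, False, True]

# Selecting the rows that are NOT flagged gives the same rows as drop_duplicates().
same = df[~flags].equals(df.drop_duplicates())
print(same)  # True
```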

Example 2: Write a program to remove duplicates from a particular column using drop_duplicates().

See the following code.

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit"],
             "Age": [21, 23, 31, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df.drop_duplicates(subset="City",
                   keep=False, inplace=True)
print("\nDataFrame after removing students belonging to same city:\n", df)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata

DataFrame after removing students belonging to same city:
      Name  Age    City
0   Rohit   21   Patna
2  Shivam   31  Mumbai

In the above example, we can see Karan and Ajit belong to the same city, Kolkata.

Hence, we've dropped duplicates considering a single column, "City", and ignoring all other columns. Because we passed keep=False, both rows with Kolkata are removed. After that, we printed the resultant DataFrame.

Drop duplicates and keep the last row

To keep the last row and remove all the other duplicate rows, use keep='last' as an argument.

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df.drop_duplicates(subset="City",
                   keep='last', inplace=True)
print("\nDataFrame after removing students belonging to same city:\n", df)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata
4    Ajit   23  Kolkata

DataFrame after removing students belonging to same city:
      Name  Age     City
0   Rohit   21    Patna
2  Shivam   31   Mumbai
4    Ajit   23  Kolkata

In this example, we are removing the rows based on City.

In our DataFrame, City = 'Kolkata' appears three times, so two of those rows are removed. Because we passed keep='last', the last of the three rows is kept.

You can see that the rows at index 1 and 3 are removed, and the row at index 4 is kept because of keep='last'.
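Note that the surviving rows keep their original index labels (0, 2, 4 in the output above). If a fresh 0-based index is wanted, one option is to chain reset_index(drop=True). This is a small sketch of that step, not something the original example does.

```python
import pandas as pd

# Same data as the example above.
data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)

deduped = df.drop_duplicates(subset="City", keep="last")
print(list(deduped.index))  # [0, 2, 4]

# Discard the old labels and renumber the rows from 0.
renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1, 2]
```

In pandas 1.0 and later, passing ignore_index=True to drop_duplicates() should give the same renumbering in a single call.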

Remove Duplicate Rows based on Specific Columns

To remove duplicate rows based on specific columns, pass those column names to the subset parameter. If you are checking a single column, you can pass its name directly and don't have to wrap it in a list. If you are checking multiple columns, you need to provide a list containing the column names.

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df.drop_duplicates(subset=["City", "Age"],
                   keep='last', inplace=True)
print("\nDataFrame after removing students belonging to same city and age:\n", df)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata
4    Ajit   23  Kolkata

DataFrame after removing students belonging to same city and age:
      Name  Age     City
0   Rohit   21    Patna
2  Shivam   31   Mumbai
4    Ajit   23  Kolkata

In this example, we have passed two columns, Age and City, to subset. Rows are treated as duplicates only when they match on both of these columns, regardless of the values in the Name column.
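To illustrate how the choice of subset changes the result, here is a minimal sketch. The data is a hypothetical variation of the example above, in which Ajit's age is changed to 25 so that the City column alone has a duplicate but the combination of City and Age does not.

```python
import pandas as pd

# Hypothetical data: Karan and Ajit share a city but differ in age.
df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Ajit"],
    "Age": [21, 23, 31, 25],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"],
})

# A single column can be passed as a string or as a one-element list.
by_str = df.drop_duplicates(subset="City")
by_list = df.drop_duplicates(subset=["City"])
print(by_str.equals(by_list))  # True
print(list(by_str.index))      # [0, 1, 2]

# With two columns, rows must match on BOTH to count as duplicates.
by_two = df.drop_duplicates(subset=["City", "Age"])
print(list(by_two.index))      # [0, 1, 2, 3]
```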

Conclusion

The drop_duplicates() method returns only the DataFrame's unique rows. If you want to get the distinct rows from a DataFrame, use the df.drop_duplicates() method.
