AppDividend

Pandas DataFrame drop_duplicates() Function Example


The Pandas drop_duplicates() function finds duplicate rows in a DataFrame and removes them. It is one of the most commonly used functions in the Pandas library when cleaning and analyzing datasets.

Understand Pandas DataFrame drop_duplicates()

The Pandas drop_duplicates() function returns a DataFrame with the duplicate rows removed. By default, two rows count as duplicates only when they match in every column.

Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Parameters

It has the following parameters:

  • subset: It takes a column label or a list of column labels. By default, it is None, which means all columns are compared. When columns are passed, only those columns are used to identify duplicates.
  • keep: It controls which of the duplicate rows is kept. It can have 3 values. By default, it is 'first'. The meanings of the three values are:
    • 1 – 'first' – Keeps the first occurrence and drops the later duplicates.
    • 2 – 'last' – Keeps the last occurrence and drops the earlier duplicates.
    • 3 – False – Drops every row that has a duplicate, so no copy is kept.
  • inplace: It takes a boolean value. If True, the duplicate rows are removed from the original DataFrame itself instead of a new DataFrame being returned. By default, it is False.
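The three keep options are easiest to compare side by side. The following is a minimal sketch on a small made-up DataFrame (the data and variable names are assumptions chosen for illustration), printing which index labels survive in each case.

```python
import pandas as pd

# Hypothetical sample data: "Kolkata" appears at index 1, 3 and 4.
df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"],
})

# keep='first' (the default): the first Kolkata row (index 1) survives.
first = df.drop_duplicates(subset="City", keep="first")

# keep='last': the last Kolkata row (index 4) survives.
last = df.drop_duplicates(subset="City", keep="last")

# keep=False: every row whose City is repeated is dropped.
none_kept = df.drop_duplicates(subset="City", keep=False)

print(first.index.tolist())      # [0, 1, 2]
print(last.index.tolist())       # [0, 2, 4]
print(none_kept.index.tolist())  # [0, 2]
```

Because inplace is left at its default of False, df itself is unchanged by all three calls.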

Return Value

The drop_duplicates() function returns a new DataFrame with the duplicate rows removed, or None if inplace=True.
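The difference between the two return modes can be checked directly. This is a small sketch with made-up data; the variable names deduped and result are only illustrative.

```python
import pandas as pd

# Hypothetical sample data with one repeated row.
df = pd.DataFrame({"City": ["Patna", "Kolkata", "Kolkata"]})

# Default (inplace=False): a new DataFrame comes back and df is untouched.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 3 2

# inplace=True: df itself is modified; as described above, the call is
# documented to return None, so there is nothing useful to assign.
result = df.drop_duplicates(inplace=True)
print(result)
print(len(df))  # 2
```

A common mistake is writing df = df.drop_duplicates(inplace=True), which would replace df with the call's return value instead of the cleaned DataFrame.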

Example program on drop_duplicates()

Write a program to show the working of drop_duplicates().

import pandas as pd

# Rows 1 and 3 are identical in every column.
data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Karan"],
             "Age": [21, 23, 31, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)

# With no arguments, all columns are compared and the first occurrence is kept.
df2 = df.drop_duplicates()
print("\n After removal of duplicate rows:\n")
print(df2)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3   Karan   23  Kolkata

 After removal of duplicate rows:

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai

In the above example, the row for the student Karan appears twice with the same Name, Age, and City. After calling drop_duplicates(), the second copy (index 3) is removed.
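Before dropping anything, it can help to see which rows would be treated as duplicates. One way to do that, sketched here on the same data as the example above, is the related DataFrame.duplicated() method, which returns a boolean mask instead of removing rows. The variable name mask is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Karan"],
    "Age": [21, 23, 31, 23],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"],
})

# duplicated() marks each row that repeats an earlier row
# (it uses keep='first' by default, like drop_duplicates()).
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True]

# Filtering with the inverted mask gives the same rows as drop_duplicates().
print(df[~mask].equals(df.drop_duplicates()))  # True
```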

Example 2: Write a program to remove rows that have duplicate values in a particular column using drop_duplicates().

See the following code.

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit"],
             "Age": [21, 23, 31, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df.drop_duplicates(subset="City",
                   keep=False, inplace=True)
print("\nDataFrame after removing students belonging to same city:\n", df)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata

DataFrame after removing students belonging to same city:
      Name  Age    City
0   Rohit   21   Patna
2  Shivam   31  Mumbai

In the above example, we can see that Karan and Ajit both belong to the same city, Kolkata.

Hence we dropped duplicates based on a single column, "City", ignoring all other columns. Because we passed keep=False, both Kolkata rows were removed rather than one of them being kept. After that, we printed the resulting DataFrame.

Drop duplicates and keep the last row

To keep the last row and remove all the other duplicate rows, pass keep='last' as an argument.

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df.drop_duplicates(subset="City",
                   keep='last', inplace=True)
print("\nDataFrame after removing students belonging to same city:\n", df)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata
4    Ajit   23  Kolkata

DataFrame after removing students belonging to same city:
      Name  Age     City
0   Rohit   21    Patna
2  Shivam   31   Mumbai
4    Ajit   23  Kolkata

In this example, we are removing the rows based on City.

In our DataFrame, City = 'Kolkata' appears three times, so two of those rows are removed. Because we passed keep='last', the last one is kept.

You can see that the rows at index 1 and 3 are removed, while the row at index 4 is kept because of keep='last'.
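Note that the surviving rows keep their original index labels (0, 2, and 4 in the output above). If a consecutive index is preferred, one option is to chain reset_index(drop=True), as in this small sketch; the variable names kept and renumbered are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
    "Age": [21, 23, 31, 23, 23],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"],
})

# The original labels of the surviving rows are preserved.
kept = df.drop_duplicates(subset="City", keep="last")
print(kept.index.tolist())  # [0, 2, 4]

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Pandas 1.0 and later also accept ignore_index=True in drop_duplicates() itself, which should give the same renumbered result in a single call.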

Remove Duplicate Rows based on Specific Columns

To remove duplicate rows based on specific columns, pass a list of column names to the subset parameter.

If you are using a single column, you can pass the column name on its own, but if you have multiple columns, you need to pass a list containing the column names.

import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23], "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)
df.drop_duplicates(subset=["City", "Age"],
                   keep='last', inplace=True)
print("\nDataFrame after removing students belonging to same city and age:\n", df)

Output

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata
4    Ajit   23  Kolkata

DataFrame after removing students belonging to same city and age:
      Name  Age     City
0   Rohit   21    Patna
2  Shivam   31   Mumbai
4    Ajit   23  Kolkata

In this example, we have passed two columns, Age and City, so a row is treated as a duplicate only when both of those values match another row. All three Kolkata rows have Age 23, so the result is the same as in the previous example.
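To see the second column actually change the outcome, consider a variant of the data in which one Kolkata student has a different age. This is a sketch with a hypothetical modification (Karan's age set to 25), not part of the example above.

```python
import pandas as pd

# Hypothetical variant: Karan's age is 25, so his (City, Age) pair
# no longer matches Ajit's rows.
df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
    "Age": [21, 25, 31, 23, 23],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"],
})

# City alone: all three Kolkata rows collapse to the last one.
by_city = df.drop_duplicates(subset="City", keep="last")

# City and Age together: only the two identical Ajit rows collapse.
by_city_age = df.drop_duplicates(subset=["City", "Age"], keep="last")

print(by_city.index.tolist())      # [0, 2, 4]
print(by_city_age.index.tolist())  # [0, 1, 2, 4]
```

With both columns in subset, Karan's row survives because no other row shares both his city and his age.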

Conclusion

The drop_duplicates() function returns a DataFrame containing only the unique rows. If you want to get the distinct rows of a DataFrame, use the df.drop_duplicates() method.

See also

Pandas DataFrame transform()

Pandas DataFrame rank()

Pandas DataFrame apply()
