drop_duplicates() is a Pandas DataFrame method that comes up often when cleaning and analyzing datasets.
Pandas DataFrame drop_duplicates
The Pandas drop_duplicates() function finds duplicate rows in a DataFrame and removes them. It returns a DataFrame with the duplicate rows removed.
Syntax
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
Parameters
It has the following parameters:
- subset: A column label or a list of column labels. The default is None, which means all columns are compared. When columns are passed, only those columns are considered when identifying duplicates.
- keep: Controls which of the duplicate rows to keep. It accepts three values, and the default is 'first':
- 1 – 'first' – Keeps the first occurrence and drops the rest as duplicates.
- 2 – 'last' – Keeps the last occurrence and drops the rest as duplicates.
- 3 – False – Drops all occurrences, so no copy of a duplicated row is kept.
- inplace: A boolean, False by default. If True, the duplicate rows are removed from the DataFrame itself instead of a new DataFrame being returned.
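The effect of the three keep values is easiest to see side by side. The following is a minimal sketch on a small made-up DataFrame (the data and the variable names kept_first, kept_last and kept_none are illustrative assumptions, not part of the original examples):

```python
import pandas as pd

# Hypothetical sample data: the rows at index 1 and 3 are identical.
df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Karan"],
    "Age": [21, 23, 31, 23],
})

kept_first = df.drop_duplicates(keep="first")  # keeps the row at index 1
kept_last = df.drop_duplicates(keep="last")    # keeps the row at index 3
kept_none = df.drop_duplicates(keep=False)     # drops both index 1 and 3

print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```

In every case the surviving rows keep their original index labels and their original order.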
Return Value
The drop_duplicates() function returns a DataFrame with the duplicate rows removed, or None if inplace=True.
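The difference between the two return values can be checked directly. This is a small sketch with made-up data; the names deduped, result and original_len are only for illustration:

```python
import pandas as pd

# Hypothetical sample data with one repeated city.
df = pd.DataFrame({"City": ["Patna", "Kolkata", "Kolkata"]})
original_len = len(df)

# Default (inplace=False): a new DataFrame is returned and df is untouched.
deduped = df.drop_duplicates()
print(original_len, len(deduped))

# inplace=True: df itself is modified and the call returns None.
result = df.drop_duplicates(inplace=True)
print(result, len(df))
```

A common mistake is to write df = df.drop_duplicates(inplace=True), which replaces df with None.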
Example program on drop_duplicates()
Write a program to show the working of drop_duplicates().
import pandas as pd

# The rows at index 1 and 3 are identical.
data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Karan"],
             "Age": [21, 23, 31, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)

# Compare all columns and keep the first occurrence of each row.
df2 = df.drop_duplicates()
print("\n After removal of duplicate rows:\n")
print(df2)
Output
     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3   Karan   23  Kolkata

 After removal of duplicate rows:

     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
In the above example, the row for the student Karan appears twice. After calling drop_duplicates(), only the first occurrence remains and the duplicate row at index 3 is removed.
Example 2: Write a program to remove duplicates from a particular column using drop_duplicates().
See the following code.
import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit"],
             "Age": [21, 23, 31, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)

# Compare only the City column and drop every row whose city repeats.
df.drop_duplicates(subset="City", keep=False, inplace=True)
print("\nDataFrame after removing students belonging to same city:\n", df)
Output
     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata

DataFrame after removing students belonging to same city:
      Name  Age    City
0   Rohit   21   Patna
2  Shivam   31  Mumbai
In the above example, we can see that Karan and Ajit belong to the same city, Kolkata.
Hence, we dropped duplicates considering a single column, "City", and ignoring all other columns. Because we passed keep=False, both of those rows were removed. After that, we printed the resultant DataFrame.
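To check which rows will be dropped before actually dropping them, the related DataFrame.duplicated() method returns a boolean mask and accepts the same subset and keep arguments. The sketch below reuses the data from Example 2; the names mask and manual are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Ajit"],
    "Age": [21, 23, 31, 23],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata"],
})

# True for every row whose City value occurs more than once.
mask = df.duplicated(subset="City", keep=False)
print(mask.tolist())

# Filtering out the flagged rows gives the same result as
# drop_duplicates() called with the same arguments.
manual = df[~mask]
print(manual.equals(df.drop_duplicates(subset="City", keep=False)))
```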
Drop duplicates and keep the last row
To keep the last row and remove all the other duplicate rows, pass keep='last' as an argument.
import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)

# Compare only the City column and keep the last row for each city.
df.drop_duplicates(subset="City", keep='last', inplace=True)
print("\nDataFrame after removing students belonging to same city:\n", df)
Output
     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata
4    Ajit   23  Kolkata

DataFrame after removing students belonging to same city:
      Name  Age     City
0   Rohit   21    Patna
2  Shivam   31   Mumbai
4    Ajit   23  Kolkata
In this example, we are removing rows based on the City column.
In our DataFrame, City = 'Kolkata' appears three times, so two of those rows are removed; because we passed keep='last', the last one is kept.
You can see that the rows at index 1 and 3 are removed, and the row at index 4 is kept because of keep='last'.
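Note that the surviving rows keep their original index labels (0, 2 and 4). If a fresh 0-based index is preferred, drop_duplicates() also accepts an ignore_index parameter, which is not listed in the syntax above and is available in pandas 1.0 and later. A minimal sketch, assuming such a pandas version and reusing the data from this example:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
    "Age": [21, 23, 31, 23, 23],
    "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"],
})

# Keep the last row for each city and relabel the result 0, 1, 2.
result = df.drop_duplicates(subset="City", keep="last", ignore_index=True)
print(result)
```

On older pandas versions, chaining .reset_index(drop=True) after drop_duplicates() has the same effect.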
Remove Duplicate Rows based on Specific Columns
To remove duplicate rows based on specific columns, pass the column names to the subset parameter. If you pass a single column, you don't have to give a list; if you have multiple columns, you need to provide a list containing the column names.
import pandas as pd

data_dict = {"Name": ["Rohit", "Karan", "Shivam", "Ajit", "Ajit"],
             "Age": [21, 23, 31, 23, 23],
             "City": ["Patna", "Kolkata", "Mumbai", "Kolkata", "Kolkata"]}
df = pd.DataFrame(data_dict)
print(df)

# Rows count as duplicates only when both City and Age match.
df.drop_duplicates(subset=["City", "Age"], keep='last', inplace=True)
print("\nDataFrame after removing students belonging to same city and age:\n", df)
Output
     Name  Age     City
0   Rohit   21    Patna
1   Karan   23  Kolkata
2  Shivam   31   Mumbai
3    Ajit   23  Kolkata
4    Ajit   23  Kolkata

DataFrame after removing students belonging to same city and age:
      Name  Age     City
0   Rohit   21    Patna
2  Shivam   31   Mumbai
4    Ajit   23  Kolkata
In this example, we passed two columns, and the duplicate rows are removed based on those columns. We used City and Age as the column names, so rows count as duplicates only when both values match.
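In the data above, every Kolkata row also has the same age, so a one-column and a two-column subset give the same result. The sketch below uses slightly different made-up data (one Kolkata row has a different age) to show where the two diverge, and that a single column can be passed either as a string or as a one-element list. All names and values here are illustrative assumptions:

```python
import pandas as pd

# Hypothetical data: three Kolkata rows, one of them with a different age.
df = pd.DataFrame({
    "Name": ["Rohit", "Karan", "Ajit", "Ajit"],
    "Age": [21, 23, 25, 23],
    "City": ["Patna", "Kolkata", "Kolkata", "Kolkata"],
})

# A single column can be given as a string or as a one-element list.
by_string = df.drop_duplicates(subset="City")
by_list = df.drop_duplicates(subset=["City"])
print(by_string.equals(by_list))

# With two columns, the row at index 2 survives because its Age differs.
by_two = df.drop_duplicates(subset=["City", "Age"])
print(by_string.index.tolist())
print(by_two.index.tolist())
```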
Conclusion
The drop_duplicates() method returns only the unique rows of a DataFrame. If you want to get the distinct rows from a DataFrame, use the df.drop_duplicates() method.