AppDividend
Latest Code Tutorials

Pandas dropna Function | df.dropna() In Python Pandas


Pandas dropna() is a built-in DataFrame method that removes rows or columns containing missing (Null/None/NA) values. By default, dropna() returns a new DataFrame, and the source DataFrame remains unchanged. We can create missing values using None, pandas.NaT, and numpy.nan.
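
As a quick check, here is a minimal sketch showing that pandas treats all three markers as missing, while an empty string is not considered missing. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# The three markers that pandas recognizes as missing values
missing_markers = [None, np.nan, pd.NaT]

# pd.isna() returns True for each of them
flags = [bool(pd.isna(value)) for value in missing_markers]
print(flags)

# An empty string is a regular value, so dropna() will not remove it
empty_string_is_missing = bool(pd.isna(''))
print(empty_string_is_missing)
```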

Pandas dropna() Function

The Pandas dropna() method lets us find and drop rows or columns with null values in several ways. It is especially useful when we import CSV data into a DataFrame: empty fields in the CSV file are displayed as NaN in the DataFrame.
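
The CSV case can be sketched as follows. The CSV text here is made up for illustration and is read from an in-memory string rather than a file, so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content: two fields are left empty on purpose
csv_text = (
    "Name,Seasons,Actor\n"
    "Stranger Things,3,Millie\n"
    "Game of Thrones,,Emilia\n"
    "La Casa De Papel,4,\n"
)

# Empty fields are parsed as NaN by read_csv()
shows = pd.read_csv(io.StringIO(csv_text))
print(shows)

# Only the fully populated row survives the default dropna()
cleaned = shows.dropna()
print(cleaned)
```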

Syntax

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
  1. axis: The possible values are {0 or 'index', 1 or 'columns'}, default value is 0. If 0, drop rows with missing values. If 1, drop columns with missing values.
  2. how: The possible values are {'any', 'all'}, default 'any'. If 'any', drop the row/column if any of its values are missing. If 'all', drop the row/column only if all of its values are missing.
  3. thresh: It is an int value that sets the minimum number of non-NA values a row/column must have to be kept. Rows/columns with fewer non-NA values are dropped.
  4. subset: It specifies the labels along the other axis in which to look for missing values. When dropping rows, it is a list of column names.
  5. inplace: It is a boolean value. If it is True, then the source DataFrame is changed, and None is returned.
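
The parameters can be combined. The following is a minimal sketch, using a small hypothetical DataFrame (the Notes column is made up for illustration), that drops only the columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'Seasons' is partly missing, 'Notes' is entirely missing
shows = pd.DataFrame({
    'Name': ['Stranger Things', 'Game of Thrones'],
    'Seasons': [3, np.nan],
    'Notes': [np.nan, np.nan],
})

# axis=1 works on columns; how='all' drops a column only if all its values are missing
trimmed = shows.dropna(axis=1, how='all')
print(trimmed.columns.tolist())
```

The Seasons column stays because it still holds one valid value; with the default how='any' it would be dropped as well.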

df.dropna Example

Let’s create a DataFrame in which we will put the np.nan, pd.NaT and None values.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print(dfObj)

Output

python3 app.py
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel       4         None

Now, we want to remove the rows that contain NaN, NaT, or None values from the DataFrame using the df.dropna() function.

See the following code.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print('Before removing all NaT, None, and NaN rows')
print(dfObj)
removedNone = dfObj.dropna()
print('After removing all NaT, None, and NaN rows')
print(removedNone)

Output

python3 app.py
Before removing all NaT, None, and NaN rows
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel       4         None
After removing all NaT, None, and NaN rows
              Name Seasons   Actor
0  Stranger Things       3  Millie

Pandas dropna() function returns DataFrame with NA entries dropped from it.
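
Note that the rows that survive dropna() keep their original index labels. Here is a minimal sketch with a small hypothetical DataFrame; chaining reset_index(drop=True) is one optional way to renumber the rows afterward.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first row has a missing value
shows = pd.DataFrame({
    'Name': ['Game of Thrones', 'Stranger Things', 'La Casa De Papel'],
    'Seasons': [np.nan, 3, 4],
})

# The remaining rows keep their original labels 1 and 2
cleaned = shows.dropna()
print(cleaned.index.tolist())

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```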

Pandas: Drop All Columns with Any Missing Value

We can pass axis=1 to drop every column that contains at least one missing value.

See the following code.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print('Before removing all NaT, None, and NaN columns')
print(dfObj)
removedNoneColumns = dfObj.dropna(axis=1)
print('After removing all NaT, None, and NaN columns')
print(removedNoneColumns)

Output

python3 app.py
Before removing all NaT, None, and NaN columns
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel       4         None
After removing all NaT, None, and NaN columns
               Name
0   Stranger Things
1   Game of Thrones
2         Westworld
3  La Casa De Papel

We have passed axis=1, so dropna() removes every column that contains at least one NaN, None, or NaT value. Only the Name column has no missing values, so it is the only column that remains.
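
Before dropping columns, it can help to check which ones contain missing values. The following is a minimal sketch that reuses the DataFrame from this example; the names has_missing, columns_with_missing, and remaining are only illustrative.

```python
import numpy as np
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# isna().any() is True for each column that holds at least one missing value
has_missing = dfObj.isna().any()
columns_with_missing = has_missing[has_missing].index.tolist()
print(columns_with_missing)

# axis='columns' is an alias for axis=1
remaining = dfObj.dropna(axis='columns').columns.tolist()
print(remaining)
```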

Pandas: Drop the rows if all elements are missing

If we pass the how='all' parameter, then dropna() removes a row only if all of its values are None, NaN, or NaT.

See the following code.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print('Before dropping')
print(dfObj)
removedNoneColumns = dfObj.dropna(how='all')
print('Drop the rows where all elements are missing')
print(removedNoneColumns)

Output

python3 app.py
Before dropping
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel       4         None
Drop the rows where all elements are missing
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel       4         None

From the output, we can see that the dropna() function did not remove any rows, because no row consists entirely of None, NaN, or NaT values.

So, with how='all', a row (or a column, when axis=1) is dropped only if all of its values are null.
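
To see how='all' actually remove something, here is a minimal sketch with a hypothetical DataFrame in which one row is entirely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the middle row has no valid values at all
rows = [('Stranger Things', 3, 'Millie'),
        (None, np.nan, None),
        ('La Casa De Papel', 4, None)]
shows = pd.DataFrame(rows, columns=['Name', 'Seasons', 'Actor'])

# Only the fully empty row is dropped; the partially filled last row is kept
kept = shows.dropna(how='all')
print(kept.index.tolist())
```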

Pandas: Keep only those rows with at least 2 non-NA values

The thresh parameter sets the minimum number of non-NA values a row must have to be kept. So, the dropna(thresh=2) function keeps the rows which have at least 2 non-NA values and drops the rest.

Let’s modify the last row so that it has only 1 non-NA value, and apply the thresh=2 argument to see the desired output.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', None, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print('Before dropping')
print(dfObj)
keptRows = dfObj.dropna(thresh=2)
print('Keep only those rows which have at least 2 non-NA values')
print(keptRows)

Here, DataFrame’s last row has 2 None values, which leaves it with only 1 non-NA value (the Name). So, after applying the dropna(thresh=2) function, that row is removed from DataFrame. The rows with a single NaN or NaT value still have 2 non-NA values, so they are kept. See the following output.

python3 app.py
Before dropping
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel    None         None
Keep only those rows which have at least 2 non-NA values
              Name Seasons        Actor
0  Stranger Things       3       Millie
1  Game of Thrones     NaN       Emilia
2        Westworld     NaT  Evan Rachel
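
Because thresh counts non-NA values (not missing ones), the same selection can be reproduced by hand. The following is a minimal sketch, assuming the same data as above; non_na_counts, manual, and via_thresh are illustrative names.

```python
import numpy as np
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', None, None)]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Count the non-NA values in each row
non_na_counts = dfObj.notna().sum(axis=1)
print(non_na_counts.tolist())

# Keeping rows with at least 2 non-NA values by hand...
manual = dfObj[non_na_counts >= 2]

# ...selects the same rows as dropna(thresh=2)
via_thresh = dfObj.dropna(thresh=2)
print(manual.index.tolist(), via_thresh.index.tolist())
```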

Pandas: Define Labels to look for null values

Let’s specify the columns in which dropna() should look for missing values.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
print('Before dropping')
print(dfObj)
removeDefinedColumns = dfObj.dropna(subset=['Name', 'Actor'])
print('Drop only those rows, whose column names are defined in subset')
print(removeDefinedColumns)

Output

python3 app.py
Before dropping
               Name Seasons        Actor
0   Stranger Things       3       Millie
1   Game of Thrones     NaN       Emilia
2         Westworld     NaT  Evan Rachel
3  La Casa De Papel       4         None
Drop only those rows, whose column names are defined in subset
              Name Seasons        Actor
0  Stranger Things       3       Millie
1  Game of Thrones     NaN       Emilia
2        Westworld     NaT  Evan Rachel

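From the output, you can see that only the last row has a missing value in one of the subset columns (Actor), which is why it has been removed. Rows 1 and 2 are kept even though their Seasons value is missing, because Seasons is not in the subset.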
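
Changing the subset changes which rows are dropped. Here is a minimal sketch on the same data, this time checking only the Seasons column; the variable name checkSeasonsOnly is illustrative.

```python
import numpy as np
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Only 'Seasons' is inspected, so the missing Actor in the last row is ignored
checkSeasonsOnly = dfObj.dropna(subset=['Seasons'])
print(checkSeasonsOnly.index.tolist())
```

Rows 1 and 2 are dropped this time, while the last row is kept despite its missing Actor value.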

Pandas: Keep the DataFrame with valid entries in the same variable

Passing inplace=True makes dropna() modify the DataFrame itself, so the same variable holds only the valid entries afterward.

See the following code.

# app.py

import pandas as pd
import numpy as np

# reading the data
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# Create a DataFrame object
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
dfObj.dropna(inplace=True)
print(dfObj)

Output

python3 app.py
              Name Seasons   Actor
0  Stranger Things       3  Millie

We have passed inplace=True to change the source DataFrame itself; in that case, dropna() returns None. Note that this does not necessarily save memory, because pandas may still create an intermediate copy internally.
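
A common pitfall is assigning the result of an in-place call back to a variable. The following minimal sketch, on the same data, shows that the call returns None and that plain reassignment is an equivalent alternative; the names result and otherObj are illustrative.

```python
import numpy as np
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]

# In-place: the DataFrame is modified and the call itself returns None
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
result = dfObj.dropna(inplace=True)
print(result)
print(len(dfObj))

# Reassignment: the same outcome without inplace=True
otherObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
otherObj = otherObj.dropna()
print(len(otherObj))
```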

Conclusion

If you want to drop rows or columns with missing values from a Pandas DataFrame, either all of them or only those that meet certain conditions, then use the dropna() method. You just need to pass the axis, how, thresh, subset, and inplace parameters that match your requirements.

See also

Pandas read_csv()

Pandas set_index()

Pandas boolean indexing

Pandas iloc[]

Pandas value_counts()
