The pandas groupby() method splits a DataFrame or Series into groups based on specific criteria, so that aggregation and transformation operations can be performed on each group. The method returns a GroupBy object, to which you can apply aggregation functions such as sum(), mean(), count(), and many more.
Syntax
DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
sort=True, group_keys=True, squeeze=False, **kwargs)
Parameters
The groupby() signature shown above lists seven named parameters.
- by: A mapping, function, label, or list of labels that determines the groups. Its default value is None.
- axis: The axis to split along: 0 for rows or 1 for columns. The default is 0.
- level: If the axis is a hierarchical MultiIndex, the grouping is done by a particular level or by multiple levels.
- as_index: A Boolean, True by default. For aggregated output, it returns an object with the group labels as the index. It is only relevant for DataFrame input.
- sort: A Boolean, True by default. It sorts the group keys; turning it off gives better performance.
- group_keys: A Boolean, True by default. When calling apply, it adds the group keys to the index to identify the pieces.
- squeeze: A Boolean, False by default. It reduces the dimensionality of the return type if possible; otherwise, it returns a consistent type. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0, so leave it out when using newer versions.
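To make the as_index and sort parameters more concrete, here is a minimal sketch. The team and points columns are hypothetical sample data chosen for illustration; they are not part of this article's examples.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the parameters.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],
    "points": [3, 1, 2, 5],
})

# Default (as_index=True): the group labels become the index of the result.
indexed = df.groupby("team").sum()

# as_index=False: the group labels stay in an ordinary column instead.
flat = df.groupby("team", as_index=False).sum()

# sort=False: groups keep their order of first appearance
# instead of being sorted by key.
unsorted = df.groupby("team", sort=False).sum()

print(indexed)
print(flat)
print(unsorted)
```

With the defaults, the groups come back sorted by key ("blue" before "red"). With sort=False they follow the order in which each key first appears in the data.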
Return Value
The groupby() function returns a GroupBy object (a DataFrameGroupBy for DataFrame input) that contains information about the different groups.
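As a rough illustration of what the returned object can do, the sketch below groups a small DataFrame and applies two aggregations. The data is an assumption: it uses names and maths marks like those in this article's examples, stored as integers so that mean() works on them.

```python
import pandas as pd

# Sample data (an assumption for illustration): marks stored as integers.
df = pd.DataFrame({
    "Name": ["Rohit", "Arun", "Sohit", "Arun", "Shubh"],
    "maths": [93, 63, 74, 94, 83],
})

grouped = df.groupby("Name")
print(type(grouped).__name__)

# Select one column from the grouped object, then aggregate it per group.
mean_maths = grouped["maths"].mean()

# size() counts the rows in each group.
counts = grouped.size()

print(mean_maths)
print(counts)
```

Here "Arun" appears twice, so that group has two rows and its mean is the average of 63 and 94.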
Example 1
import pandas as pd
dataset = {
'Name': ['Rohit', 'Arun', 'Sohit', 'Arun', 'Shubh'],
'Roll no': ['01', '02', '03', '04', '05'],
'maths': ['93', '63', '74', '94', '83'],
'science': ['88', '55', '66', '94', '35'],
'english': ['93', '74', '84', '92', '87']}
df = pd.DataFrame(dataset)
by_name = df.groupby(['Name'])
print(by_name)
Output
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10e965250>
In the output, what is that DataFrameGroupBy thing? Printing the object shows only its default string representation, which doesn’t give you much information about what it is or how it works.
The DataFrameGroupBy object can be challenging to wrap your head around because it’s lazy. It doesn’t do any operations to produce a helpful result until you say so.
One term frequently used alongside the .groupby() method is split-apply-combine. This refers to the chain of the following three steps:
- First, split a DataFrame into groups.
- Apply some operations to each of those smaller DataFrames.
- Combine the results.
It can be challenging to inspect df.groupby("Name") because it does virtually nothing until you do something with the resulting object. Again, the pandas GroupBy object is lazy. It delays almost every part of the split-apply-combine process until you call a method.
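The three split-apply-combine steps can be sketched by hand and compared with what groupby() produces. This is only an illustration of the idea under assumed sample data (integer maths marks), not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Sample data (an assumption for illustration): marks stored as integers.
df = pd.DataFrame({
    "Name": ["Rohit", "Arun", "Sohit", "Arun", "Shubh"],
    "maths": [93, 63, 74, 94, 83],
})

# Split: build one smaller DataFrame per distinct name.
pieces = {name: df[df["Name"] == name] for name in sorted(df["Name"].unique())}

# Apply: compute the mean maths mark within each piece.
means = {name: piece["maths"].mean() for name, piece in pieces.items()}

# Combine: gather the per-group results into a single Series.
manual = pd.Series(means, name="maths")

# groupby() covers the same three steps, but only once a method is called.
lazy = df.groupby("Name")        # no aggregation has run yet
result = lazy["maths"].mean()    # split, apply, and combine happen here

print(manual)
print(result)
```

Both Series hold the same per-name means, which is why a single groupby() call followed by an aggregation can replace the manual loop.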
Example 2
import pandas as pd
dataset = {
'Name': ['Rohit', 'Arun', 'Sohit', 'Arun', 'Shubh'],
'Roll no': ['01', '02', '03', '04', '05'],
'maths': ['93', '63', '74', '94', '83'],
'science': ['88', '55', '66', '94', '35'],
'english': ['93', '74', '84', '92', '87']}
df = pd.DataFrame(dataset)
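# Group by a single column label so that each group key is a plain string
by_name = df.groupby('Name')
# Iterating yields (group name, sub-DataFrame) pairs
for name, group in by_name:
    print(f"First 2 entries for {name!r}")
    print("------------------------")
    print(group.head(2), end="\n\n")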
Output
First 2 entries for 'Arun'
------------------------
   Name Roll no maths science english
1  Arun      02    63      55      74
3  Arun      04    94      94      92

First 2 entries for 'Rohit'
------------------------
    Name Roll no maths science english
0  Rohit      01    93      88      93

First 2 entries for 'Shubh'
------------------------
    Name Roll no maths science english
4  Shubh      05    83      35      87

First 2 entries for 'Sohit'
------------------------
    Name Roll no maths science english
2  Sohit      03    74      66      84
If you’re working on a difficult aggregation problem, iterating over a pandas GroupBy object can be a helpful way to visualize the split step of split-apply-combine.
A few other methods and properties let you look into the individual groups and their splits. For example, the .groups attribute gives you a dictionary of {group name: index labels of the rows in that group} pairs.
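A short sketch of inspecting the groups follows. Besides .groups, it uses .get_group() and .ngroups, which the text above does not name. The data is assumed for illustration and keeps only two columns.

```python
import pandas as pd

# Sample data (an assumption for illustration): marks stored as integers.
df = pd.DataFrame({
    "Name": ["Rohit", "Arun", "Sohit", "Arun", "Shubh"],
    "maths": [93, 63, 74, 94, 83],
})
by_name = df.groupby("Name")

# .groups maps each group name to the index labels of its rows.
for name, labels in by_name.groups.items():
    print(name, list(labels))

# .get_group() returns the rows of a single group as a DataFrame.
arun = by_name.get_group("Arun")
print(arun)

# .ngroups reports the number of distinct groups.
print(by_name.ngroups)
```

For this data, the "Arun" group holds the rows with index labels 1 and 3, and there are four groups in total.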
That’s it.