unverified 7y, 7d ago

This is a very complicated use of Pandas! Aggregation and grouping in pandas are much easier than this article makes them look.

andre 7y, 6d ago

Some of the later use-cases are more complex, but it can't get much simpler than df.groupby('species').agg('mean')! :)
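For readers without the article's dataset at hand, here is a minimal sketch of that one-liner. The tiny frame below is made up for illustration and only borrows the iris column names; it is not the article's data.

```python
import pandas as pd

# Hypothetical stand-in for the iris data used in the article.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal width": [3.5, 3.0, 3.3, 2.7],
    "sepal length": [5.1, 4.9, 6.3, 5.8],
})

# One mean per species for every remaining (numeric) column.
# The group labels become the index of the result.
means = df.groupby("species").agg("mean")
print(means)
```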

KPUsiTIU 7y, 7d ago [edited]

I'm pretty sure there is an as_index=False parameter that you can pass to groupby to keep the group keys as regular columns.

andre 7y, 6d ago [edited]

There is indeed! However, when using multiple aggregation functions it still leads to indexed groups.
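A minimal sketch of the parameter discussed here, again on a tiny made-up frame rather than the article's data. With several aggregation functions the columns come back as a two-level MultiIndex; calling reset_index() afterwards is one way to get the group labels out of the index in that case too. Exact behaviour may differ between pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the iris data used in the article.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal width": [3.5, 3.0, 3.3, 2.7],
    "sepal length": [5.1, 4.9, 6.3, 5.8],
})

# Default: the group labels form the index of the result.
indexed = df.groupby("species").agg("mean")

# as_index=False: "species" stays an ordinary column.
flat = df.groupby("species", as_index=False).agg("mean")

# Several functions per column give two-level column labels;
# reset_index() moves the group labels out of the index.
multi = df.groupby("species").agg({"sepal width": ["min", "max"]}).reset_index()
```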

2yScKKE1 6y, 200d ago

Hi Andre, great article!

I get the following warning with the code in the "Single Grouping Column, Custom Aggregation" section.

gdf = df.groupby('species').agg({
    'sepal width' : {
        'width min': 'min',
        'width max': 'max'
    },
    'sepal length' : ['max', 'mean', percentile(20)]
})

This throws:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

I do have a workaround, though it is a bit clumsy.

def my_agg(df):
    x = df.groupby('species')
    names = {
        'width max': x['sepal width'].max(),
        'width min': x['sepal width'].min(),
        'max length': x['sepal length'].max(),
        'mean length': x['sepal length'].mean(),
        'length percentile': x['sepal length'].apply(percentile(20))
    }
    return pd.DataFrame(names)

Is it alright that way?

andre 6y, 200d ago [edited]

That's unfortunate, isn't it?

As far as your code goes, I would wager that doing aggregations manually will give you a performance penalty (which may not matter to you unless your dataset is huge) since Pandas tends to perform these operations using carefully optimized methods.

Another thing you can do is give your grouped DataFrame a dictionary of lists, e.g.,

df.groupby('species').agg({
    'sepal width': ['min', 'max'],
    'sepal length': ['min', 'max'],
})

and then rename the resulting DataFrame manually...
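The manual rename can be sketched roughly as follows, on a small made-up frame (not the article's data). The last part shows named aggregation, which pandas 0.25 and later offer as a replacement for the deprecated renaming dict; the output names chosen there are only examples.

```python
import pandas as pd

# Hypothetical stand-in for the iris data used in the article.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal width": [3.5, 3.0, 3.3, 2.7],
    "sepal length": [5.1, 4.9, 6.3, 5.8],
})

gdf = df.groupby("species").agg({
    "sepal width": ["min", "max"],
    "sepal length": ["min", "max"],
})

# The columns are two-level labels such as ("sepal width", "min");
# join the levels into flat names like "sepal width min".
gdf.columns = [" ".join(col) for col in gdf.columns]

# Alternative for pandas >= 0.25: named aggregation. Each keyword is an
# output name mapped to a (column, function) pair; the dict unpacking is
# only needed because these example names contain spaces.
named = df.groupby("species").agg(**{
    "width min": ("sepal width", "min"),
    "width max": ("sepal width", "max"),
})
```

Both variants stay on the built-in aggregation path instead of computing each column by hand.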