in pandas , i can run the code like:
day = ['day1','day2','day3','day4','day1','day2','day3','day1','day2']
code = ["a","a","a","a","b","b","b","c","c"]
price = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame({"date":day,"code":code,"price":price})
df['codeindex'] = df.groupby('code')['date'].transform(lambda x: range(0, len(x), 1))
in pandas i can generate such a column 'codeindex' using transform. how can i do that in py-polars?
thx a lot!
CodePudding user response:
If you need that exact functionality - you're doing the equivalent of .cumcount()
>>> df.groupby('code').cumcount()
0 0
1 1
2 2
3 3
4 0
5 1
6 2
7 0
8 1
dtype: int64
You can achieve the same result in polars with:
df.with_column(
df.groupby('code', maintain_order=True)
.agg(pl.col('date').cumcount())['date'].alias('codeindex').explode()
)
result:
shape: (9, 4)
┌──────┬──────┬───────┬───────────┐
│ date ┆ code ┆ price ┆ codeindex │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 │
╞══════╪══════╪═══════╪═══════════╡
│ day1 ┆ a ┆ 1 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ a ┆ 2 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ a ┆ 3 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day4 ┆ a ┆ 4 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ b ┆ 6 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ b ┆ 7 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day1 ┆ c ┆ 8 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ c ┆ 9 ┆ 1 │
└──────┴──────┴───────┴───────────┘
CodePudding user response:
Your dataframe is ordered by the code column. This means you can use a window expression.
A windows expression takes an aggregation, like col("date").cumcount() and applies that over a group defined by .over("code").
The code looks like this:
day = ['day1','day2','day3','day4','day1','day2','day3','day1','day2']
code = ["a","a","a","a","b","b","b","c","c"]
price = [1,2,3,4,5,6,7,8,9]
df = pl.DataFrame({"date":day,"code":code,"price":price})
(df.select([
pl.all(),
pl.col("date").cumcount().over("code").flatten().alias("codeindex"),
]))
outputs
shape: (9, 4)
┌──────┬──────┬───────┬───────────┐
│ date ┆ code ┆ price ┆ codeindex │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 │
╞══════╪══════╪═══════╪═══════════╡
│ day1 ┆ a ┆ 1 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ a ┆ 2 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ a ┆ 3 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day4 ┆ a ┆ 4 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ b ┆ 6 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day3 ┆ b ┆ 7 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day1 ┆ c ┆ 8 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ day2 ┆ c ┆ 9 ┆ 1 │
└──────┴──────┴───────┴───────────┘
Disect
Note that if we don't flatten we would get lists with cumcounts for every group. Like this
df.select([
pl.col("date").cumcount().over("code").alias("codeindex"),
]))
shape: (9, 1)
┌───────────────┐
│ codeindex │
│ --- │
│ list [u32] │
╞═══════════════╡
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1] │
└───────────────┘
Flatten in window functions.
In window functions a flatten takes the first value per group and explodes/flattens that. That result is stacked back to the original DataFrame.
For this output to make sense, you often want the DataFrame to be sorted by the columns you group by (in the over clause).
This is a gotcha to remember, but doing so allows us to make the window operations very fast.
