I working on a Python project that have a dataframe like this:
col = {
'ID': ['123.abc', '123.abc', '123.abc', '456.efg', '456.efg', '123.abc', '123.abc', '789.hij'],
'Content': ['AAA', 'BBB', 'CCC', 'BBB', 'DDD', 'AAA', 'CCC', 'AAA'],
'Visit sequence': ['1', '2', '3', '1', '2', '1', '2', '1'],
'Value': ['0', '0', '5.21', '0', '9.74', '0', '0', '3.83']}
df = pd.DataFrame(col)
which leads to the following data table:
| ID | Content | Visit sequence | Value |
|---|---|---|---|
| 123.abc | AAA | 1 | 0 |
| 123.abc | BBB | 2 | 0 |
| 123.abc | CCC | 3 | 5.21 |
| 456.efg | BBB | 1 | 0 |
| 456.efg | DDD | 2 | 9.74 |
| 123.abc | AAA | 1 | 0 |
| 123.abc | CCC | 2 | 0 |
| 789.hij | AAA | 1 | 3.83 |
The task consists of generating the following dataframe:
| ID | Journey | Value |
|---|---|---|
| 123.abc | AAA, BBB, CCC | 5 |
| 123.abc | AAA, CCC | 0 |
| 456.efg | BBB, DDD | 4 |
| 789.hij | AAA | 1 |
where the Journey column contains a list of Contents found until getting an amount in Value column. Is also considered the journey of contents that do not find amounts in the Value column, as you can see for the case of ID 123.abc.
Could you help me by suggesting a code that solves this task? I kind of got a solution for this task, but since I used a lot of loop functions, I can't apply the algorithm to a larger database because the time processing is too high.
CodePudding user response:
Groupoing by 'ID' and by the cumsum of the non-0 'Value's does the trick:
df2 = (df.groupby([df['ID'], ((df['Value']!='0').shift().fillna(0)).cumsum()], as_index=False)
.agg({'Content' : ','.join, 'Value':'last'})
)
output:
ID Content Value
-- ------- ----------- -------
0 123.abc AAA,BBB,CCC 5.21
1 123.abc AAA,CCC 0
2 456.efg BBB,DDD 9.74
3 789.hij AAA 3.83
CodePudding user response:
To be honest, the value column is not clear to me, but I'll provide my solution, where only the final calculation for this column is missing/not correct. Maybe it's something simple that you can easily adapt:
i = df.ID
df["group"] = i.ne(i.shift()).cumsum()
df2 = df[["ID", "group","Content", "Value"]].groupby(["ID", "group"]). \
agg({'Content' : ','.join,
"Value":"last"}).reset_index()
df2 = df2.rename(columns={"Content":"Journey"}).drop("group", axis=1)
Output:
ID Journey Value
0 123.abc AAA,BBB,CCC 5.21
1 123.abc AAA,CCC 0
2 456.efg BBB,DDD 9.74
3 789.hij AAA 3.83
