How to create a pandas dataframe that contains ordered lists based on analysis conditions applied on-CodePudding

I working on a Python project that have a dataframe like this:

col = {
        'ID': ['123.abc', '123.abc', '123.abc', '456.efg', '456.efg', '123.abc', '123.abc', '789.hij'],
        'Content': ['AAA', 'BBB', 'CCC', 'BBB', 'DDD', 'AAA', 'CCC', 'AAA'],
        'Visit sequence': ['1', '2', '3', '1', '2', '1', '2', '1'],
        'Value': ['0', '0', '5.21', '0', '9.74', '0', '0', '3.83']}
df = pd.DataFrame(col)

which leads to the following data table:

ID	Content	Visit sequence	Value
123.abc	AAA	1	0
123.abc	BBB	2	0
123.abc	CCC	3	5.21
456.efg	BBB	1	0
456.efg	DDD	2	9.74
123.abc	AAA	1	0
123.abc	CCC	2	0
789.hij	AAA	1	3.83

The task consists of generating the following dataframe:

ID	Journey	Value
123.abc	AAA, BBB, CCC	5
123.abc	AAA, CCC	0
456.efg	BBB, DDD	4
789.hij	AAA	1

where the Journey column contains a list of Contents found until getting an amount in Value column. Is also considered the journey of contents that do not find amounts in the Value column, as you can see for the case of ID 123.abc.

Could you help me by suggesting a code that solves this task? I kind of got a solution for this task, but since I used a lot of loop functions, I can't apply the algorithm to a larger database because the time processing is too high.

CodePudding user response：

Groupoing by 'ID' and by the cumsum of the non-0 'Value's does the trick:

df2 = (df.groupby([df['ID'], ((df['Value']!='0').shift().fillna(0)).cumsum()], as_index=False)
    .agg({'Content' : ','.join, 'Value':'last'})
)

output:

    ID       Content        Value
--  -------  -----------  -------
 0  123.abc  AAA,BBB,CCC     5.21
 1  123.abc  AAA,CCC         0
 2  456.efg  BBB,DDD         9.74
 3  789.hij  AAA             3.83

CodePudding user response：

To be honest, the value column is not clear to me, but I'll provide my solution, where only the final calculation for this column is missing/not correct. Maybe it's something simple that you can easily adapt:

i = df.ID    
df["group"] = i.ne(i.shift()).cumsum()
df2 = df[["ID", "group","Content", "Value"]].groupby(["ID", "group"]). \
    agg({'Content' : ','.join, 
         "Value":"last"}).reset_index()

df2 = df2.rename(columns={"Content":"Journey"}).drop("group", axis=1)

Output:

    ID              Journey         Value
0   123.abc         AAA,BBB,CCC     5.21
1   123.abc         AAA,CCC         0
2   456.efg         BBB,DDD         9.74
3   789.hij         AAA             3.83