Home > OS >  How to create a pandas dataframe that contains ordered lists based on analysis conditions applied on
How to create a pandas dataframe that contains ordered lists based on analysis conditions applied on

Time:01-26

I working on a Python project that have a dataframe like this:

col = {
        'ID': ['123.abc', '123.abc', '123.abc', '456.efg', '456.efg', '123.abc', '123.abc', '789.hij'],
        'Content': ['AAA', 'BBB', 'CCC', 'BBB', 'DDD', 'AAA', 'CCC', 'AAA'],
        'Visit sequence': ['1', '2', '3', '1', '2', '1', '2', '1'],
        'Value': ['0', '0', '5.21', '0', '9.74', '0', '0', '3.83']}
df = pd.DataFrame(col)

which leads to the following data table:

ID Content Visit sequence Value
123.abc AAA 1 0
123.abc BBB 2 0
123.abc CCC 3 5.21
456.efg BBB 1 0
456.efg DDD 2 9.74
123.abc AAA 1 0
123.abc CCC 2 0
789.hij AAA 1 3.83

The task consists of generating the following dataframe:

ID Journey Value
123.abc AAA, BBB, CCC 5
123.abc AAA, CCC 0
456.efg BBB, DDD 4
789.hij AAA 1

where the Journey column contains a list of Contents found until getting an amount in Value column. Is also considered the journey of contents that do not find amounts in the Value column, as you can see for the case of ID 123.abc.

Could you help me by suggesting a code that solves this task? I kind of got a solution for this task, but since I used a lot of loop functions, I can't apply the algorithm to a larger database because the time processing is too high.

CodePudding user response:

Groupoing by 'ID' and by the cumsum of the non-0 'Value's does the trick:

df2 = (df.groupby([df['ID'], ((df['Value']!='0').shift().fillna(0)).cumsum()], as_index=False)
    .agg({'Content' : ','.join, 'Value':'last'})
)

output:

    ID       Content        Value
--  -------  -----------  -------
 0  123.abc  AAA,BBB,CCC     5.21
 1  123.abc  AAA,CCC         0
 2  456.efg  BBB,DDD         9.74
 3  789.hij  AAA             3.83

CodePudding user response:

To be honest, the value column is not clear to me, but I'll provide my solution, where only the final calculation for this column is missing/not correct. Maybe it's something simple that you can easily adapt:

i = df.ID    
df["group"] = i.ne(i.shift()).cumsum()
df2 = df[["ID", "group","Content", "Value"]].groupby(["ID", "group"]). \
    agg({'Content' : ','.join, 
         "Value":"last"}).reset_index()

df2 = df2.rename(columns={"Content":"Journey"}).drop("group", axis=1)

Output:

    ID              Journey         Value
0   123.abc         AAA,BBB,CCC     5.21
1   123.abc         AAA,CCC         0
2   456.efg         BBB,DDD         9.74
3   789.hij         AAA             3.83
  •  Tags:  
  • Related