Split Pandas Dataframe Using Criteria-CodePudding

I have time series data with a column that sums up seconds that something is running. All numbers are divisible by 30s but sometimes it does skip numbers (may jump from 30 to 90). This column can reset along as it is running, setting the start count back to 30s. How would I break up every chunk of runtime.

For example: If numbers in the column are 30, 60, 120, 150, 30, 60, 90, 30, 60, how would I break apart the dataframe into the full sequences with no resets.

30, 60, 120, 150 in 1 dataframe and 30, 60, 90 in the next and 30, 60 in the last? At the end, I need to take the max of each dataframe and add them together (that part I could figure out).

CodePudding user response：

Not very elegant but it get's the job done.

Loop through an array. Add array to a list when a number is smaller than the one before. Remove the saved array from the list and repeat until no change is detected.

import numpy as np
a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []

def split(a,y):
    for count,val in enumerate(a):
        if count == 0:
            pass
        elif val < a[count-1]:
            y.append(a[:count])
            a =  a[count:]
            if len(a)> 0 and sorted(a) != list(a):
                split(a,y)
            else:
                y.append(a)
                a = []
                return(y)
        
            return(y)

y = split(a,y)
print(y)
>>[array([ 30,  60, 120, 150]), array([30, 60, 90]), array([30, 60])]
print([max(lis) for lis in y])
>>[150,90,60]

This will not consider 30 as a starting point but the samllest number after the reset.

Or using diff to find the changepoints.

import numpy as np

a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])

y = []

def split(a,y):
    
    a_diff = np.asarray(np.where(np.diff(a)<0))[0]
    
    while len(a_diff)>1:
        
        a_diff = np.asarray(np.where(np.diff(a)<0))[0] 
        
        y.append(a[:a_diff[0] 1])
        
        a = a[a_diff[0] 1:]
    
    y.append(a)
    
    return(y)    

y = split(a,y)

print(y)

print([max(lis) for lis in y])

>>[array([ 30,  60, 120, 150]), array([30, 60, 90]), array([30, 60])]

>>[150, 90, 60]

And a version with pandas and DataFrames

import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
y = []
def split(df,y):
    a = df['data']
    a_diff = [count for count,val in enumerate(a.diff()[1:])  if val < 0 ]
    while len(a_diff)>1:
        a_diff = [count for count,val in enumerate(a.diff()[1:])  if val < 0 ]
        y.append(a[:a_diff[0] 1])
        a = a[a_diff[0] 1:]
    y.append(a)
    return(y)  

y = split(df,y)
print(y)
print([max(lis) for lis in y])

CodePudding user response：

Using @RSale's input:

import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})

d = dict(tuple(df.groupby(df['data'].eq(30).cumsum())))

d is a dictionary of three dataframes:

d[1]:

d[2]:

And d[3}:

   data
7    30
8    60