How to consolidate/divide rows within a data frame based on a value within a certain column using pandas

The starting df is as follows:

company  metric  time   data
x        X10384  M1     100
x        X10384  M2     100
x        X10384  M3     100
y        X10456  M4     200
y        X10456  M5     200
y        X10456  M6     200

I need to be able to consolidate these rows based on the value of the time column. Basically, "M1, M2, M3" make up Q1, "M4, M5, M6" make up Q2, and so on.

The resulting df will need to be as follows:

company  metric  time   data
x        X10384  Q1     300
y        X10456  Q2     600

Similarly, if starting with a df already in quarters, I will need to be able to split the time into months and split the data into three equal parts, as in the initial df.

How would one go about transforming this data as above? Below is my starting point for reference:

quarters = ['Q1', 'Q2']
months = ['M1','M2','M3','M4','M5','M6']

for time in df['time']:
    if time in quarters:
        pass  # insert transformation into individual months
    elif time in months:
        pass  # insert transformation into quarters

CodePudding user response：

Extract the digits from the time column and convert them to a quarter number. A simple groupby sum then does the job:

# Convert M1, M2, M3, M4, M5, M6 to Q1, Q1, Q1, Q2, Q2, Q2
# (m - 1) // 3 + 1 maps month m to its quarter for every month M1..M12
to_quarter = df['time'].str[1:].astype(int).sub(1).floordiv(3).add(1).astype(str).radd('Q')

out = df.assign(time=to_quarter).groupby(['company', 'metric', 'time']) \
                                .sum().reset_index()

Output:

>>> out
  company  metric time  data
0       x  X10384   Q1   300
1       y  X10456   Q2   600
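
For data covering a full year, month number m belongs to quarter (m - 1) // 3 + 1. Below is a minimal self-contained sketch of the same idea, assuming the time labels always have the form 'M' followed by a number from 1 to 12. The helper name months_to_quarters is made up for illustration and is not part of pandas.

```python
import pandas as pd

def months_to_quarters(df):
    """Aggregate monthly rows ('M1'..'M12') into quarterly rows ('Q1'..'Q4')."""
    # Strip the leading 'M' and parse the month number.
    month_num = df['time'].str[1:].astype(int)
    # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4.
    quarter = 'Q' + ((month_num - 1) // 3 + 1).astype(str)
    return (df.assign(time=quarter)
              .groupby(['company', 'metric', 'time'], as_index=False)['data']
              .sum())

monthly = pd.DataFrame({
    'company': ['x'] * 3 + ['y'] * 3,
    'metric': ['X10384'] * 3 + ['X10456'] * 3,
    'time': ['M1', 'M2', 'M3', 'M4', 'M5', 'M6'],
    'data': [100, 100, 100, 200, 200, 200],
})
quarterly = months_to_quarters(monthly)
print(quarterly)
```

Selecting ['data'] before summing keeps the aggregation limited to the numeric column, so any extra text columns would not be concatenated by accident.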

CodePudding user response：

Creating a dataframe based on your data:

import pandas as pd

data = {'Company' : ['x', 'x', 'x', 'y', 'y', 'y'],
        'Metric' : ['X10384', 'X10384', 'X10384', 'X10456', 'X10456', 'X10456'],
        'time': ['M1', 'M2', 'M3', 'M4', 'M5', 'M6'],
        'data': [100, 100, 100, 200, 200, 200]}
df = pd.DataFrame(data)

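Then create a dictionary that maps each month to its quarter and apply it to the time column:

# Named month_to_quarter rather than dict, to avoid shadowing the built-in
month_to_quarter = {'M1': 'Q1', 'M2': 'Q1', 'M3': 'Q1', 'M4': 'Q2', 'M5': 'Q2', 'M6': 'Q2'}
df['time'] = df['time'].map(month_to_quarter)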

And groupby will give you the final result:

df.groupby(['Company','Metric','time']).sum().reset_index()
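
The question also asks about the reverse direction: splitting quarterly rows into three monthly rows with the data divided equally. One possible sketch follows, assuming the quarter labels have the form 'Q' followed by a number from 1 to 4 and that the columns are named time and data as in the question. The helper name quarters_to_months is hypothetical.

```python
import numpy as np
import pandas as pd

def quarters_to_months(df):
    """Split each quarterly row ('Q1'..'Q4') into three equal monthly rows."""
    df = df.reset_index(drop=True)
    quarter_num = df['time'].str[1:].astype(int).to_numpy()
    # Repeat every row three times, once per month of the quarter.
    out = df.loc[df.index.repeat(3)].reset_index(drop=True)
    # Position of each repeated row inside its quarter: 1, 2, 3, 1, 2, 3, ...
    position = np.tile([1, 2, 3], len(df))
    # Q1 -> months 1..3, Q2 -> months 4..6, and so on.
    month_num = (np.repeat(quarter_num, 3) - 1) * 3 + position
    out['time'] = [f'M{m}' for m in month_num]
    # Divide the quarterly value equally across the three months.
    out['data'] = out['data'] / 3
    return out

quarterly = pd.DataFrame({
    'company': ['x', 'y'],
    'metric': ['X10384', 'X10456'],
    'time': ['Q1', 'Q2'],
    'data': [300, 600],
})
monthly = quarters_to_months(quarterly)
print(monthly)
```

Note that the division turns data into floats, and values that are not divisible by three will not split into whole numbers.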