I have a four-column data frame, given as follows: Column zero consists of text labels chosen from a list ['A','B','C','D'] with possible repetitions. Columns one-two are labelled, start and stop, where the former is less than the latter, and column three, intensity, is a float. For each label, none of the corresponding intervals formed using [start,stop] overlap.
A simple example is given by:
import numpy as np
import pandas as pd
labels=['A','B','C','D']
d = {'label': ['A','B','A','C','D','B','A'],'start': [1, 2,6,4,1,8,12], 'stop':
[4,4,9,6,7,11,16],'intensity':[8,2,4,6,7,1,5]}
df = pd.DataFrame(data=d)
print(df)
label start stop intensity
0 A 1 4 8
1 B 2 4 2
2 A 6 9 4
3 C 4 6 6
4 D 1 7 7
5 B 8 11 1
6 A 12 16 5
I wish to create a matrix, M, having four (=len(labels)) rows and 16 columns. (The number of columns must be at least the maximum entry in df['stop']. Whether it's larger doesn't matter). For each integer k between 0 and 6, the index of df['label'][k] in labels specifies a row of my matrix M. The entries in columns d[start][k] to d[stop][k] of this row should all equal d['intensity'][k]. All other entries of M equal zero.
For example, label A corresponds to rows 0, 2, and 6. In row 0, entries in columns 1-4 equal 8, entries in columns 6-9 equal 4, and entries in columns 12-16 equal 5.
I'd like to do this in the most pythonic way using list operations and at most one loop.
CodePudding user response:
Here's a solution:
MAX = df['stop'].max()
new_df = pd.DataFrame(df.groupby('label').apply(lambda g: sum(g.apply(lambda x: np.isin(np.arange(MAX), np.arange(x['start']-1, x['stop'])).astype(int)*x['intensity'], axis=1))).tolist(), index=labels)
Output:
>>> new_df
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A 8 8 8 8 0 4 4 4 4 0 0 5 5 5 5 5
B 0 2 2 2 0 0 0 1 1 1 1 0 0 0 0 0
C 0 0 0 6 6 6 0 0 0 0 0 0 0 0 0 0
D 7 7 7 7 7 7 7 0 0 0 0 0 0 0 0 0
CodePudding user response:
another way using explode
df['range'] = df.apply(lambda r: list(range(r['start'], r['stop'] 1)), axis=1)
df.explode('range').set_index(['label', 'range'])[['intensity']].unstack()
