Home > Software engineering >  How to represent as a matrix a 4-column dataframe, where column 0 specifies row, columns 1-2 specify
How to represent as a matrix a 4-column dataframe, where column 0 specifies row, columns 1-2 specify

Time:02-10

I have a four-column data frame, given as follows: Column zero consists of text labels chosen from a list ['A','B','C','D'] with possible repetitions. Columns one-two are labelled, start and stop, where the former is less than the latter, and column three, intensity, is a float. For each label, none of the corresponding intervals formed using [start,stop] overlap.

A simple example is given by:

import numpy as np
import pandas as pd
labels=['A','B','C','D']
d = {'label': ['A','B','A','C','D','B','A'],'start': [1, 2,6,4,1,8,12], 'stop': 
[4,4,9,6,7,11,16],'intensity':[8,2,4,6,7,1,5]}
df = pd.DataFrame(data=d)
print(df)

  label  start  stop  intensity
0     A      1     4          8
1     B      2     4          2
2     A      6     9          4
3     C      4     6          6
4     D      1     7          7
5     B      8    11          1
6     A     12    16          5

I wish to create a matrix, M, having four (=len(labels)) rows and 16 columns. (The number of columns must be at least the maximum entry in df['stop']. Whether it's larger doesn't matter). For each integer k between 0 and 6, the index of df['label'][k] in labels specifies a row of my matrix M. The entries in columns d[start][k] to d[stop][k] of this row should all equal d['intensity'][k]. All other entries of M equal zero.

For example, label A corresponds to rows 0, 2, and 6. In row 0, entries in columns 1-4 equal 8, entries in columns 6-9 equal 4, and entries in columns 12-16 equal 5.

I'd like to do this in the most pythonic way using list operations and at most one loop.

CodePudding user response:

Here's a solution:

MAX = df['stop'].max()
new_df = pd.DataFrame(df.groupby('label').apply(lambda g: sum(g.apply(lambda x: np.isin(np.arange(MAX), np.arange(x['start']-1, x['stop'])).astype(int)*x['intensity'], axis=1))).tolist(), index=labels)

Output:

>>> new_df
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
A   8   8   8   8   0   4   4   4   4   0   0   5   5   5   5   5
B   0   2   2   2   0   0   0   1   1   1   1   0   0   0   0   0
C   0   0   0   6   6   6   0   0   0   0   0   0   0   0   0   0
D   7   7   7   7   7   7   7   0   0   0   0   0   0   0   0   0

CodePudding user response:

another way using explode

df['range'] = df.apply(lambda r: list(range(r['start'], r['stop'] 1)), axis=1)
df.explode('range').set_index(['label', 'range'])[['intensity']].unstack()
  •  Tags:  
  • Related