I have a table like so after performing some data scraping on a pdf:
index colA colB colC colD colE colF colG
-------------------------------------------------------------------
1 ABCD veryLongTextThatShouldNotCutOff 12 x x x x
2 ABCD veryLongText 14 x x x x
4 NaN ThatShouldNotCutOff NaN NaN NaN NaN NaN
5 ABCD veryLongTextThatShouldNotCutOff 12 x x x x
6 ABCD veryLongTextThatShouldNotCutOff x 19 x x x
7 ABCD veryLongText 21 x x x x
9 NaN ThatShouldNotCutOff NaN NaN NaN NaN NaN
10 ABCD veryLongTextThatShouldNotCutOff 13 x x x x
...
I need to clean the data up so that information in rows with index 2 and 4 are merged together like so, and with the indexes in the right order again.
index colA colB colC colD colE colF colG
-------------------------------------------------------------------
1 ABCD veryLongTextThatShouldNotCutOff 12 x x x x
2 ABCD veryLongTextThatShouldNotCutOff 14 x x x x
3 ABCD veryLongTextThatShouldNotCutOff 12 x x x x
4 ABCD veryLongTextThatShouldNotCutOff x 19 x x x
5 ABCD veryLongTextThatShouldNotCutOff 21 x x x x
6 ABCD veryLongTextThatShouldNotCutOff 13 x x x x
...
Essentially, some rows with colA as NaN needs to be "concatenated" with the previous rows.
How do I do it using pandas?
CodePudding user response:
You could use the non NaN values in colA to set up a group and merge the colB. Then drop the NaN rows:
group = df['colA'].notna().cumsum()
df['colB'] = df.groupby(group)['colB'].transform('sum')
df = df.dropna(subset=['colA'])
output:
index colA colB colC colD colE colF colG
0 1 ABCD veryLongTextThatShouldNotCutOff 12.0 x x x x
1 2 ABCD veryLongTextThatShouldNotCutOff 14.0 x x x x
3 5 ABCD veryLongTextThatShouldNotCutOff 12.0 x x x x
