I want to create new columns based on a given column value and merge the rows with same subject names within a given date, for example for a given pandas dataframe
| subject | questionid | answer | datetime |
|---|---|---|---|
| First | q1 | fa1 | 2021-02-27T |
| First | q2 | fa2 | 2021-02-27T |
| First | q3 | fa3 | 2021-02-27T |
| First | q4 | fa4 | 2021-02-28T |
| Second | q1 | sa1 | 2021-02-27T |
| Second | q2 | 2021-02-27T | |
| Second | q3 | sa3 | 2021-02-27T |
| Second | q4 | 2021-02-27T |
I want my final dataframe to contain columns based on questionid where each row contains data in each date range i.e. within 24 hours of datetime for a single subject.
**
So, the final output should look like this,
**
| subject | q1 | q2 | q3 | q4 | datetime |
|---|---|---|---|---|---|
| First | fa1 | fa2 | fa3 | nan | 2021-02-27T |
| First | nan | nan | nan | fa4 | 2021-02-28T |
| Second | sa1 | nan | sa3 | nan | 2021-02-27T |
Notice that the first subject has two rows because of different dates the data were collected on.
I was able to create columns based on the questionid in a new dataframe by using
df.pivot(columns='questionid', values='answer')
But the final dataframe looks like this,
| subject | questionid | answer | datetime | q1 | q2 | q3 | q4 |
|---|---|---|---|---|---|---|---|
| First | q1 | fa1 | 2021-02-27T | fa1 | nan | nan | nan |
| First | q2 | fa2 | 2021-02-27T | nan | fa2 | nan | nan |
| First | q3 | fa3 | 2021-02-27T | nan | nan | fa3 | nan |
| First | q4 | fa4 | 2021-02-28T | nan | nan | nan | fa4 |
| Second | q1 | sa1 | 2021-02-27T | sa1 | nan | nan | nan |
| Second | q2 | 2021-02-27T | nan | nan | nan | nan | |
| Second | q3 | sa3 | 2021-02-27T | nan | nan | sa3 | nan |
| Second | q4 | 2021-02-27T | nan | nan | nan | nan |
Is there an easier way without looping through the rows to create the desired output. Again
CodePudding user response:
You're almost there, use both ['subject', 'datetime'] as index in pivot:
(df.pivot(index=['subject', 'datetime'], columns='questionid', values='answer')
.reset_index()
)
output:
subject datetime q1 q2 q3 q4
0 First 2021-02-27T fa1 fa2 fa3 NaN
1 First 2021-02-28T NaN NaN NaN fa4
2 Second 2021-02-27T sa1 NaN sa3 NaN
CodePudding user response:
I think that should do it:
pd.pivot_table(index=['subject', 'datetime'], values='answer', columns='questionid')
