My initial pandas dataframe looks as follows:
df = pd.DataFrame(np.array([[999,888,1,0], [988,899,2,0], [981,821,3,0],[721,789,4,1],[723,745,5,1],[444,432,6,2],[423,412,7,2],[423,455,8,2],[478,432,9,2]]),
columns=['a', 'b', 'c', 'id'])
>>>df
a b c id
999 888 1 0
988 899 2 0
981 821 3 0
721 789 4 1
723 745 5 1
444 432 6 2
423 412 7 2
423 455 8 2
478 432 9 2
Now I want to create a new dataframe with the following structure:
0 1 2 3 4 5
999 988 888 899 1 2
721 723 789 745 4 5
444 423 432 412 6 7
So in fact I just want to get the first two values of each id and each column (a,b,c). Example: first row -> col a: 999, 988 col b: 888, 899 col c: 1, 2
My current approach looks as follows but it seems quite slow and inconvenient:
import pandas as pd
list_to_append_a = []
list_to_append_b = []
for label_id in range(3):
df_new = df.loc[df['id'] == label_id]
a = df_new['a'].head(2).values
b = df_new['b'].head(2).values
list_to_append_a.append(a)
list_to_append_b.append(b)
df_new_a = pd.DataFrame(list_to_append_a)
df_new_b = pd.DataFrame(list_to_append_b)
df_final= pd.concat([df_new_a, df_new_b], axis=1, ignore_index=True)
CodePudding user response:
This could be achived by the following:
pd.DataFrame(df.groupby("id").head(2).set_index('id').values.flatten().reshape((3,4)))
CodePudding user response:
Compute value count of id to get the size (number of unique id, here 3) and the smallest count (here 2 for id=1). Group by id and keep the first count row for each group using head.
Then, use melt to flat your dataframe and sort it by id. Finally, reshape your dataframe according the size variable and create a new dataframe:
size, count = df.value_counts('id').agg(['size', 'min'])
data = df.groupby('id').head(count).melt('id').sort_values('id')['value']
out = pd.DataFrame(data.values.reshape((size, -1)))
print(out)
# Output:
0 1 2 3
0 999 988 888 899
1 721 723 789 745
2 444 423 432 412
