I have the following issue. I have a data frame like this:
| ID | feature |
|---|---|
| Person_1 | 18 |
| Person_1 | 19 |
| Person_1 | 23 |
| Person_1 | 59 |
| Person_2 | 11 |
| Person_2 | 23 |
| Person_2 | 59 |
| Person_3 | 11 |
| Person_3 | 18 |
| Person_3 | 1001 |
| Person_3 | 1239 |
| Person_4 | 23 |
| Person_4 | 6531 |
| Person_4 | 19843 |
| Person_4 | 200012 |
| …… | |
| Person_60 | …. |
Each feature is in its own row. I also have a list of all the features a person could have:
| features |
|---|
| 11 |
| 18 |
| 19 |
| 23 |
| 59 |
| 1001 |
| 1239 |
| 6531 |
| 19843 |
| 200012 |
I need the output to look like this:
| ID | 11 | 18 | 19 | 23 | 59 | 1001 | 1239 | 6531 | 19843 | 200012 |
|---|---|---|---|---|---|---|---|---|---|---|
| Person_1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Person_2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Person_3 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Person_4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
Each person should get a single row, with a 1 or 0 in each column showing whether they have that feature from the list.
I've tried something like this, but it's not even close.
    for i in pd.DataFrame[~ df.duplicated(subset=['id'])]:
        for Feature in feature_list:
            if feature_list in df['feature'].unique():
                print('1')
            else:
                print('0')
I'm a bit lost. Could you help me with how to approach this problem?
Thank you very much
CodePudding user response:
There are a number of ways you could do this. Here's one way.
Starting with
    import pandas as pd

    df = pd.DataFrame([
        ["Person_1", 1],
        ["Person_1", 2],
        ["Person_2", 1],
        ["Person_3", 3],
    ], columns=["ID", "feature"])
which looks like
             ID  feature
    0  Person_1        1
    1  Person_1        2
    2  Person_2        1
    3  Person_3        3
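you can use a groupby and unstack:

    # size() counts the rows for each (ID, feature) pair; unstack() then pivots
    # the feature values into columns, filling pairs that never occur with 0.
    df = df.groupby(["ID", "feature"]).size().unstack(fill_value=0).reset_index()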
which yields
    feature        ID  1  2  3
    0        Person_1  1  1  0
    1        Person_2  1  0  0
    2        Person_3  0  0  1
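
The question also mentions a fixed list of possible features. As a rough sketch (not necessarily what the answer's author had in mind), the groupby/unstack result can be reindexed against that list so that every listed feature gets a column even if nobody has it, and clipped so that a repeated (ID, feature) pair still shows as 1. The names `feature_list` and `indicator` are chosen here for illustration, and the sample rows are the four people shown in the question.

```python
import pandas as pd

# Sample rows copied from the question (Person_1 to Person_4 only).
df = pd.DataFrame(
    {
        "ID": ["Person_1"] * 4 + ["Person_2"] * 3 + ["Person_3"] * 4 + ["Person_4"] * 4,
        "feature": [18, 19, 23, 59, 11, 23, 59, 11, 18, 1001, 1239,
                    23, 6531, 19843, 200012],
    }
)

# The full list of possible features from the question.
feature_list = [11, 18, 19, 23, 59, 1001, 1239, 6531, 19843, 200012]

indicator = (
    df.groupby(["ID", "feature"])
    .size()                                        # rows per (ID, feature) pair
    .unstack(fill_value=0)                         # features become columns
    .reindex(columns=feature_list, fill_value=0)   # one column per listed feature
    .clip(upper=1)                                 # repeated pairs still count as 1
)

print(indicator)
```

With this sample every listed feature occurs at least once, so `reindex` only fixes the column order. It matters when a feature in the list is absent from the data, in which case it adds an all-zero column.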

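Another of the possible ways is `pd.crosstab`, which builds the same count table in one call. The sketch below reuses the small example frame from this answer. The `flags` name and the `> 0` comparison are additions for illustration, so that duplicate rows would still produce a 0/1 indicator rather than a count.

```python
import pandas as pd

df = pd.DataFrame(
    [["Person_1", 1], ["Person_1", 2], ["Person_2", 1], ["Person_3", 3]],
    columns=["ID", "feature"],
)

# Cross-tabulate IDs against features: each cell is the number of matching rows.
counts = pd.crosstab(df["ID"], df["feature"])

# Turn counts into 0/1 flags (only differs from counts if a pair is repeated).
flags = (counts > 0).astype(int)

print(flags)
```

Here `ID` stays as the index; calling `flags.reset_index()` would move it back into a regular column, as in the groupby version.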