Create a new df in pandas from existing one?-CodePudding

I am doing a project on movies data.
The sample dataset looks like :

The column genres have 21 unique values.

I want to create a new table/dataframe so that the table will contains the average ratings for the each genres for every user, like

I got the list of genres using the code below:

def split(sent):
    return (sent.split())

new_genres=set()

for i in range(len(genres)):
    a=split(genres[i])
    for g in a:
        new_genres.add(g)

new_genres

CodePudding user response：

Setup:

In [905]: df = pd.DataFrame({'userID':[1,2,3,3,2], 'id':[110, 147, 858, 1246, 1968], 'rating':[1.0, 4.5, 5.0, 5.0, 4.0], 'genres':['Drama Mystery Romance', 'Drama', 'Comedy Drama Romance', 'Drama', 'Drama Comedy Romance']}
     ...: )

In [906]: df
Out[906]: 
   userID    id  rating                 genres
0       1   110     1.0  Drama Mystery Romance
1       2   147     4.5                  Drama
2       3   858     5.0   Comedy Drama Romance
3       3  1246     5.0                  Drama
4       2  1968     4.0   Drama Comedy Romance

Use Series.str.split, df.explode with df.pivot_table:

In [907]: df['genres'] = df['genres'].str.split()
In [910]: res = df.explode('genres').pivot_table(index='userID', columns='genres', values='rating', aggfunc='mean').fillna(0, downcast='infer')

In [911]: res
Out[911]: 
genres  Comedy  Drama  Mystery  Romance
userID                                 
1            0   1.00        1        1
2            4   4.25        0        4
3            5   5.00        0        5

CodePudding user response：

We can start by using the assign method to get each genre in rows like so :

>>> df = df.assign(genre=df['genres'].str.split(' ')).explode('genre')
>>> df
    userId  id      rating  genres                  genre
0   1       110     1.0     Drama Mystery Romance   Drama
0   1       110     1.0     Drama Mystery Romance   Mystery
0   1       110     1.0     Drama Mystery Romance   Romance
1   1       147     4.5     Drama                   Drama
2   1       858     5.0     Comedy Drama Romance    Comedy
2   1       858     5.0     Comedy Drama Romance    Drama
2   1       858     5.0     Comedy Drama Romance    Romance
3   1       1246    5.0     Drama                   Drama
4   1       1968    4.0     Drama Comedy Romance    Drama
4   1       1968    4.0     Drama Comedy Romance    Comedy
4   1       1968    4.0     Drama Comedy Romance    Romance
5   270896  48780   5.0     Forein                  Forein
6   270896  49530   4.0     Action Thriller Scifi   Action
6   270896  49530   4.0     Action Thriller Scifi   Thriller
6   270896  49530   4.0     Action Thriller Scifi   Scifi
7   270896  54001   4.0     Drama                   Drama
8   270896  54503   4.0     Action Forein           Action
8   270896  54503   4.0     Action Forein           Forein
9   270896  58559   5.0     Drama                   Drama

Then we groupby userId and genre to get the mean of rating:

>>> df_grouped = df.groupby(['userId', 'genre'])['rating'].mean()
>>> df_grouped
userId  genre   
1       Comedy      4.500000
        Drama       3.900000
        Mystery     1.000000
        Romance     3.333333
270896  Action      4.000000
        Drama       4.500000
        Forein      4.500000
        Scifi       4.000000
        Thriller    4.000000
Name: rating, dtype: float64

To finish, we can use unstack to get the expected result :

>>> df_grouped.unstack(level=-1).fillna(0)
genre   Action  Comedy  Drama   Forein  Mystery     Romance     Scifi   Thriller
userId                              
1       0.0     4.5     3.9     0.0     1.0         3.333333    0.0     0.0
270896  4.0     0.0     4.5     4.5     0.0         0.000000    4.0     4.0