I am working on the MovieLens dataset using Python/pandas, which I am new to, so please bear with me. I was asked to keep only the movies with at least 5 reviews. The code is supposed to count the occurrences of each movie's ID and, if there are fewer than 5, get rid of the rows with that movie's ID. I have written the code below, but it changes nothing in the final ratings dataset and I don't know why.
import pandas as pd

ratings = pd.read_csv(ratings_path)
ratings_final = ratings.copy()

# Count how many times each movieId occurs
counts = dict()
for i in ratings.index:
    if ratings.loc[i, 'movieId'] not in counts:
        counts[ratings.loc[i, 'movieId']] = 1
    else:
        counts[ratings.loc[i, 'movieId']] = counts[ratings.loc[i, 'movieId']] + 1

# Drop every row whose movie has fewer than 5 ratings
for i in ratings.index:
    if counts[ratings.loc[i, 'movieId']] < 5:
        print("I'm here", ratings.loc[i, 'movieId'])
        ratings_final = ratings_final.drop([i])
However, the following code works:
ratings_final=ratings.drop([10, 12, 100833])
print(ratings_final)
Is there something wrong with my loop or is my thinking completely wrong? What should I do to solve this? Thanks!
CodePudding user response:
I would use groupby()
ratings_final = ratings.groupby('movieId')['userId'].nunique().reset_index(name="count")
ratings_final = ratings_final[ratings_final["count"] >= 5]
Here we group by movieId, which collects the rows for each movie into one group. We then need an operation to apply to each group. We want the movies that got at least 5 ratings. I assumed ratings meant different users, so we select the userId column and count the number of unique values in it.
From there we just select the movieIds that have at least 5 unique userIds.
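The aggregated frame above has one row per movie, so it no longer contains the individual ratings. Here is a minimal sketch of one way to map those counts back onto the original rows. It assumes MovieLens-style columns `userId`, `movieId` and `rating`, and uses a small made-up frame in place of the real CSV; the names `users_per_movie` and `popular_ids` are just for illustration.

```python
import pandas as pd

# Toy data standing in for the MovieLens ratings file (values are made up).
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 5, 1, 2, 1],
    "movieId": [10, 10, 10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.0],
})

# Number of distinct users who rated each movie (Series indexed by movieId).
users_per_movie = ratings.groupby("movieId")["userId"].nunique()

# movieIds that were rated by at least 5 distinct users.
popular_ids = users_per_movie[users_per_movie >= 5].index

# Keep only the original rating rows that belong to those movies.
ratings_final = ratings[ratings["movieId"].isin(popular_ids)]
print(ratings_final)
```

With this toy data only movie 10 has five raters, so only its rows should remain.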
CodePudding user response:
You can group by movieId and then filter on the count (number of occurrences) of the rating column being >= 5. This keeps every row from the ratings df whose movie has at least 5 ratings.
ratings_final = ratings.groupby('movieId').filter(lambda x: x['rating'].count() >= 5)
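If the lambda in `filter` turns out to be slow on the full dataset, a boolean mask built with `transform` is a possible alternative. This is only a sketch on a small made-up frame (the column names follow MovieLens, and `n_ratings` and `via_filter` are illustrative names), not a benchmark:

```python
import pandas as pd

# Toy data standing in for the MovieLens ratings file (values are made up).
ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 5, 1, 2, 1],
    "movieId": [10, 10, 10, 10, 10, 20, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.0],
})

# For every row, the number of ratings its movie received
# (same index as `ratings`, so it can be used directly as a mask).
n_ratings = ratings.groupby("movieId")["rating"].transform("count")
ratings_final = ratings[n_ratings >= 5]

# The filter-based version from the answer, for comparison.
via_filter = ratings.groupby("movieId").filter(lambda g: g["rating"].count() >= 5)

print(ratings_final)
```

Both approaches should select the same rows; `transform` avoids calling a Python function once per movie, which may matter when there are many distinct movieIds.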
