How to groupby a column and count the number of unique values in another column

Time:02-08

I have the following dataframe. I need to group by the ngram column and, for each group, count how many unique documents appear in the DocID column.

[Image: sample dataframe with ngram and DocID columns]

For example, from the above:

4-gram: 4 unique documents (doc64, doc383, doc76, doc370)
5-gram: 4
6-gram: 4
7-gram: 2
8-gram: 2

I have a partial idea. I can get the unique DocIDs as follows:

# Split each DocID string on commas, giving a list of lists.
rep = temp['DocID'].str.split(",").tolist()

# Flatten everything into one list.
repSet = []
for docs in rep:
    repSet.extend(docs)

# Remove all duplicates and store the result in a list.
repSet = list(set(repSet))

But I don't know how to merge this with groupby.

EDIT

I have added the output from the first answer provided. Thank you! However, there are only 461 documents in total, so I believe the unique DocID count for any ngram cannot exceed 461. For the trigram, though, the count is above 461 :(

[Image: output from the first answer]

Help will be greatly appreciated. Thanks!

Answer:

Maybe something like this?

df.assign(DocID=df['DocID'].str.split(',')).explode('DocID').groupby('ngram')['DocID'].nunique().reset_index()
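
To make the one-liner above easier to follow, here is a minimal sketch on made-up data. The real dataframe was only posted as an image, so the rows below are hypothetical, and I'm assuming the columns are named ngram and DocID as in the question. The helper name unique_docs_per_ngram and the output column unique_docs are also just illustrative. The whitespace strip is a guess at one possible reason a count could come out above 461: if some IDs have stray spaces after the comma, "doc370" and " doc370" would be counted as two different documents. I can't confirm that from the screenshots.

```python
import pandas as pd

# Hypothetical sample data standing in for the dataframe shown in the image.
df = pd.DataFrame({
    "ngram": ["4-gram", "4-gram", "5-gram", "5-gram"],
    "DocID": [
        "doc64,doc383",
        "doc76, doc370,doc64",   # note the stray space before doc370
        "doc64,doc383",
        "doc383 ,doc76",         # and the stray space after doc383
    ],
})


def unique_docs_per_ngram(frame):
    """Count distinct document IDs within each ngram group."""
    exploded = (
        frame.assign(DocID=frame["DocID"].str.split(","))  # string -> list of IDs
        .explode("DocID")                                  # one row per (ngram, ID)
        # Strip whitespace so " doc370" and "doc370" are the same document.
        .assign(DocID=lambda d: d["DocID"].str.strip())
    )
    return (
        exploded.groupby("ngram")["DocID"]
        .nunique()
        .reset_index(name="unique_docs")
    )


counts = unique_docs_per_ngram(df)
print(counts)

# Sanity check: no group should exceed the number of distinct IDs overall.
total_unique = df["DocID"].str.split(",").explode().str.strip().nunique()
print(total_unique)
```

If the largest per-group count is still bigger than the total you expect, comparing it against total_unique as above should at least show whether the extra IDs come from the data itself (spacing, casing, typos) rather than from the groupby.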