Finding which rows have duplicates in a .csv, but only if they have a certain amount of duplicates-CodePudding

I am trying to determine which sequential rows have at least 50 duplicates within one column. Then I would like to be able to read which rows have the duplicates in a summarized manner, ie

    start end total
    9     60  51
    200   260 60

I'm trying to keep the start and end separate so I can call on them independently later. I have this to open the .csv file and read its contents:

    df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)
    
    df.groupby(0).filter(lambda x: len(x) > 0)

Which gives me this:

0
0   52.0
1   65.0
2   52.0
3   52.0
4   52.0
... ...
4995    8.0
4996    8.0
4997    8.0
4998    8.0
4999    8.0
5000 rows × 1 columns

I'm having a number of problems with this. 1) I'm not sure I totally understand the second function. It seems like it is supposed to group the numbers in my column together. This code:

    df.groupby(0).count()

gives me this:

    0
    0.0
    1.0
    2.0
    3.0
    4.0
    ...
    68.0
    69.0
    70.0
    71.0
    73.0
    65 rows × 0 columns

Which I assume means that there are a total of 65 different unique identities in my column. This just doesn't tell me what they are or where they are. I thought that's what this one would do

    df.groupby(0).filter(lambda x: len(x) > 0)

but if I change the 0 to anything else then it screws up my generated list.

Problem 2) I think in order to get the number of duplicates in a sequence, and which rows they are in, I would probably need to use a for loop, but I'm not sure how to build it. So far, I've been pulling my hair out all day trying to figure it out but I just don't think I know Python well enough yet.

Can I get some help, please?

CodePudding user response：

Let's start by mocking a df:

import numpy as np
np.random.seed(314)
df=pd.DataFrame({0:np.random.randint(10,size = 5000)})
# make sure we have a couple of large blocks
df.loc[300:400,0] = 5 
df.loc[600:660,0] = 4

First we identify where the changes to the consecutive numbers occur, and groupby each of such groups. We record where it starts, where it finishes, and the size of each group

df2 = (df.reset_index()
        .groupby((df[0].diff() != 0).cumsum())
        .agg({'index':['first','last',len]})
        )

Then we only pick those groups that are longer than 50

(df2.where(df2[('index','len')]>50)
    .dropna()
    .astype(int)
    .reset_index(drop = True)
    )

output:


    index
    first   last    len
0   300     400     101
1   600     660     61

For your question as to what df.groupby(0).filter(lambda x: len(x) > 0) does, as far as I can tell it does nothing. It groups by different values in column 0 and then discard those groups whose size is 0, which is none of them by definition. So this returns your full df

Edit

Your code is not quite right, should be

def behavior():
    df2 = (df.reset_index()
        .groupby((df[0].diff() != 0).cumsum())
        .agg({0 : 'mean', 'index':['first','last',len]}))
    df3 = (df2.where(df2[('index','len')]>50)
        .dropna()
        .astype(int)
        .reset_index(drop = True))
    print(df3)

note that we define and return df3 not df2, and also I amended the code to return the value that is repeated in the mean column (sorry names are not very intuitive but you can change them if you want)

first is the index when the repetition starts, last is the last index, and len is how many elements there.

CodePudding user response：

Thanks! So this is what I have:

in:
df = pd.read_csv("BN4 A4-F4, H4_row1_column1_watershed_label.csv", header=None)

def behavior():
    df2 = (df.reset_index()
        .groupby((df[0].diff() != 0).cumsum())
        .agg({'index':['first','last',len]}))
    (df2.where(df2[('index','len')]>50)
        .dropna()
        .astype(int)
        .reset_index(drop = True))
    print(df2)

out:
    index          
     first  last len
0                   
1        0     0   1
2        1     1   1
3        2     5   4
4        6     6   1
5        7     7   1
...    ...   ...  ..
2547  4901  4901   1
2548  4902  4902   1
2549  4903  4906   4
2550  4907  4943  37
2551  4944  4999  56

[2551 rows x 3 columns]

So here's my understanding, please tell me how I'm right or wrong. The first numbers on the left are the rows of the new generated sheet, right? Is there a way to use the values that are duplicated and have them in those rows as well? So it would look something like this: Say that '5' is the duplicate from 4903-4906, '18' from 4907-4943, and '22' from 4944-4999.

     out:
     first last   len  duplicate
2547  4901  4901   1       5
2548  4902  4902   1
2549  4903  4906   4
2550  4907  4943  37      18
2551  4944  4999  56      22