I have a very large dataset that I'm trying to shrink. My idea is to keep 100 rows per neighborhood.
Here's an overview of my data:
| index | name | neighborhood |
|---|---|---|
| 0 | name 1 | neighborhood A |
| 1 | name 2 | neighborhood A |
| 2 | name 3 | neighborhood B |
| 3 | name 4 | neighborhood B |
| 4 | name 5 | neighborhood C |
| 5 | name 6 | neighborhood C |
| 6 | name 7 | neighborhood D |
| 7 | name 8 | neighborhood D |
| 8 | name 9 | neighborhood E |
| 9 | name 10 | neighborhood E |
What is the most efficient way to do so?
Thanks in advance
I'm expecting to create something that looks like this (shown here with one row per neighborhood):
| index | name | neighborhood |
|---|---|---|
| 0 | name 1 | neighborhood A |
| 1 | name 3 | neighborhood B |
| 2 | name 5 | neighborhood C |
| 3 | name 7 | neighborhood D |
| 4 | name 9 | neighborhood E |
CodePudding user response:
It depends on how you want to select the rows.
First n rows of each group with groupby.head:
n = 100
out = df.groupby('neighborhood').head(n)
Random n rows of each group with groupby.sample:
n = 100
out = df.groupby('neighborhood').sample(n=n)
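To make the difference concrete, here is a small self-contained sketch on toy data shaped like the table in the question. The DataFrame contents, the small `n`, and the `random_state` values are assumptions for illustration only. One thing to watch: as far as I know, `sample(n=n)` raises a `ValueError` when a group has fewer than `n` rows (without `replace=True`), so the last line shows one possible workaround, which is to shuffle the whole frame and then take `head(n)`.

```python
import pandas as pd

# Toy data mirroring the question's table (two rows per neighborhood).
df = pd.DataFrame({
    "name": [f"name {i}" for i in range(1, 11)],
    "neighborhood": [f"neighborhood {c}" for c in "AABBCCDDEE"],
})

n = 1  # the question uses 100; 1 keeps the toy output readable

# First n rows of each group, in original row order, with a fresh 0..k index.
first_n = df.groupby("neighborhood").head(n).reset_index(drop=True)

# Random n rows per group; random_state is set only for reproducibility.
random_n = df.groupby("neighborhood").sample(n=n, random_state=0)

# Random *up to* 3 rows per group: shuffle all rows, then keep the first 3
# of each group. Groups with fewer than 3 rows are simply kept whole.
at_most_3 = df.sample(frac=1, random_state=0).groupby("neighborhood").head(3)

print(first_n)
```

With `n = 1`, `first_n` should reproduce the expected table from the question (name 1, 3, 5, 7, 9).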
CodePudding user response:
I think you can use groupby and nth (the index notation below needs pandas 1.4 or later):
dfx = df.groupby('neighborhood').nth[:100]
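A quick sketch of the same idea on toy data, assuming pandas 1.4 or later for the `nth[...]` slice notation. The DataFrame and the small `n` are made up for illustration. Note that the index of the result can differ between pandas versions (my understanding is that from 2.0 on, `nth` keeps the original row index, while older versions use the group key as the index), so the sketch only looks at the values.

```python
import pandas as pd

# Toy data mirroring the question's table (two rows per neighborhood).
df = pd.DataFrame({
    "name": [f"name {i}" for i in range(1, 11)],
    "neighborhood": [f"neighborhood {c}" for c in "AABBCCDDEE"],
})

n = 1  # the question uses 100

# Rows at positions 0..n-1 within each group (slice notation on nth).
dfx = df.groupby("neighborhood").nth[:n]

# Compare on values only, since the resulting index depends on the pandas version.
kept_names = sorted(dfx["name"].tolist())
print(kept_names)
```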
