I have a data frame which essentially looks like this:
| number | value |
|---|---|
| 200 | 0 |
| 201 | 1 |
| 202 | 2 |
| .. | .. |
| 399 | 3 |
| 400 | 4 |
What I want to do is to create a new column which has the range of 3 consecutive numbers:
| number | value | range |
|---|---|---|
| 200 | 0 | 200 - 202 |
| 201 | 1 | 200 - 202 |
| 202 | 2 | 200 - 202 |
| .. | .. | .. |
| 399 | 3 | 398 - 400 |
| 400 | 4 | 398 - 400 |
One thing I can do is to create my own function and write if statements like this:
def function(number):
if number < 203 & number > 199:
return "200-202"
elif number < 206 & number > 202:
return "203-205"
....
and so on
But this would require I write about 70 if statements. I'm sure there is an easier way to do this. Can someone please guide me?
CodePudding user response:
You can determine the range from the number itself.
Assuming you want to start on the first value and use ranges of n=3, you can use:
n = 3
first = df['number'].iloc[0] # initial value (could be set to 0 to have fixed ranges)
start = (df['number']
.sub(first).floordiv(n)
.mul(n).add(first)
)
df['range'] = start.astype(str) '-' start.add(n-1).astype(str)
Output:
number value range
0 200 0 200-202
1 201 1 200-202
2 202 2 200-202
3 399 3 398-400
4 400 4 398-400
CodePudding user response:
IIUC, you can floor divide the index, groupby it and transform a lambda that takes first and last number of each group. Note that this doesn't show the same outcome as your expected outcome because 398 is not there (presumably it's in ...).
df['range'] = df['number'].groupby(df.index//3).transform(lambda x: '{}-{}'.format(x.iat[0], x.iat[-1]))
Output:
number value range
0 200 0 200-202
1 201 1 200-202
2 202 2 200-202
3 399 3 399-400
4 400 4 399-400
or
r = df.loc[::3, 'number'].repeat(3).reset_index(drop=True)
df['range'] = r.astype(str) '-' r.add(2).astype(str)
CodePudding user response:
You should use pd.cut
import pandas as pd
import numpy as np
x = pd.DataFrame({
'a': range(200,400),
'b': np.random.randint(0,10,200)
})
wks = 2
x.loc[:,'range'] = pd.cut(x.a, bins=range(x.a.min(), x.a.max() wks, wks), right=False)
with pd.option_context('display.max_rows',5):
display(x)
Output:
a b range
0 200 7 [200, 202)
1 201 6 [200, 202)
... ... ... ...
198 398 1 [398, 400)
199 399 3 [398, 400)
After which I presume you want to do something like:
with pd.option_context('display.max_rows',5):
display(x.groupby('range').b.sum())
Output:
range
[200, 202) 13
[202, 204) 6
..
[396, 398) 6
[398, 400) 4
Name: b, Length: 100, dtype: int32
