Home > Software design >  How to create a new column of ranges (about 50 ranges) based on another column
How to create a new column of ranges (about 50 ranges) based on another column

Time:01-28

I have a data frame which essentially looks like this:

number value
200 0
201 1
202 2
.. ..
399 3
400 4

What I want to do is to create a new column which has the range of 3 consecutive numbers:

number value range
200 0 200 - 202
201 1 200 - 202
202 2 200 - 202
.. .. ..
399 3 398 - 400
400 4 398 - 400

One thing I can do is to create my own function and write if statements like this:

def function(number):

    if number < 203 & number > 199:
        return "200-202"
    elif number < 206 & number > 202:
        return "203-205"
    ....
    and so on

But this would require I write about 70 if statements. I'm sure there is an easier way to do this. Can someone please guide me?

CodePudding user response:

You can determine the range from the number itself.

Assuming you want to start on the first value and use ranges of n=3, you can use:

n = 3
first = df['number'].iloc[0]   # initial value (could be set to 0 to have fixed ranges)
start = (df['number']
         .sub(first).floordiv(n)
         .mul(n).add(first)
         )

df['range'] = start.astype(str) '-' start.add(n-1).astype(str)

Output:

   number  value    range
0     200      0  200-202
1     201      1  200-202
2     202      2  200-202
3     399      3  398-400
4     400      4  398-400

CodePudding user response:

IIUC, you can floor divide the index, groupby it and transform a lambda that takes first and last number of each group. Note that this doesn't show the same outcome as your expected outcome because 398 is not there (presumably it's in ...).

df['range'] = df['number'].groupby(df.index//3).transform(lambda x: '{}-{}'.format(x.iat[0], x.iat[-1]))

Output:

   number  value    range
0     200      0  200-202
1     201      1  200-202
2     202      2  200-202
3     399      3  399-400
4     400      4  399-400

or

r = df.loc[::3, 'number'].repeat(3).reset_index(drop=True)
df['range'] = r.astype(str)   '-'   r.add(2).astype(str)

CodePudding user response:

You should use pd.cut

import pandas as pd
import numpy as np
x = pd.DataFrame({
    'a': range(200,400),
    'b': np.random.randint(0,10,200)
})
wks = 2
x.loc[:,'range'] = pd.cut(x.a, bins=range(x.a.min(), x.a.max() wks, wks), right=False)
with pd.option_context('display.max_rows',5):
    display(x)

Output:

    a   b   range
0   200 7   [200, 202)
1   201 6   [200, 202)
... ... ... ...
198 398 1   [398, 400)
199 399 3   [398, 400)

After which I presume you want to do something like:

with pd.option_context('display.max_rows',5):
    display(x.groupby('range').b.sum())

Output:

range
[200, 202)    13
[202, 204)     6
              ..
[396, 398)     6
[398, 400)     4
Name: b, Length: 100, dtype: int32
  •  Tags:  
  • Related