I want to count the number of elements in a row that fall within a range and then print a new column with the results. After looking around I came up with the following solution, however the results are not consistent. Is the solution too simplistic for what I want to accomplish? I have the following DataFrame.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
def val_1(row):
val_1 = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(1,3):
val_1 = 3
elif row.loc['A'] and row.loc['B'] in range(1,3):
val_1 = 2
elif row.loc['A'] in range(1,3):
val_1 = 1
return val_1
def val_2(row):
val_2 = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(3,6) :
val_2 = 3
elif row.loc['A'] and row.loc['B'] in range(3,6) :
val_2 = 2
elif row.loc['A'] in range(3,6) :
val_2 = 1
return val_2
def val_3(row):
val_3 = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(6,10) :
val_3 = 3
elif row.loc['A'] and row.loc['B'] in range(6,10) :
val_3 = 2
elif row.loc['A']in range(6,10) :
val_3 = 1
return val_3
def results():
df['Val_1'] = df.apply(val_1, axis=1)
df['Val_2'] = df.apply(val_2, axis=1)
df['Val_3'] = df.apply(val_3, axis=1)
print(df)
results()
A B C Val_1 Val_2 Val_3
0 9 0 0 0 0 1
1 6 1 0 2 0 1
2 8 5 5 0 3 1
3 9 7 0 0 0 2
4 4 6 2 3 1 2
5 1 5 5 1 3 0
6 8 1 7 2 0 3
7 4 8 5 0 3 2
8 0 6 0 0 0 0
9 3 0 3 0 1 0
Thanks for your help.
CodePudding user response:
This is shorter version of your code. The problem is in your code:
- It is checking the range only for last element
row.loc['A'] and row.loc['B'] and row.loc["C"] in range(start,end)- Only checking range logic for
Ccolumn (Not forA&B)
def get_val(row,start,end):
val = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(start,end) :
val = 3
elif row.loc['A'] and row.loc['B'] in range(start,end) :
val = 2
elif row.loc['A'] in range(start,end) :
val = 1
return val
df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
df['Val_1'] = df.apply(lambda x:get_val(x,1,3), axis=1)
df['Val_2'] = df.apply(lambda x:get_val(x,3,6), axis=1)
df['Val_3'] = df.apply(lambda x:get_val(x,6,10), axis=1)
My Solution
def query_build(start, end):
query1 = f'A>={start} and A<={end} and B>={start} and B<={end} and C>={start} and C<={end}'
query2 = f'A>={start} and A<={end} and B>={start} and B<={end}'
query3 = f'A>={start} and A<={end}'
return {query1: 3,
query2: 2,
query3: 1}
df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
df['val1'] = 0
df['val2'] = 0
df['val3'] = 0
val_range = {'val1':(1,2),'val2':(3,5),'val3':(6,9)}
for name, r_range in val_range.items():
query_set = query_build(*r_range)
for query, val in query_set.items():
length = len(df.query(query))
if length:
df[name][df.query(query).index] = val
print(df)
CodePudding user response:
Another approach which is using numpy.select (recommended in the Pandas User Guide here):
import numpy as np
for n, (start, stop) in enumerate([(1, 3), (3, 6), (6, 10)], start=1):
m = df.isin(range(start, stop))
condlist = [m.all(axis=1), m[["A", "B"]].all(axis=1), m["A"]]
df[f"Val_{n}"] = np.select(condlist, [3, 2, 1])
- For each iteration build a mask
mondfthat checks if thedfvalues are in the resp.range(start, stop). - Based on
mbuild a condition listcondlist: The 1. entry checks if all values are in the range, the 2. checks if the values in columnsAandBare in the range, and the 3. checks if the value in columnAis in the range - all checks are done row-wise. - Based on
condlistset the corresponding values from[3, 2, 1](calledchoicelistin thenumpy.selectdocs) in the new column, and0(standarddefault) if no condition is met.
The selection follows the preferences in you code - see the numpy.select documentation:
When multiple conditions are satisfied, the first one encountered in condlist is used.
Your question actually sounds a bit different to what you are doing:
I want to count the number of elements in a row that fall within a range and then print a new column with the results.
If that is your real goal, then you could try something simpler:
for n, (start, stop) in enumerate([(1, 3), (3, 6), (6, 10)], start=1):
df[f"Val_{n}"] = df.isin(range(start, stop)).sum(axis=1)
Why your approach fails (in addition to @Mazhar's explanation): This
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(1,3):
isn't how logical operators like and work: You have to fully specify each part of the condition, like:
if (row.loc['A'] in range(1,3)) and (row.loc['B'] in range(1,3)) and (row.loc["C"] in range(1,3)):
The way you have used it actually resolves to
if bool(row.loc['A']) and bool(row.loc['B']) and row.loc["C"] in range(1,3):
which here means (since the values are numbers)
if (row.loc['A'] != 0) and (row.loc['B'] != 0) and (row.loc["C"] in range(1,3)):
