How can I count the number of elements within a range in each row-CodePudding

I want to count the number of elements in a row that fall within a range and then print a new column with the results. After looking around I came up with the following solution, however the results are not consistent. Is the solution too simplistic for what I want to accomplish? I have the following DataFrame.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))

def val_1(row):
    val_1 = 0
    if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(1,3):
        val_1 = 3

    elif row.loc['A'] and row.loc['B'] in range(1,3):
        val_1 = 2
    elif row.loc['A'] in range(1,3):
        val_1 = 1
    return val_1

def val_2(row):
    val_2 = 0
    if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(3,6) :
        val_2 = 3
    elif row.loc['A'] and row.loc['B'] in range(3,6) :
        val_2 = 2
    elif row.loc['A'] in range(3,6) :
        val_2 = 1
    return val_2

def val_3(row):
    val_3 = 0
    if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(6,10) :
        val_3 = 3
    elif row.loc['A'] and row.loc['B'] in range(6,10) :
        val_3 = 2
    elif row.loc['A']in range(6,10) :
        val_3 = 1
    return val_3

def results():
    df['Val_1'] = df.apply(val_1, axis=1)
    df['Val_2'] = df.apply(val_2, axis=1)
    df['Val_3'] = df.apply(val_3, axis=1)
    print(df)

results()

   A  B  C  Val_1  Val_2  Val_3
0  9  0  0      0      0      1
1  6  1  0      2      0      1
2  8  5  5      0      3      1
3  9  7  0      0      0      2
4  4  6  2      3      1      2
5  1  5  5      1      3      0
6  8  1  7      2      0      3
7  4  8  5      0      3      2
8  0  6  0      0      0      0
9  3  0  3      0      1      0

Thanks for your help.

CodePudding user response：

This is shorter version of your code. The problem is in your code:

It is checking the range only for last element
row.loc['A'] and row.loc['B'] and row.loc["C"] in range(start,end)
Only checking range logic for C column (Not for A & B)

def get_val(row,start,end):
    val = 0
    if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(start,end) :
        val = 3
    elif row.loc['A'] and row.loc['B'] in range(start,end) :
        val = 2
    elif row.loc['A'] in range(start,end) :
        val = 1
    return val

df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
df['Val_1'] = df.apply(lambda x:get_val(x,1,3), axis=1)
df['Val_2'] = df.apply(lambda x:get_val(x,3,6), axis=1)
df['Val_3'] = df.apply(lambda x:get_val(x,6,10), axis=1)

My Solution

def query_build(start, end):
    query1 = f'A>={start} and A<={end} and B>={start} and B<={end} and C>={start} and C<={end}'
    query2 = f'A>={start} and A<={end} and B>={start} and B<={end}'
    query3 = f'A>={start} and A<={end}'

    return {query1: 3,
            query2: 2,
            query3: 1}

df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
df['val1'] = 0
df['val2'] = 0
df['val3'] = 0

val_range = {'val1':(1,2),'val2':(3,5),'val3':(6,9)}

for name, r_range in val_range.items():
    query_set = query_build(*r_range)
    for query, val in query_set.items():
        length = len(df.query(query))
        if length:
            df[name][df.query(query).index] = val
print(df)

CodePudding user response：

Another approach which is using numpy.select (recommended in the Pandas User Guide here):

import numpy as np

for n, (start, stop) in enumerate([(1, 3), (3, 6), (6, 10)], start=1):
    m = df.isin(range(start, stop))
    condlist = [m.all(axis=1), m[["A", "B"]].all(axis=1), m["A"]]
    df[f"Val_{n}"] = np.select(condlist, [3, 2, 1])

For each iteration build a mask m on df that checks if the df values are in the resp. range(start, stop).
Based on m build a condition list condlist: The 1. entry checks if all values are in the range, the 2. checks if the values in columns A and B are in the range, and the 3. checks if the value in column A is in the range - all checks are done row-wise.
Based on condlist set the corresponding values from [3, 2, 1] (called choicelist in the numpy.select docs) in the new column, and 0 (standard default) if no condition is met.

The selection follows the preferences in you code - see the numpy.select documentation:

When multiple conditions are satisfied, the first one encountered in condlist is used.

Your question actually sounds a bit different to what you are doing:

I want to count the number of elements in a row that fall within a range and then print a new column with the results.

If that is your real goal, then you could try something simpler:

for n, (start, stop) in enumerate([(1, 3), (3, 6), (6, 10)], start=1):
    df[f"Val_{n}"] = df.isin(range(start, stop)).sum(axis=1)

Why your approach fails (in addition to @Mazhar's explanation): This

if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(1,3):

isn't how logical operators like and work: You have to fully specify each part of the condition, like:

if (row.loc['A'] in range(1,3)) and (row.loc['B'] in range(1,3)) and (row.loc["C"] in range(1,3)):

The way you have used it actually resolves to

if bool(row.loc['A']) and bool(row.loc['B']) and row.loc["C"] in range(1,3):

which here means (since the values are numbers)

if (row.loc['A'] != 0) and (row.loc['B'] != 0) and (row.loc["C"] in range(1,3)):