I have a df as below format.
year type1 type2 price
2015 apple natural 40
2015 apple organic 35
2016 apple natural 44
2016 apple organic 40
2015 banana natural 20
2015 banana organic 15
2016 banana natural 20
2016 banana organic 18
I need to create a new column price_new when the year, type1 and type2 conditions are met. In other words for the same year and type1 if the type2 is natural then fill the new column with new value or else print the old value.
I tried the below:
df["price_new"] = np.where(((df["year"] == 2015) & (
df["type1"] == "apple") & (df["type2"].isin(['natural']))),
25, df["price"])
df["price_new"] = np.where(((df["year"] == 2016) & (
df["type1"] == "apple") & (df["type2"].isin(['natural']))),
26, df["price"])
df["price_new"] = np.where(((df["year"] == 2015) & (
df["type1"] == "apple") & (~df["type2"].isin(['natural']))),
20, df["price"])
df["price_new"] = np.where(((df["year"] == 2016) & (
df["type1"] == "apple") & (~df["type2"].isin(['natural']))),
22, df["price"])
The output should be like below:
year type1 type2 price price_new
2015 apple natural 40 25
2015 apple organic 35 20
2016 apple natural 44 26
2016 apple organic 40 22
2015 banana natural 20
2015 banana organic 15
2016 banana natural 20
2016 banana organic 18
However, the values from only the last condition are printed:
year type1 type2 price price_new
2015 apple natural 40 40
2015 apple organic 35 35
2016 apple natural 44 44
2016 apple organic 40 22
2015 banana natural 20
2015 banana organic 15
2016 banana natural 20
2016 banana organic 18
- How could the
price_newcolumn get the new values for all conditions - In real data i have more than 10 types in
type1column. Is there an efficient way to write this instead of writing it for each unique value intype1column?
CodePudding user response:
Answering first question:
- First copy all prices into
price_new - Then use
df["price_new"]as the default value innp.wherecalls
df["price_new"] = df["price"]
df["price_new"] = np.where(condition, value, df["price_new"])
Answering second question:
You can reduce code by putting all conditions and corresponding values into a suitable data structure (here list of named tuples), and then use it to generate boolean masks (condlist) used by np.select that will compute your result.
Note that I use dicts and dict.setdefault when computing masks so masks that appear more than once are only computed once.
from collections import namedtuple
Condition = namedtuple("Condition", ["year", "type1", "type2"])
Mapping = namedtuple("Mapping", ["condition", "value"])
mappings = [
Mapping(Condition(year=2015, type1="apple", type2="natural"), value=25),
Mapping(Condition(year=2016, type1="apple", type2="natural"), value=26),
Mapping(Condition(year=2015, type1="apple", type2="organic"), value=20),
Mapping(Condition(year=2016, type1="apple", type2="organic"), value=22),
]
conditions, values = zip(*mappings)
years, types1, types2 = {}, {}, {}
condlist = [
(
years.setdefault(year, df["year"] == year)
& types1.setdefault(type1, df["type1"] == type1)
& types2.setdefault(type2, df["type2"] == type2)
)
for year, type1, type2 in conditions
]
df["price_new"] = np.select(condlist, values, default=df["price"])
Check if it works as expected:
expected = [25, 20, 26, 22, 20, 15, 20, 18]
print(f"{np.all(df['price_new'] == expected) = }") # True
Old answer:
You can put all your conditions and matching values into a suitable data structure, and then process with computation.
Here I choose a list of tuples (named tuples for clarity) where first element of the tuple is the condition and second element is the corresponding value.
from collections import namedtuple
Mapping = namedtuple("Mapping", ["where", "value"])
mappings = [
Mapping(
where=(
(df["year"] == 2015)
& (df["type1"] == "apple")
& (df["type2"].isin(["natural"]))
),
value=25,
),
Mapping(
where=(
(df["year"] == 2016)
& (df["type1"] == "apple")
& (df["type2"].isin(["natural"]))
),
value=26,
),
Mapping(
where=(
(df["year"] == 2015)
& (df["type1"] == "apple")
& (~df["type2"].isin(["natural"]))
),
value=20,
),
Mapping(
where=(
(df["year"] == 2016)
& (df["type1"] == "apple")
& (~df["type2"].isin(["natural"]))
),
value=22,
),
]
From there you can either:
- Rewrite the previous code as a for loop:
df["price_new_1"] = df["price"]
for where, value in mappings:
df["price_new_1"] = np.where(where, value, df["price_new_1"])
- Use
np.selectto compute all at once:
condlist, choicelist = zip(*mappings)
df["price_new_2"] = np.select(condlist, choicelist, default=df["price"])
Check if it works as expected:
expected = [25, 20, 26, 22, 20, 15, 20, 18]
print(f"{np.all(df['price_new_1'] == expected) = }") # True
print(f"{np.all(df['price_new_2'] == expected) = }") # True
