I would like to compare the amount_spent column in df1 and df2 and if the amount_spent column in df1 is null (null is a string not nan) then populate the value from df2 to df1 for that particular customer_id.
df1
customer_id amount_spent
3021 144
0535 042
7532 null
2131 932
df2
3021 144
0535 042
7532 945
Desired output df
3021 144
0535 042
7532 945
CodePudding user response:
Try this:
import pandas as pd
df1 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', 'null']})
df2 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', '945']})
null_list = df1.index[df1['amount_spent'] == 'null'].tolist()
for null in null_list:
df1["amount_spent"][int(null)] = df2["amount_spent"][int(null)]
It creates a list with all indices that underly the condition and populates the right value.
CodePudding user response:
You can try:
import pandas as pd
df1 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', 'null']})
df2 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', '945']})
df1 = df1.set_index("customer_id")
df2 = df2.set_index("customer_id")
df1.loc[df1['amount_spent'] == "null", 'amount_spent'] = df2['amount_spent']
df1 = df1.reset_index()
print(df1)
It gives:
customer_id amount_spent
0 3021 144
1 0535 042
2 7532 945
CodePudding user response:
import pandas as pd
from numpy import nan
data_1 = [['3021', '144'], ['0535', '042'], ['7532', nan]]
data_2 = [['3021', '144' ], ['0535', '042'], ['7532', '945']]
df_1 = pd.DataFrame(data_1, columns = ['customer_id', 'amount_spent'])
df_2 = pd.DataFrame(data_2, columns = ['customer_id', 'amount_spent'])
print(df_1.fillna(df_2))
output
customer_id amount_spent
0 3021 144
1 0535 042
2 7532 945
Edit:
Since you modified your post adding a row to the first dataset, this is a version which handles the situation in which the two dataset have different sizes, reducing both to the smaller one.
import pandas as pd
from numpy import nan
data_1 = [['3021', '144'], ['0535', '042'], ['7532', 'null'], ['2121', '932']]
data_2 = [['3021', '144' ], ['0535', '042'], ['7532', '945']]
df_1 = pd.DataFrame(data_1, columns = ['customer_id', 'amount_spent'])
df_2 = pd.DataFrame(data_2, columns = ['customer_id', 'amount_spent'])
# replace 'null' with nan
df_1 = df_1.replace('null', nan)
# reducing the dataframes to the same number of rows
df_1_rows = len(df_1.index)
df_2_rows = len(df_2.index)
cutoff = df_1_rows if df_1_rows < df_2_rows else df_2_rows
df_1 = df_1[:cutoff]
df_2 = df_2[:cutoff]
print(df_1.fillna(df_2))
In the first version, I used nan instead of 'null' since I didnì't read carefully. In this second version, I used 'null' and replaced it with nan (the code is commented). If you are reading your dataframe from a csv or similar, it's possible to avoid this replacing by using na_values parameter of read_csv
