Compare two df's and populate value from one df to another-CodePudding

I would like to compare the amount_spent column in df1 and df2 and if the amount_spent column in df1 is null (null is a string not nan) then populate the value from df2 to df1 for that particular customer_id.

df1

customer_id  amount_spent 
3021         144
0535         042
7532         null 
2131         932

df2

3021         144
0535         042
7532         945

Desired output df

3021         144
0535         042
7532         945

CodePudding user response：

Try this:

import pandas as pd

df1 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', 'null']})
df2 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', '945']})

null_list = df1.index[df1['amount_spent'] == 'null'].tolist()

for null in null_list:
    df1["amount_spent"][int(null)] = df2["amount_spent"][int(null)]

It creates a list with all indices that underly the condition and populates the right value.

CodePudding user response：

You can try:

import pandas as pd

df1 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', 'null']})
df2 = pd.DataFrame({"customer_id": ['3021', '0535', '7532'], "amount_spent": ['144', '042', '945']})


df1 = df1.set_index("customer_id")
df2 = df2.set_index("customer_id")
df1.loc[df1['amount_spent'] == "null", 'amount_spent'] = df2['amount_spent']
df1 = df1.reset_index()
print(df1)

It gives:

  customer_id amount_spent
0        3021          144
1        0535          042
2        7532          945

CodePudding user response：

import pandas as pd
from numpy import nan

data_1 = [['3021', '144'], ['0535', '042'], ['7532', nan]]
data_2 = [['3021', '144' ], ['0535', '042'], ['7532', '945']]

df_1 = pd.DataFrame(data_1, columns = ['customer_id', 'amount_spent'])
df_2 = pd.DataFrame(data_2, columns = ['customer_id', 'amount_spent'])

print(df_1.fillna(df_2))

output

customer_id amount_spent
0        3021          144
1        0535          042
2        7532          945

Edit:

Since you modified your post adding a row to the first dataset, this is a version which handles the situation in which the two dataset have different sizes, reducing both to the smaller one.

import pandas as pd
from numpy import nan

data_1 = [['3021', '144'], ['0535', '042'], ['7532', 'null'], ['2121', '932']]
data_2 = [['3021', '144' ], ['0535', '042'], ['7532', '945']]

df_1 = pd.DataFrame(data_1, columns = ['customer_id', 'amount_spent'])
df_2 = pd.DataFrame(data_2, columns = ['customer_id', 'amount_spent'])

# replace 'null' with nan 
df_1 = df_1.replace('null', nan)

# reducing the dataframes to the same number of rows
df_1_rows = len(df_1.index)
df_2_rows = len(df_2.index)
cutoff = df_1_rows if df_1_rows < df_2_rows else df_2_rows
df_1 = df_1[:cutoff]
df_2 = df_2[:cutoff]


print(df_1.fillna(df_2))

In the first version, I used nan instead of 'null' since I didnì't read carefully. In this second version, I used 'null' and replaced it with nan (the code is commented). If you are reading your dataframe from a csv or similar, it's possible to avoid this replacing by using na_values parameter of read_csv