Let's say I have the dataset:
df1 = pd.DataFrame()
df1['number'] = [0,0,0,0,0]
df1["decade"] = ["1970", "1980", "1990", "2000", "2010"]`
print(df1)
#output:
number decade
0 0 1970
1 0 1980
2 0 1990
3 0 2000
4 0 2010
and I want to merge it with another dataset:
df2 = pd.DataFrame()
df2['number'] = [1,1]
df2["decade"] = ["1990", "2010"]
print(df2)
#output:
number decade
0 1 1990
1 1 2010
such that it get's values only from the decades from df2 that have values in them and leaves the others untouched, yielding:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
how must one go about doing that in pandas? I've tried stuff like join, merge, and concat but they all seem to either not give the desired result or not work because of the different dimensions of the 2 datasets. Any suggestions regarding which function I should be looking at?
Thank you so much!
CodePudding user response:
You can use pandas.DataFrame.merge with pandas.Series.fillna :
out = (
df1[["decade"]]
.merge(df2, on="decade", how="left")
.fillna({"number": df1["number"]}, downcast="infer")
)
# Output :
print(out)
decade number
0 1970 0
1 1980 0
2 1990 1
3 2000 0
4 2010 1
CodePudding user response:
What about using apply?
First you create a function
def validation(previous,latest):
if pd.isna(latest):
return previous
else:
return latest
Then you can use the function dataframe.apply to compare the data in df1 to df2
df1['number'] = df1.apply(lambda row: validation(row['number'],df2.loc[df2['decade'] == row.decade].number.max()),axis = 1)
Your result:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
