I have the following DataFrame:
import pandas as pd
data = {'id': ['A', 'A','B', 'C'],
'year': [2002,2002, 2003, 2004],
'city':['London', 'Rome','Paris', 'Berlin'],
'appearence': [1,1,1,1]}
df = pd.DataFrame(data)
I want to fill the gaps in the year column for each row, starting from 2000 and ending at the year in which appearence equals 1. The appearence column is always equal to 1 in the input DataFrame. Please note that an id can be in two different cities in the same year.
The desired output:
import pandas as pd
data = {'id': ['A', 'A', 'A', 'A', 'A', 'A','B','B','B','B','C','C','C','C','C'],
'year': [2000, 2001, 2002, 2000, 2001, 2002,2000, 2001, 2002, 2003,2000,2001,2002,2003, 2004],
'city':['NaN', 'NaN','London','NaN', 'NaN','Rome', 'NaN', 'NaN','NaN','Paris', 'NaN', 'NaN','NaN','NaN','Berlin'],
'appearence': [0,0,1,0,0,1,0,0,0,1,0,0,0,0,1]}
df = pd.DataFrame(data)
CodePudding user response:
To prepend the years starting from 2000 (this works as long as appearence is 1 in the input DataFrame), use DataFrame.reindex per group with a range from 2000 to the group's maximal year:
f = lambda x: x.set_index('year').reindex(range(2000, x['year'].max() + 1))
df = (df.groupby(['city', 'id'])
.apply(f)
.drop('id', axis=1)
.droplevel(0)
.reset_index()
.fillna({'appearence':0}, downcast='int')
)
print (df)
   id  year    city  appearence
0   C  2000     NaN           0
1   C  2001     NaN           0
2   C  2002     NaN           0
3   C  2003     NaN           0
4   C  2004  Berlin           1
5   A  2000     NaN           0
6   A  2001     NaN           0
7   A  2002  London           1
8   B  2000     NaN           0
9   B  2001     NaN           0
10  B  2002     NaN           0
11  B  2003   Paris           1
12  A  2000     NaN           0
13  A  2001     NaN           0
14  A  2002    Rome           1
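As a hedged alternative, the same result can be sketched without groupby.apply, whose handling of the grouping columns may differ between pandas versions. The idea is to expand each input row into the years 2000 up to its own year with DataFrame.explode. The names START, last_year and is_last are illustrative and not part of the original answer, and ignore_index=True in explode is assumed to be available (pandas 1.1 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'C'],
    'year': [2002, 2002, 2003, 2004],
    'city': ['London', 'Rome', 'Paris', 'Berlin'],
    'appearence': [1, 1, 1, 1],
})

START = 2000  # first year to fill from, as stated in the question

# Keep each row's original year, then replace 'year' by the list START..year
out = df.rename(columns={'year': 'last_year'})
out['year'] = [list(range(START, y + 1)) for y in out['last_year']]

# One output row per list element; explode leaves an object column behind
out = out.explode('year', ignore_index=True)
out['year'] = out['year'].astype(int)

# Only the final year of each expanded block keeps its city and appearence
is_last = out['year'] == out['last_year']
out['city'] = out['city'].where(is_last)
out['appearence'] = is_last.astype(int)

out = out[['id', 'year', 'city', 'appearence']]
print(out)
```

Because every input row is expanded independently, an id that appears in two cities in the same year (A here) gets one block of years per city, and the rows keep the input order rather than being sorted by city.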
CodePudding user response:
One option is to use complete from pyjanitor, which abstracts the reshaping by explicitly exposing the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
# create dictionary for new dates
dates = {"year": lambda df: range(2000, df.max() + 1)}
# execute complete, and fill the nulls with 0
(df.complete(dates, by="id", sort=True)
.fillna({"appearence": 0}, downcast="infer")
)
   id  year    city  appearence
0   A  2000     NaN           0
1   A  2001     NaN           0
2   A  2002  London           1
3   B  2000     NaN           0
4   B  2001     NaN           0
5   B  2002     NaN           0
6   B  2003   Paris           1
7   C  2000     NaN           0
8   C  2001     NaN           0
9   C  2002     NaN           0
10  C  2003     NaN           0
11  C  2004  Berlin           1
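For readers without pyjanitor, the idea of "exposing missing rows" can be approximated in plain pandas: build every (id, year) pair that should exist, then left-merge the original rows onto it. This is only a sketch of the concept, not pyjanitor's actual implementation, and the names max_year and grid are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'A', 'B', 'C'],
    'year': [2002, 2002, 2003, 2004],
    'city': ['London', 'Rome', 'Paris', 'Berlin'],
    'appearence': [1, 1, 1, 1],
})

# Last recorded year per id
max_year = df.groupby('id')['year'].max()

# Every (id, year) pair from 2000 up to that id's last year
grid = pd.DataFrame(
    [(i, y) for i, m in max_year.items() for y in range(2000, m + 1)],
    columns=['id', 'year'],
)

# Attach the known rows; pairs without a match get NaN in the other columns
out = grid.merge(df, on=['id', 'year'], how='left')

# Years that were not in the input count as "no appearance"
out['appearence'] = out['appearence'].fillna(0).astype(int)
print(out)
```

In this sketch the years are completed per id, not per (id, city), so the filler years 2000 and 2001 appear once for A, while 2002 appears twice (London and Rome). If each city needs its own run of years, as in the desired output of the question, the grid would have to be built per (id, city) pair instead.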
CodePudding user response:
import pandas as pd
start = 2000
data = {'id': ['A', 'B', 'C'],
'year': [2002, 2003, 2004],
'city':['London', 'Paris', 'Berlin'],
'appearence': [1,1,1]}
row1 = []
row2 = []
row3 = []
row4 = []
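counter = 0
for i in data['year']:
    # add one filler row per year from start up to and including year i
    for j in range(start, i + 1):
        row1.append(data['id'][counter])
        row2.append(j)
        row3.append("NaN")
        row4.append(0)
    # the last row added is the real one: restore its appearence and city
    row4.pop()
    row4.append(data['appearence'][counter])
    row3.pop()
    row3.append(data['city'][counter])
    counter = counter + 1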
data = {'id': row1,
'year': row2,
'city':row3,
'appearence': row4}
df = pd.DataFrame(data)
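The same loop can be written more compactly by iterating over the columns with zip and collecting whole rows instead of four parallel lists. The sketch below is one possible variant, not the original answer's code: the helper name fill_years is hypothetical, it uses None (shown by pandas as a missing value) instead of the string "NaN", and it is run on the four-row data from the question so that the duplicated id A is covered.

```python
import pandas as pd

def fill_years(data, start=2000):
    """Expand each input record into rows for start..year (hypothetical helper)."""
    rows = []
    columns = zip(data['id'], data['year'], data['city'], data['appearence'])
    for id_, year, city, flag in columns:
        # Filler rows for the years before the recorded one
        rows.extend((id_, y, None, 0) for y in range(start, year))
        # The recorded row itself keeps its city and appearence value
        rows.append((id_, year, city, flag))
    return pd.DataFrame(rows, columns=['id', 'year', 'city', 'appearence'])

data = {'id': ['A', 'A', 'B', 'C'],
        'year': [2002, 2002, 2003, 2004],
        'city': ['London', 'Rome', 'Paris', 'Berlin'],
        'appearence': [1, 1, 1, 1]}

df = fill_years(data)
print(df)
```

Building each row in one place avoids the pop/append step that was needed to overwrite the last filler row.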
