Insert rows to fill years gap in Pandas data frame

Time:02-01

I have the following DataFrame:

import pandas as pd

data = {'id':  ['A', 'A','B', 'C'],
        'year': [2002,2002, 2003, 2004],
        'city':['London', 'Rome','Paris', 'Berlin'],
        'appearence': [1,1,1,1]}

df = pd.DataFrame(data)

I want to fill the gaps in the year column, starting from 2000 and running up to the year in which appearence equals 1. The appearence column is always equal to 1 in the input DataFrame. Please note that each id can be in two different cities in the same year.

The desired output:

import pandas as pd

data = {'id':  ['A', 'A', 'A', 'A', 'A', 'A','B','B','B','B','C','C','C','C','C'],
        'year': [2000, 2001, 2002, 2000, 2001, 2002,2000, 2001, 2002, 2003,2000,2001,2002,2003, 2004],
        'city':['NaN', 'NaN','London','NaN', 'NaN','Rome', 'NaN', 'NaN','NaN','Paris', 'NaN', 'NaN','NaN','NaN','Berlin'],
        'appearence': [0,0,1,0,0,1,0,0,0,1,0,0,0,0,1]}

df = pd.DataFrame(data)

CodePudding user response:

This solution prepends the missing years, starting from 2000, for each (city, id) group. It works as long as appearence is 1 in every row of the input DataFrame, and uses DataFrame.reindex over the range from 2000 to the group's maximal year:

f = lambda x: x.set_index('year').reindex(range(2000, x['year'].max() + 1))
df = (df.groupby(['city', 'id'])
        .apply(f)
        .drop('id', axis=1)
        .droplevel(0)
        .reset_index()
        .fillna({'appearence':0}, downcast='int')
        )
        
print (df)
   id  year    city  appearence
0   C  2000     NaN           0
1   C  2001     NaN           0
2   C  2002     NaN           0
3   C  2003     NaN           0
4   C  2004  Berlin           1
5   A  2000     NaN           0
6   A  2001     NaN           0
7   A  2002  London           1
8   B  2000     NaN           0
9   B  2001     NaN           0
10  B  2002     NaN           0
11  B  2003   Paris           1
12  A  2000     NaN           0
13  A  2001     NaN           0
14  A  2002    Rome           1
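
To see what the reindex step in the answer above does on its own, here is a minimal sketch applied to a single group (the B/Paris row from the question). The names group and filled are only for illustration. Reindexing introduces NaN in every column for the new years, which is why appearence turns into a float column and needs the fillna afterwards.

```python
import pandas as pd

# A single (city, id) group, taken from the question's input
group = pd.DataFrame({'id': ['B'],
                      'year': [2003],
                      'city': ['Paris'],
                      'appearence': [1]})

# Move year into the index, then ask for every year from 2000 up to the
# group's last year; years that were not present become all-NaN rows.
filled = group.set_index('year').reindex(range(2000, group['year'].max() + 1))

print(filled)
```

The id column is NaN in the new rows too, which is presumably why the answer drops it and recovers id from the group keys instead.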

CodePudding user response:

One option is to use complete from pyjanitor, which abstracts the reshaping by explicitly exposing the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor

# create dictionary for new dates
dates = {"year": lambda df: range(2000, df.max() + 1)}

# execute complete, and fill the nulls with 0
(df.complete(dates, by="id", sort=True)
   .fillna({"appearence": 0}, downcast="infer")
 )
   id  year    city  appearence
0   A  2000     NaN           0
1   A  2001     NaN           0
2   A  2002  London           1
3   B  2000     NaN           0
4   B  2001     NaN           0
5   B  2002     NaN           0
6   B  2003   Paris           1
7   C  2000     NaN           0
8   C  2001     NaN           0
9   C  2002     NaN           0
10  C  2003     NaN           0
11  C  2004  Berlin           1

CodePudding user response:

import pandas as pd
start = 2000
data = {'id':  ['A', 'B', 'C'],
        'year': [2002, 2003, 2004],
        'city':['London', 'Paris', 'Berlin'],
        'appearence': [1,1,1]}
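# Build each output column as a plain list
row1 = []  # id
row2 = []  # year
row3 = []  # city
row4 = []  # appearence
counter = 0
for i in data['year']:
        # One row per year from start up to and including the observed year
        for j in range(start, i + 1):
                row1.append(data['id'][counter])
                row2.append(j)
                row3.append("NaN")
                row4.append(0)
        # The last row of each block is the real observation: restore its values
        row4.pop()
        row4.append(data['appearence'][counter])
        row3.pop()
        row3.append(data['city'][counter])
        counter = counter + 1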
data = {'id':  row1,
        'year': row2,
        'city':row3,
        'appearence': row4}
df = pd.DataFrame(data)
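
The same row-by-row idea can also be written without an explicit Python loop. This is only a sketch, assuming the four-row input from the question and that every input year is 2000 or later. The names START, n_rows, orig_year, offset and is_filler are made up for the example. Each input row is repeated once per year, the copies are numbered to get the years, and every copy except the last is blanked out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'C'],
                   'year': [2002, 2002, 2003, 2004],
                   'city': ['London', 'Rome', 'Paris', 'Berlin'],
                   'appearence': [1, 1, 1, 1]})

START = 2000  # first year to fill from, as in the question

# Repeat each input row once for every year from START to its own year
n_rows = (df['year'] - START + 1).to_numpy()
out = df.loc[df.index.repeat(n_rows)]

# Remember the observed year, and number the copies of each row 0, 1, 2, ...
orig_year = out['year'].to_numpy()
offset = out.groupby(level=0).cumcount().to_numpy()

out = out.reset_index(drop=True)
out['year'] = START + offset

# Every copy before the observed year is a filler row
is_filler = out['year'] < orig_year
out.loc[is_filler, 'city'] = np.nan
out.loc[is_filler, 'appearence'] = 0

print(out)
```

Because each input row is expanded independently, an id that appears in two cities in the same year (A in London and Rome) gets its own run of years for each city. Unlike the loop, this sketch uses a real missing value rather than the string "NaN" for city.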
