How to reindex pandas dataframes of different lengths?

Time:02-08

I have a list, series, of pandas dataframes of different lengths. I need to transform them so that they all have the same (maximum) length, max_len.

0:
2021-01-20 11:20    4
2021-01-20 11:22    5
2021-01-20 11:23    4 
2021-01-20 11:40    3

1:
2021-01-20 11:00    1
2021-01-20 11:22    2
2021-01-20 11:23    4 
2021-01-20 11:40    3
2021-01-21 10:00    1

2:
2021-01-20 11:20    2
2021-01-20 11:22    5
2021-01-20 11:23    4 
2021-01-20 11:40    3
2021-01-21 9:00     2

Expected result:

In this example the maximum length is 7. Thus, all dataframes should have the same length:

0:
2021-01-20 11:00    NaN
2021-01-20 11:20    4
2021-01-20 11:22    5
2021-01-20 11:23    4 
2021-01-20 11:40    3
2021-01-21 9:00     NaN
2021-01-21 10:00    NaN

1:
2021-01-20 11:00    1
2021-01-20 11:20    NaN
2021-01-20 11:22    2
2021-01-20 11:23    4 
2021-01-20 11:40    3
2021-01-21 9:00     NaN
2021-01-21 10:00    1

2:
2021-01-20 11:00    NaN
2021-01-20 11:20    2
2021-01-20 11:22    5
2021-01-20 11:23    4 
2021-01-20 11:40    3
2021-01-21 9:00     2
2021-01-21 10:00    NaN

This is my code:

for i in range(len(series)):
    if len(series[i])!= max_len:
        series[i] = series[i].reindex(longest_series.index)

However, the reindex gives an i-th dataframe in which all values are NaN. Also, I'm not sure how to retrieve max_len.

How can I get NaN values only at the new index values in each dataframe?

Update:

import pandas as pd

selected_series = []

l = pd.DataFrame([{'2021-01-20 11:20': 4, '2021-01-20 11:22': 5, '2021-01-20 11:23': 4, '2021-01-20 11:40': 3}])
selected_series.append(l)
l = pd.DataFrame([{'2021-01-20 11:00': 1, '2021-01-20 11:22': 2, '2021-01-20 11:23': 4, '2021-01-20 11:40': 3, '2021-01-21 10:00': 1}])
selected_series.append(l)
l = pd.DataFrame([{'2021-01-20 11:20': 2, '2021-01-20 11:22': 5, '2021-01-20 11:23': 4, '2021-01-20 11:40': 3, '2021-01-21 9:00': 2}])
selected_series.append(l)

CodePudding user response:

If you do indeed have a Series of DataFrame objects, then this should work:

s = s.apply(lambda df: pd.merge(template.copy(), df, on="date", how="outer"))

Here s is a Series of DataFrame objects, each with a datetime column "date" and a "value" column of any type, and template is a DataFrame containing every row that each DataFrame in your series should have:

                 date
0 2021-01-20 11:00:00
1 2021-01-20 11:20:00
2 2021-01-20 11:22:00
3 2021-01-20 11:23:00
4 2021-01-20 11:40:00
5 2021-01-21 09:00:00
6 2021-01-21 10:00:00

Note: the DataFrame objects in your series need named columns for this to work easily.
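The frames in the question's update carry the timestamps as column labels rather than in a "date" column. Below is a minimal sketch of one way to reshape such a frame into the date/value layout this answer assumes. Using melt is my assumption, not something the question prescribes, and the names wide and long_df are purely illustrative:

```python
import pandas as pd

# One-row frame shaped like the question's update: timestamps are column labels.
wide = pd.DataFrame([{
    "2021-01-20 11:20": 4,
    "2021-01-20 11:22": 5,
    "2021-01-20 11:23": 4,
    "2021-01-20 11:40": 3,
}])

# melt() moves the column labels into a "date" column and the cells into "value".
long_df = wide.melt(var_name="date", value_name="value")

# Parse the labels so they can match a template built from datetime values.
long_df["date"] = pd.to_datetime(long_df["date"])

print(long_df)
```

After this step each frame has one row per timestamp, so it can be merged on "date".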

Result:

In [5]: s[0]  # first dataframe in series s
                 date  value
0 2021-01-20 11:00:00    NaN
1 2021-01-20 11:20:00    4.0
2 2021-01-20 11:22:00    5.0
3 2021-01-20 11:23:00    4.0
4 2021-01-20 11:40:00    3.0
5 2021-01-21 09:00:00    NaN
6 2021-01-21 10:00:00    NaN
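For reference, here is a self-contained sketch of the merge approach on the question's sample data. It assumes the frames live in a plain list (with a real Series, the s.apply call shown earlier plays the same role) and that the template is built from the union of all observed dates rather than from a fixed range. The helper make_frame and the names frames and aligned are hypothetical:

```python
import pandas as pd

def make_frame(rows):
    """Build a date/value frame from (timestamp string, value) pairs."""
    df = pd.DataFrame(rows, columns=["date", "value"])
    df["date"] = pd.to_datetime(df["date"])
    return df

frames = [
    make_frame([("2021-01-20 11:20", 4), ("2021-01-20 11:22", 5),
                ("2021-01-20 11:23", 4), ("2021-01-20 11:40", 3)]),
    make_frame([("2021-01-20 11:00", 1), ("2021-01-20 11:22", 2),
                ("2021-01-20 11:23", 4), ("2021-01-20 11:40", 3),
                ("2021-01-21 10:00", 1)]),
    make_frame([("2021-01-20 11:20", 2), ("2021-01-20 11:22", 5),
                ("2021-01-20 11:23", 4), ("2021-01-20 11:40", 3),
                ("2021-01-21 09:00", 2)]),
]

# Template: every distinct timestamp that occurs in any frame, in sorted order.
all_dates = pd.concat([df["date"] for df in frames]).drop_duplicates().sort_values()
template = pd.DataFrame({"date": all_dates.reset_index(drop=True)})

# Outer-merge each frame onto the template; rows missing from a frame get NaN.
# The explicit sort keeps the row order independent of the pandas version.
aligned = [
    pd.merge(template, df, on="date", how="outer")
      .sort_values("date")
      .reset_index(drop=True)
    for df in frames
]

max_len = len(template)
print(max_len)
print(aligned[0])
```

Building the template from the data keeps only timestamps that actually occur, which matches the expected result in the question. A date_range template would instead add a row for every step of the range.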

To create a template DataFrame, e.g. from June 20, 2020 through the end of June 21, 2020, in increments of 5 minutes:

template = pd.DataFrame({"date": pd.date_range("2020-06-20", "2020-06-22", freq="5min")})

CodePudding user response:

You can use pandas.concat to align the frames on their columns, then split the result again:

df = pd.concat(selected_series).T
lst = list(zip(*df.items()))[1]

output:

(2021-01-20 11:20    4.0
 2021-01-20 11:22    5.0
 2021-01-20 11:23    4.0
 2021-01-20 11:40    3.0
 2021-01-20 11:00    NaN
 2021-01-21 10:00    NaN
 2021-01-21 9:00     NaN
 Name: 0, dtype: float64,
 2021-01-20 11:20    NaN
 2021-01-20 11:22    2.0
 2021-01-20 11:23    4.0
 2021-01-20 11:40    3.0
 2021-01-20 11:00    1.0
 2021-01-21 10:00    1.0
 2021-01-21 9:00     NaN
 Name: 0, dtype: float64,
 2021-01-20 11:20    2.0
 2021-01-20 11:22    5.0
 2021-01-20 11:23    4.0
 2021-01-20 11:40    3.0
 2021-01-20 11:00    NaN
 2021-01-21 10:00    NaN
 2021-01-21 9:00     2.0
 Name: 0, dtype: float64)
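If each frame instead has the timestamps as its index, as in the original listing, reindex can be used directly. reindex only keeps values whose labels match the new index exactly, so an all-NaN result usually means the labels do not match (for example strings versus Timestamps, or timestamps stored as columns as in the update). That is a likely cause, not a confirmed diagnosis. The sketch below assumes a DatetimeIndex and a single "value" column, and make_frame, full_index and aligned are hypothetical names:

```python
from functools import reduce

import pandas as pd

def make_frame(pairs):
    """Build a one-column frame indexed by parsed timestamps."""
    dates, values = zip(*pairs)
    return pd.DataFrame({"value": values}, index=pd.to_datetime(list(dates)))

series = [
    make_frame([("2021-01-20 11:20", 4), ("2021-01-20 11:22", 5),
                ("2021-01-20 11:23", 4), ("2021-01-20 11:40", 3)]),
    make_frame([("2021-01-20 11:00", 1), ("2021-01-20 11:22", 2),
                ("2021-01-20 11:23", 4), ("2021-01-20 11:40", 3),
                ("2021-01-21 10:00", 1)]),
    make_frame([("2021-01-20 11:20", 2), ("2021-01-20 11:22", 5),
                ("2021-01-20 11:23", 4), ("2021-01-20 11:40", 3),
                ("2021-01-21 09:00", 2)]),
]

# Union of all indexes: every timestamp that appears in at least one frame.
full_index = reduce(lambda acc, idx: acc.union(idx),
                    (df.index for df in series)).sort_values()

# The common length is the size of the union, not the longest single frame.
max_len = len(full_index)

# Reindex each frame; only timestamps it did not have before become NaN.
aligned = [df.reindex(full_index) for df in series]

print(max_len)
print(aligned[2])
```

Note that the target length here is 7 even though the longest input frame has 5 rows, which is why reindexing against the longest frame's index alone would not give the expected result.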