I have a list, series, of pandas DataFrames of different lengths. I need to transform them so that they all have the same (maximum) length, max_len.
0:
2021-01-20 11:20 4
2021-01-20 11:22 5
2021-01-20 11:23 4
2021-01-20 11:40 3
1:
2021-01-20 11:00 1
2021-01-20 11:22 2
2021-01-20 11:23 4
2021-01-20 11:40 3
2021-01-21 10:00 1
2:
2021-01-20 11:20 2
2021-01-20 11:22 5
2021-01-20 11:23 4
2021-01-20 11:40 3
2021-01-21 9:00 2
Expected result:
In this example the maximum length is 7. Thus, all dataframes should have the same length:
0:
2021-01-20 11:00 NaN
2021-01-20 11:20 4
2021-01-20 11:22 5
2021-01-20 11:23 4
2021-01-20 11:40 3
2021-01-21 9:00 NaN
2021-01-21 10:00 NaN
1:
2021-01-20 11:00 1
2021-01-20 11:20 NaN
2021-01-20 11:22 2
2021-01-20 11:23 4
2021-01-20 11:40 3
2021-01-21 9:00 NaN
2021-01-21 10:00 1
2:
2021-01-20 11:00 NaN
2021-01-20 11:20 2
2021-01-20 11:22 5
2021-01-20 11:23 4
2021-01-20 11:40 3
2021-01-21 9:00 2
2021-01-21 10:00 NaN
This is my code:
for i in range(len(series)):
    if len(series[i]) != max_len:
        series[i] = series[i].reindex(longest_series.index)
However, the reindex leaves the i-th DataFrame with all NaN values. Also, I'm not sure how to retrieve max_len.
How can I get NaN values only at the newly added index values in each DataFrame?
Update:
selected_series = []
l = pd.DataFrame([{'2021-01-20 11:20': 4, '2021-01-20 11:22': 5, '2021-01-20 11:23': 4, '2021-01-20 11:40': 3}])
selected_series.append(l)
l = pd.DataFrame([{'2021-01-20 11:00': 1, '2021-01-20 11:22': 2, '2021-01-20 11:23': 4, '2021-01-20 11:40': 3, '2021-01-21 10:00': 1}])
selected_series.append(l)
l = pd.DataFrame([{'2021-01-20 11:20': 2, '2021-01-20 11:22': 5, '2021-01-20 11:23': 4, '2021-01-20 11:40': 3, '2021-01-21 9:00': 2}])
selected_series.append(l)
CodePudding user response:
If you do indeed have a Series of DataFrame objects, then this should work:
s.apply(lambda df: pd.merge(template.copy(), df, on="date", how="outer"))
Where s is a series of DataFrame objects with datetime column "date" and "value" column of any type, and template is a DataFrame representing all the rows each DataFrame in your series should have:
date
0 2021-01-20 11:00:00
1 2021-01-20 11:20:00
2 2021-01-20 11:22:00
3 2021-01-20 11:23:00
4 2021-01-20 11:40:00
5 2021-01-21 09:00:00
6 2021-01-21 10:00:00
Note: the DataFrame objects in your series need to have column names for this to work easily.
Result:
In [5]: s[0] # first dataframe in series s
date value
0 2021-01-20 11:00:00 NaN
1 2021-01-20 11:20:00 4.0
2 2021-01-20 11:22:00 5.0
3 2021-01-20 11:23:00 4.0
4 2021-01-20 11:40:00 3.0
5 2021-01-21 09:00:00 NaN
6 2021-01-21 10:00:00 NaN
To create a template DataFrame spanning, e.g., June 20 and June 21, 2020, in increments of 5 minutes (the column is named "date" so that the merge on "date" can find it):
template = pd.DataFrame({"date": pd.date_range("2020-06-20", "2020-06-22", freq="5min")})
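If a fixed 5-minute grid is not what you want, the template can also be built from the timestamps that actually occur in your data. The following is only a sketch under some assumptions: the input is a plain list of one-row frames shaped like the question's Update (with the hour zero-padded), and to_long, frames and aligned are hypothetical names introduced for illustration. It uses a list comprehension with a left merge instead of Series.apply with an outer merge, so the rows keep the template's order.

```python
import pandas as pd

# Sample data shaped like the question's Update: one row per frame,
# timestamps as column labels (hour zero-padded here, an assumption).
selected_series = [
    pd.DataFrame([{"2021-01-20 11:20": 4, "2021-01-20 11:22": 5,
                   "2021-01-20 11:23": 4, "2021-01-20 11:40": 3}]),
    pd.DataFrame([{"2021-01-20 11:00": 1, "2021-01-20 11:22": 2,
                   "2021-01-20 11:23": 4, "2021-01-20 11:40": 3,
                   "2021-01-21 10:00": 1}]),
    pd.DataFrame([{"2021-01-20 11:20": 2, "2021-01-20 11:22": 5,
                   "2021-01-20 11:23": 4, "2021-01-20 11:40": 3,
                   "2021-01-21 09:00": 2}]),
]


def to_long(wide):
    """Turn a one-row wide frame into a frame with 'date' and 'value' columns."""
    out = wide.iloc[0].rename_axis("date").reset_index(name="value")
    out["date"] = pd.to_datetime(out["date"])
    return out


frames = [to_long(w) for w in selected_series]

# Template: every timestamp that occurs in any frame, sorted.
all_dates = pd.concat([f["date"] for f in frames]).drop_duplicates().sort_values()
template = pd.DataFrame({"date": all_dates.to_numpy()})
max_len = len(template)

# A left merge keeps the template's row order; timestamps that a frame
# does not contain get NaN in its "value" column.
aligned = [template.merge(f, on="date", how="left") for f in frames]
```

Here max_len falls out as the number of rows in the template, so it does not have to be computed separately.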
CodePudding user response:
You can use pandas.concat to align the columns, then split again (DataFrame.iteritems was removed in pandas 2.0, so items is used here):
df = pd.concat(selected_series).T
lst = list(zip(*df.items()))[1]
Output:
(2021-01-20 11:20 4.0
2021-01-20 11:22 5.0
2021-01-20 11:23 4.0
2021-01-20 11:40 3.0
2021-01-20 11:00 NaN
2021-01-21 10:00 NaN
2021-01-21 9:00 NaN
Name: 0, dtype: float64,
2021-01-20 11:20 NaN
2021-01-20 11:22 2.0
2021-01-20 11:23 4.0
2021-01-20 11:40 3.0
2021-01-20 11:00 1.0
2021-01-21 10:00 1.0
2021-01-21 9:00 NaN
Name: 0, dtype: float64,
2021-01-20 11:20 2.0
2021-01-20 11:22 5.0
2021-01-20 11:23 4.0
2021-01-20 11:40 3.0
2021-01-20 11:00 NaN
2021-01-21 10:00 NaN
2021-01-21 9:00 2.0
Name: 0, dtype: float64)
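Note that the rows above are in order of first appearance, not chronological, and the labels are still strings. A possible variant, sketched under the assumption that the labels parse cleanly with pd.to_datetime (the hour is zero-padded in this sample data), converts the index to datetimes and sorts it before splitting. The max_len asked about in the question is then just the number of rows.

```python
import pandas as pd

# Sample data shaped like the question's Update (hour zero-padded, an assumption).
selected_series = [
    pd.DataFrame([{"2021-01-20 11:20": 4, "2021-01-20 11:22": 5,
                   "2021-01-20 11:23": 4, "2021-01-20 11:40": 3}]),
    pd.DataFrame([{"2021-01-20 11:00": 1, "2021-01-20 11:22": 2,
                   "2021-01-20 11:23": 4, "2021-01-20 11:40": 3,
                   "2021-01-21 10:00": 1}]),
    pd.DataFrame([{"2021-01-20 11:20": 2, "2021-01-20 11:22": 5,
                   "2021-01-20 11:23": 4, "2021-01-20 11:40": 3,
                   "2021-01-21 09:00": 2}]),
]

# ignore_index=True relabels the rows 0, 1, 2, so the transposed
# columns are unique instead of all being named 0.
df = pd.concat(selected_series, ignore_index=True).T

# Convert the string labels to timestamps and sort chronologically.
df.index = pd.to_datetime(df.index)
df = df.sort_index()

max_len = len(df)                      # number of distinct timestamps
lst = [df[col] for col in df.columns]  # one Series per original frame
```

Each element of lst has the same sorted DatetimeIndex, with NaN only where the original frame had no value for that timestamp.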
