I have a question about how to quickly fillna with a sequence in Python(pandas).I have a dataset like following(the true dataset is longer),
| Time | Number |
|---|---|
| t0 | NA |
| t1 | NA |
| t2 | NA |
| t3 | 0 |
| t4 | NA |
| t5 | NA |
| t6 | NA |
| t7 | NA |
| t8 | 0 |
| t9 | NA |
My requirement is to add numbers to N lines before and after non-blank lines, and the sequence range is range(-N,N 1).The interval between any two non-empty rows in the dataset is greater than C(constant), our N will be less than C, so there is no need to consider the coverage problem for the time being. Assuming N=2, the result I need is as follows :
| Time | Number |
|---|---|
| t0 | NA |
| t1 | -2 |
| t2 | -1 |
| t3 | 0 |
| t4 | 1 |
| t5 | 2 |
| t6 | -2 |
| t7 | -1 |
| t8 | 0 |
| t9 | 1 |
At present, the only way I can think of is to use a loop, but the efficiency is low. Does pandas have such a method to do it quickly?
CodePudding user response:
There are still some unknowns in your question, like what happens if the intervals overlap. Here I will consider that a further interval overwrites the previous one (you can do the other way around with a change of code).
Using rolling, groupby.cumcount, and a mask:
s = df['Number'].notna().shift(-N)
m = s.rolling(2*N 1, min_periods=1).max().astype(bool)
df['Number2'] = df.groupby(s.cumsum()).cumcount().sub(N).where(m)
NB. I used a slightly different example to show the overlap.
output:
Time Number Number2
0 t0 NaN NaN
1 t1 NaN -2.0
2 t2 NaN -1.0
3 t3 0.0 0.0
4 t4 NaN 1.0
5 t5 NaN -2.0 # here we have an overlap, use latter value
6 t6 NaN -1.0
7 t7 0.0 0.0
8 t8 NaN -2.0
9 t9 NaN -1.0
CodePudding user response:
Could we have a bit more information, please?
I notice in your end result t0 is still NA, why is that? Do you still want to keep some values as NaN?
In the example, t3 was 0 originally, and in the result it was still 0, which made sense within the sequence. But what if it was already populated, and not with 0 but another value like 5? Would you want to keep the 5 or assign the 0 from the sequence in that position? And if you kept the 5, would then t4 be 0?
Here's a few solutions just in case:
To fill in the NaN values with a sequence (skipping populated values)
#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([
{'Time': 't0', 'Number': None},
{'Time': 't1', 'Number': None},
{'Time': 't2', 'Number': None},
{'Time': 't3', 'Number': 0},
{'Time': 't4', 'Number': None},
{'Time': 't5', 'Number': None},
])
len_null = df['Number'].isna().sum()
starting_number = -2
df.loc[df['Number'].isna(), 'Number'] = list(range(starting_number, len_null-abs(starting_number)))
print(df)
Will give you this:
Time Number
0 t0 -2.0
1 t1 -1.0
2 t2 0.0
3 t3 0.0
4 t4 1.0
5 t5 2.0
To fill in the entire column with a sequence (overwriting populated values)
#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([
{'Time': 't0', 'Number': None},
{'Time': 't1', 'Number': None},
{'Time': 't2', 'Number': None},
{'Time': 't3', 'Number': 0},
{'Time': 't4', 'Number': None},
{'Time': 't5', 'Number': None},
])
starting_number = -2
df['Number'] = list(range(starting_number, len(df)-abs(starting_number)))
print(df)
Will give you:
Time Number
0 t0 -2
1 t1 -1
2 t2 0
3 t3 1
4 t4 2
5 t5 3
To fill in the NAN values with a sequence (skipping the sequence where values are populated)
#!/usr/bin/env python
import pandas as pd
df = pd.DataFrame([
{'Time': 't0', 'Number': None},
{'Time': 't1', 'Number': None},
{'Time': 't2', 'Number': None},
{'Time': 't3', 'Number': 0},
{'Time': 't4', 'Number': None},
{'Time': 't5', 'Number': None},
])
starting_number = -2
new_series = pd.Series(list(range(starting_number, len(df)-abs(starting_number))))
df['Number'] = df['Number'].fillna(new_series)
print(df)
Will give you:
Time Number
0 t0 -2.0
1 t1 -1.0
2 t2 0.0
3 t3 0.0
4 t4 2.0
5 t5 3.0
(Notice that 1 would have gone into the t3 slot, but since there was a 0 there already, it just skipped it)
