I'm looking for a Python-based solution to a common task I often run into.
(I know, there must be cool SQL-magic for this, but... not now.)
The situation is:
Given a table, we want to change the field f2 if oid is the same.
(Assumed that f2 always contains bar or is empty if oid is 42)
| id | f1 | f2 | oid |
|---|---|---|---|
| 1 | foo | bar | 42 |
| 2 | foo | 49 | |
| 3 | baz | 42 |
Becomes:
| id | f1 | f2 | oid |
|---|---|---|---|
| 1 | foo | bar | 42 |
| 2 | foo | 49 | |
| 3 | baz | bar | 42 |
I'm currently using itertools.combinations to avoid repeated comparison.
This seems more efficient than a double for-loop.
But for very large datasets, it still becomes unhandy... considering the Big-O-Notation.
My question is: Can the problem be solved more elegantly via pandas or numpy?
CodePudding user response:
Assumed that f2 always contains bar or is empty if oid is 42
Maybe we could use groupby transform first as well:
df['f2'] = df.replace('', np.nan).groupby('oid')['f2'].transform('first').fillna('')
Output:
id f1 f2 oid
0 1 foo bar 42
1 2 foo 49
2 3 baz bar 42
CodePudding user response:
The only solution I can think of would be to use groupby ffill:
df['f2'] = df.replace('', np.nan).groupby('oid')['f2'].ffill()
Output:
>>> df
id f1 f2 oid
0 1 foo bar 42
1 2 foo NaN 49
2 3 baz bar 42
