As part of a larger data clean up exercise I am using pandas.Series.str.extractall(pat) on a data frame column. Calling extractall() returns a multi-index due to multiple matches. I would like to flatten and merge the result from extractall() into a data frame.
I've spent countless hours with various combinations of apply, agg, groupby, join and haven't made any progress. Any suggestions?
Simplified example
df = pd.DataFrame([["cat 123"], ["hat dog 36776"], ["dog 345"], ["fish 456890 hat"]], columns=['A'])
pat = r'(?:(?P<Mammal>(?:cat)|(?:dog))|(?P<Fish>(?:fish))|(?P<Hat>(?:hat)))'
df_ea = df['A'].str.extractall(pat)
Results in this table
| Mammal | Fish | Hat | ||
|---|---|---|---|---|
| match | ||||
| 0 | 0 | cat | NaN | NaN |
| 1 | 0 | NaN | NaN | hat |
| 1 | dog | NaN | NaN | |
| 2 | 0 | dog | NaN | NaN |
| 3 | 0 | NaN | fish | NaN |
| 1 | NaN | NaN | hat |
Desired result
df_test = pd.DataFrame([["cat", "", ""],
["dog", "", "hat"],
["dog", "", ""],
["", "fish", "hat"]],
columns=['Mammal', 'Fish', 'Hat'])
| Mammal | Fish | Hat | |
|---|---|---|---|
| 0 | cat | ||
| 1 | dog | hat | |
| 2 | dog | ||
| 3 | fish | hat |
CodePudding user response:
You could use groupby as follows:
df_ea = df['A'].str.extractall(pat).groupby(level=0).first()
This will be the result:
Mammal Fish Hat
0 cat None None
1 dog None hat
2 dog None None
3 None fish hat
CodePudding user response:
df_ea=df_ea.reset_index(drop=True).fillna('')
you've already done a great job you're only missing this last line
