I am importing a call center data table into pandas from an external reporting software and want to transform it for further use. I have a problem that one of the columns contains two types of information: dates and names. I would like to make two columns out of the first one to be able to create a multiindex.
How it looks now:
names_dates incoming_calls
Maria 20
Jan 03 4
Jan 04 10
Jan 05 6
Nick 22
Jan 03 5
Jan 04 9
Jan 05 4
Jan 06 4
Maria has received 20 calls over the week: 4 calls on the 3rd of January, 10 on the 4th and 6 on the 5th. Nick has worked 4 days during this week and has his own allocation of received calls.
I would like to transform the table as follows:
names dates incoming_calls
Maria Jan 03 4
Maria Jan 04 10
Maria Jan 05 6
Nick Jan 03 5
Nick Jan 04 9
Nick Jan 05 4
Nick Jan 06 4
Initialization code for your convenience:
data = {'names_dates': ['Maria', 'Jan 03', 'Jan 04', 'Jan 05',
'Nick', 'Jan 03', 'Jan 04','Jan 05','Jan 06'],
'incoming_calls': [20, 4, 10, 6, 22, 5, 9, 4, 4]}
df = pd.DataFrame(data)
# next steps?
I was searching for similar questions but only found threads on how to split columns based on certain criteria. This case is not a simple split. Appreciate your suggestions.
CodePudding user response:
Here's another way:
Presumably a name doesn't contain a number, so first we create a filter for names depending on whether an entry in "names_dates" contains a number or not. Then we assign group number depending on unique names using cumsum. We create a group number to name mapper using groupby_first and use it to assign names. Finally, use check_for_names filter again to only select the rows without names in names_dates and modify column names to fit the desired outcome.
check_for_names = ~df['names_dates'].str.contains(r'\d')
df['name'] = check_for_names.cumsum()
mapper = df.groupby('name')['names_dates'].first()
df['name'] = df['name'].map(mapper)
df = df[~check_for_names].rename(columns={'names_dates':'dates'})[['name','dates','incoming_calls']]
Output:
name dates incoming_calls
1 Maria Jan 03 4
2 Maria Jan 04 10
3 Maria Jan 05 6
5 Nick Jan 03 5
6 Nick Jan 04 9
7 Nick Jan 05 4
8 Nick Jan 06 4
CodePudding user response:
Try this:
df = df.groupby((~df['names_dates'].str.match(r'\w{3} \d{2}')).cumsum()).apply(lambda g: g.iloc[1:].assign(names=g['names_dates'].iloc[0])).reset_index(drop=True).rename({'names_dates':'dates'},axis=1)
Output:
>>> df
dates incoming_calls names
0 Jan 03 4 Maria
1 Jan 04 10 Maria
2 Jan 05 6 Maria
3 Jan 03 5 Nick
4 Jan 04 9 Nick
5 Jan 05 4 Nick
6 Jan 06 4 Nick
