Transform a column with mixed types into two homogeneous columns-CodePudding

I am importing a call center data table into pandas from an external reporting software and want to transform it for further use. I have a problem that one of the columns contains two types of information: dates and names. I would like to make two columns out of the first one to be able to create a multiindex.

How it looks now:

 names_dates  incoming_calls 
   Maria             20       
   Jan 03            4        
   Jan 04            10       
   Jan 05            6        
   Nick              22       
   Jan 03            5        
   Jan 04            9        
   Jan 05            4 
   Jan 06            4

Maria has received 20 calls over the week: 4 calls on the 3rd of January, 10 on the 4th and 6 on the 5th. Nick has worked 4 days during this week and has his own allocation of received calls.

I would like to transform the table as follows:

   names         dates      incoming_calls 
   Maria         Jan 03           4       
   Maria         Jan 04           10   
   Maria         Jan 05           6      
   Nick          Jan 03           5        
   Nick          Jan 04           9        
   Nick          Jan 05           4  
   Nick          Jan 06           4

Initialization code for your convenience:

data = {'names_dates': ['Maria', 'Jan 03', 'Jan 04', 'Jan 05', 
                        'Nick', 'Jan 03', 'Jan 04','Jan 05','Jan 06'],
        'incoming_calls': [20, 4, 10, 6, 22, 5, 9, 4, 4]}

df = pd.DataFrame(data)

# next steps?

I was searching for similar questions but only found threads on how to split columns based on certain criteria. This case is not a simple split. Appreciate your suggestions.

CodePudding user response：

Here's another way:

Presumably a name doesn't contain a number, so first we create a filter for names depending on whether an entry in "names_dates" contains a number or not. Then we assign group number depending on unique names using cumsum. We create a group number to name mapper using groupby_first and use it to assign names. Finally, use check_for_names filter again to only select the rows without names in names_dates and modify column names to fit the desired outcome.

check_for_names = ~df['names_dates'].str.contains(r'\d')
df['name'] = check_for_names.cumsum()
mapper = df.groupby('name')['names_dates'].first()
df['name'] = df['name'].map(mapper)
df = df[~check_for_names].rename(columns={'names_dates':'dates'})[['name','dates','incoming_calls']]

Output:

    name   dates  incoming_calls
1  Maria  Jan 03               4
2  Maria  Jan 04              10
3  Maria  Jan 05               6
5   Nick  Jan 03               5
6   Nick  Jan 04               9
7   Nick  Jan 05               4
8   Nick  Jan 06               4

CodePudding user response：

Try this:

df = df.groupby((~df['names_dates'].str.match(r'\w{3} \d{2}')).cumsum()).apply(lambda g: g.iloc[1:].assign(names=g['names_dates'].iloc[0])).reset_index(drop=True).rename({'names_dates':'dates'},axis=1)

Output:

>>> df
    dates  incoming_calls  names
0  Jan 03               4  Maria
1  Jan 04              10  Maria
2  Jan 05               6  Maria
3  Jan 03               5   Nick
4  Jan 04               9   Nick
5  Jan 05               4   Nick
6  Jan 06               4   Nick