I have a df and a column with strings that looks like following:
runtime
1h 38m
20h 4m
5h
45m
empty
and I am trying to apply a function which will convert it to minutes.
So far, I have come up with part of it:
def runtime_to_minutes(string):
try:
capt_numbers = re.compile(r'[\d ][\d ]')
hours = int(re.findall(capt_numbers, string)[0])
minutes = int(re.findall(capt_numbers, string)[1])
duration = hours * 60 minutes
return duration
except Exception as error:
return str(error)
which obviously cannot handle all the situations, although it won't work for '1h 38m' either as I get an error list index out of range when I do: df['minutes'] = df['runtime'].apply(lambda s: runtime_to_minutes(s))
How should I restructure the regex and the function to get the desired result?
CodePudding user response:
You can use
import pandas as pd
df = pd.DataFrame({'runtime':['1h 38m','20h 4m','5h','45m','empty']})
df[['hours', 'minutes']] = df['runtime'].str.extract(r'(?=\d \s*[hm]\b)(?:(\d )\s*h)?(?:\s*(\d )\s*m)?').fillna(0)
df['minutes'] = df['hours'].astype(int) * 60 df['minutes'].astype(int)
df.drop('hours', axis=1, inplace=True)
# => df
# runtime minutes
# 0 1h 38m 98
# 1 20h 4m 1204
# 2 5h 300
# 3 45m 45
# 4 empty 0
See the regex demo. The pattern extracts two captures, hours and minutes. Both parts are optional, but the lookahead makes sure at least one part is present.
(?=\d \s*[hm]\b)- a positive lookahead that requires one or more digits, zero or more whitespaces, and thenhormnot followed with any other word char(?:(\d )\s*h)?- an optional non-capturing group capturing one or more digits into Group 1, and then just matching zero or more whitespaces andh(?:\s*(\d )\s*m)?- an optional non-capturing group matching zero or more whitespaces, then capturing one or more digits into Group 2, and then zero or more whitespaces andmare matched.
If no match occurs, .fillna(0) puts 0 as default value.
The hours and minutes are saved in hours and minutes columns.
Then, the minutes are calculated and hours column is dropped.
CodePudding user response:
def runtime_to_minutes(string):
duration = 0
capt_numbers = re.compile(r'(\d )\s*h')
try:
hours = int(re.findall(capt_numbers, string)[0])
except:
hours=0
capt_numbers = re.compile(r'(\d )\s*m')
try:
minutes = int(re.findall(capt_numbers, string)[0])
except:
minutes = 0
duration = int(hours) * 60 int(minutes)
return duration
liste = [' 1h 38m',
' 20h 4m',
' 5h',
' 45m', 'empty']
for li in liste:
time = runtime_to_minutes(li)
print(time)
Results:
98
1204
300
45
0
I changed your function a little bit, now everything is working. I used 2 try/except blocks and set the number to zero in case of an Exception!
I forgot to add:
It will also work with pandas:
df = pd.DataFrame(liste)
df.columns = ['time']
df['time_2'] = df.time.apply(runtime_to_minutes)
time time_2
0 1h 38m 98
1 20h 4m 1204
2 5h 300
3 45m 45
4 empty 0
