I applied the following command to my dataframe:
df['date_article'] = df.pagePath.str.extract_regex(pattern='(?P<digit>/\d{4}/\d{2}/\d{2}/)')
This created the column 'date_article':
| pagePath | date_article |
|---|---|
| '/empresas/2021/10/22/tiendas-no-participan-buen' | {'digit': '/2021/10/22/'} |
| '/finanzas-personales/2021/10/22/pueden-cobrar-c | {'digit': '/2021/10/22/'} |
Now I want to keep only the date in 'date_article'.
Expected output:
| pagePath | date_article |
|---|---|
| '/empresas/2021/10/22/tiendas-no-participan-buen' | '/2021/10/22/' |
| '/finanzas-personales/2021/10/22/pueden-cobrar-c | '/2021/10/22/' |
I have tried many things, but nothing seems to work.
Thank you in advance for your help.
CodePudding user response:
How about the following (assuming a pandas DataFrame whose 'date_article' column holds dicts):
df['date_article'] = df.apply(lambda x: x['date_article']['digit'], axis=1)
CodePudding user response:
It appears that extract_regex returns a struct series:
Extract substrings defined by a regular expression using Apache Arrow (Google RE2 library).
Parameters
pattern (str) – A regular expression which needs to contain named capture groups, e.g. ‘letter’ and ‘digit’ for the regular expression ‘(?P<letter>[ab])(?P<digit>\d)’.
Returns
an expression containing a struct with field names corresponding to capture group identifiers.
So you will need to extract the field you want out of the struct. I'm not a Vaex expert but maybe something like:
struct_series = df.pagePath.str.extract_regex(pattern='(?P<digit>/\d{4}/\d{2}/\d{2}/)')
df['date_article'] = struct_series.struct.get('digit')
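To illustrate why each cell looks like {'digit': '/2021/10/22/'}, here is a minimal sketch that uses only Python's standard `re` module rather than Vaex or Arrow. It mimics the idea of one struct per row, keyed by capture group name, followed by picking out a single field. The helper name `extract_fields` is made up for this illustration and is not part of any library.

```python
import re

# Same pattern as in the question, written as a raw string.
PATTERN = re.compile(r'(?P<digit>/\d{4}/\d{2}/\d{2}/)')

def extract_fields(path):
    """Return a dict keyed by capture-group name, or None if nothing matches."""
    match = PATTERN.search(path)
    return match.groupdict() if match else None

paths = [
    '/empresas/2021/10/22/tiendas-no-participan-buen',
    '/finanzas-personales/2021/10/22/pueden-cobrar-c',
]

# Step 1: one "struct" (here a plain dict) per row, like the column in the question.
structs = [extract_fields(p) for p in paths]

# Step 2: keep only the 'digit' field of each struct.
dates = [s['digit'] if s is not None else None for s in structs]

print(structs[0])  # {'digit': '/2021/10/22/'}
print(dates)       # ['/2021/10/22/', '/2021/10/22/']
```

The second step is what the struct field access above is meant to do on the Vaex side: the regex has already matched, and only the named field still needs to be selected.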
CodePudding user response:
Use:
import pandas as pd
df = pd.DataFrame({'date_article':[{'digit': '/2021/10/22/'}]})
df['date_article'] = df['date_article'].apply(lambda x: x['digit'])
This uses a lambda function that returns the value of the 'digit' key for each entry in the column and assigns the result back to that column. Alternatively, why not extract the date directly with the following:
df = pd.DataFrame({'date_article':['sdfsdf/2021/10/22/']})
df['date_article'] = df['date_article'].str.extract(r'(/\d{4}/\d{2}/\d{2}/)', expand=False)
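A slightly fuller sketch of this direct-extraction idea, applied to the paths from the question. It assumes the data is in a pandas DataFrame (it is not Vaex code), and it passes `expand=False` so that `str.extract` returns a Series of strings instead of a one-column DataFrame.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    'pagePath': [
        '/empresas/2021/10/22/tiendas-no-participan-buen',
        '/finanzas-personales/2021/10/22/pueden-cobrar-c',
    ]
})

# With a single capture group and expand=False, the result is a Series
# of plain strings, so no dict or struct needs to be unpacked afterwards.
df['date_article'] = df['pagePath'].str.extract(r'(/\d{4}/\d{2}/\d{2}/)', expand=False)

print(df['date_article'].tolist())  # ['/2021/10/22/', '/2021/10/22/']
```

Rows where the pattern does not match would end up as NaN in 'date_article'.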
