I'm trying to write a regex pattern that captures four different groups. The first group ends when we encounter either _ne, _re, or a dot. The second group is optional: it captures the re or ne if present, otherwise it is empty. The third and fourth groups are easier to capture, as they are just words preceded by a dot. Here is a code snippet to generate sample data:
import pandas as pd
sample = pd.Series(["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]).rename('raw')
Using the following pattern, (\w+)(?:_(ne|re)\d*)\.(\w*)\.(\w*), I can capture most cases:
| | raw | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | abc_ne.c.d | abc | ne | c | d |
| 1 | kc_E5_re.c.d | kc_E5 | re | c | d |
| 2 | kc_E5_re13.c.d | kc_E5 | re | c | d |
| 3 | kc_E5.c.d | nan | nan | nan | nan |
The exception is when the second group is absent (row 3), in which case the match fails.
I tried making it optional: (\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*)
but then the first group captures everything up to the dot:
| | raw | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | abc_ne.c.d | abc_ne | nan | c | d |
| 1 | kc_E5_re.c.d | kc_E5_re | nan | c | d |
| 2 | kc_E5_re13.c.d | kc_E5_re13 | nan | c | d |
| 3 | kc_E5.c.d | kc_E5 | nan | c | d |
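The same result shows up with the standard `re` module, so it does not look pandas-specific. A minimal sketch, assuming the optional pattern above with a `+` quantifier on the first group (the names `pattern`, `samples` and `groups` are only illustrative): since `\w` also matches underscores and digits, the greedy first group can consume the `_re13` suffix itself, and the optional group is then simply skipped.

```python
import re

# The optional attempt from above; the first group is greedy.
pattern = re.compile(r"(\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*)")

samples = ["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]

# Map each sample to the tuple of its captured groups.
# A group that did not take part in the match is reported as None.
groups = {s: pattern.search(s).groups() for s in samples}

for s, g in groups.items():
    print(s, g)
```

In every row the second element comes back as None, which corresponds to the nan column in the table above.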
This snippet could be used to capture groups with pandas if needed:
pattern = r'(\w+)(?:_(ne|re)\d*)?\.(\w*)\.(\w*)'
sample.to_frame().join(sample.str.extract(pattern))
The expected output is:
| | raw | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | abc_ne.c.d | abc | ne | c | d |
| 1 | kc_E5_re.c.d | kc_E5 | re | c | d |
| 2 | kc_E5_re13.c.d | kc_E5 | re | c | d |
| 3 | kc_E5.c.d | kc_E5 | nan | c | d |
Can anyone help me get the pattern right?
Thanks in advance.
CodePudding user response:
I'd say you probably want the 2nd group in an optional non-capture group and make the characters captured by the 1st group lazy:
^(\w+?)(?:_([nr]e\d*))?\.(\w+)\.(\w+)$
See an online demo
- ^ - Start-line anchor;
- (\w+?) - 1st capture group to catch 1+ (lazy) word characters (thus including underscores);
- (?:_([nr]e\d*))? - Optional non-capture group to match an underscore and a nested 2nd capture group to match either 're' or 'ne' followed by 0+ digits;
- \.(\w+)\.(\w+) - Match the 3rd and 4th capture groups in succession, each preceded by a literal dot;
- $ - End-line anchor.
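As a rough sketch of how this could be applied to the sample data from the question with `Series.str.extract`: note that with the digits inside the 2nd capture group, the third row would yield `re13` rather than `re`. The variant below moves `\d*` outside the capture group, which is an adjustment assumed here so that the result lines up with the expected output in the question. The names `extracted` and `result` are illustrative.

```python
import pandas as pd

sample = pd.Series(
    ["abc_ne.c.d", "kc_E5_re.c.d", "kc_E5_re13.c.d", "kc_E5.c.d"]
).rename("raw")

# Lazy first group; optional non-capture group holding the ne/re capture.
# Assumption: \d* sits outside the 2nd capture group so trailing digits are dropped.
pattern = r"^(\w+?)(?:_([nr]e)\d*)?\.(\w+)\.(\w+)$"

# One column per capture group (0..3); NaN where the optional group is absent.
extracted = sample.str.extract(pattern)
result = sample.to_frame().join(extracted)
print(result)
```

The lazy quantifier makes the first group grow one character at a time, so the optional `_ne`/`_re` group gets a chance to match before the first group reaches the dot.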
