How do I split the ID from the annotation using a regex in the data frame below?
import pandas as pd
df=pd.DataFrame({"header":["SS50377_28860 All-trans-retinol 13,14-reductase"]})
The columns are supposed to look like this:
df_new=pd.DataFrame({"id":["SS50377_28860"],"header":["All-trans-retinol 13,14-reductase"]})
The following code doesn't work properly.
df.join(df["header"].str.split(r'\d ', 0, expand=True))
Thanks in advance!!
CodePudding user response:
You can split with one or more whitespaces between a digit and a letter:
df[['id','header']] = df['header'].str.split(r'(?<=\d)\s+(?=[A-Z])', n=1, expand=True)
Or, you may capture the ID pattern into one group and the rest into another:
df[['id', 'header']] = df['header'].str.extract(r'^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)', expand=True)
Or, you may simply use Series.str.split on the first run of whitespace:
df[['id', 'header']] = df['header'].str.split(r"\s+", n=1, expand=True)
Output:
>>> df
                              header             id
0  All-trans-retinol 13,14-reductase  SS50377_28860
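For reference, here is a minimal self-contained sketch that runs all three approaches on the sample row and checks that they agree. The names `raw` and `results` are only illustrative and are not part of the question; the `+` quantifiers are assumed to be what the patterns need to match "one or more" characters.

```python
import pandas as pd

# Sample frame from the question
raw = pd.DataFrame({"header": ["SS50377_28860 All-trans-retinol 13,14-reductase"]})

results = {
    # 1) split on whitespace that sits between a digit and an uppercase letter
    "lookaround_split": raw["header"].str.split(r"(?<=\d)\s+(?=[A-Z])", n=1, expand=True),
    # 2) capture the ID and the remainder as two groups
    "extract": raw["header"].str.extract(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", expand=True),
    # 3) split once, on the first run of whitespace
    "first_whitespace": raw["header"].str.split(r"\s+", n=1, expand=True),
}

for name, parts in results.items():
    parts.columns = ["id", "header"]  # name the two resulting columns
    print(name, parts.to_dict("records"))
```

On this single row all three produce the same two values; they only start to differ once the input is less regular.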
Details:
(?<=\d)\s+(?=[A-Z]) - matches one or more whitespace characters (\s+) that are immediately preceded by a digit ((?<=\d)) and immediately followed by an uppercase ASCII letter ((?=[A-Z]))
^([A-Z0-9]+_[A-Z0-9]+)\s+(.*) - matches the start of the string (^), then captures one or more uppercase ASCII letters or digits, an underscore (_), and again one or more uppercase ASCII letters or digits into Group 1 (column "id"), then matches one or more whitespace characters (\s+), and then captures the rest of the line into Group 2 (with (.*)).
Which solution you choose depends on how varied your input is and how much validation you want to apply here.
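To make that trade-off concrete, here is a rough sketch on a few made-up rows. Only the first row comes from the question; the other two are hypothetical inputs invented for illustration, as are the variable names.

```python
import pandas as pd

headers = pd.Series([
    "SS50377_28860 All-trans-retinol 13,14-reductase",  # row from the question
    "SS50377_28861 hypothetical protein",               # hypothetical: annotation starts lowercase
    "unannotated entry",                                # hypothetical: no ID at all
])

# Strict about the boundary: needs digit + whitespace + uppercase letter
by_lookaround = headers.str.split(r"(?<=\d)\s+(?=[A-Z])", n=1, expand=True)
# Strict about the ID shape: rows that do not match yield missing values
by_extract = headers.str.extract(r"^([A-Z0-9]+_[A-Z0-9]+)\s+(.*)", expand=True)
# No validation: whatever precedes the first whitespace becomes the "id"
by_first_space = headers.str.split(r"\s+", n=1, expand=True)

for name, parts in [("lookaround", by_lookaround),
                    ("extract", by_extract),
                    ("first whitespace", by_first_space)]:
    parts.columns = ["id", "header"]
    print(name)
    print(parts, end="\n\n")
```

The lookaround split leaves rows 1 and 2 unsplit, the extract version fills row 2 with missing values, and the plain whitespace split accepts everything, including "unannotated" as an ID. If malformed rows should be flagged, the extract pattern is probably the safer starting point.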
