How to separate IDs into different rows using R

I am using R. I have a column in a dataframe. Here is an example of part of the column:

|NEW.ID|
|------|
|P02538 [551-559]; P04259 [551-559]|
|A0A0B4J2F2 1xPhospho [T473]|
|Q8IVF2 1xPhospho [S1253]; 1xPhospho [S1748]|
|A0A1B0GX95 2xPhospho [S24; S26]|

I want to split the rows that contain two accession IDs. Although the IDs are separated by ';', I need to take into account that some IDs contain a ';' themselves, such as the third row in the column above. The only way I can see to tell the two cases apart is a condition like: if there is a '];' followed by a letter, split the row. However, I don't know how to go about this.

So in the example column above, I want to achieve:

|NEW.ID|
|------|
|P02538 [551-559]|
|P04259 [551-559]|
|A0A0B4J2F2 1xPhospho [T473]|
|Q8IVF2 1xPhospho [S1253]; 1xPhospho [S1748]|
|A0A1B0GX95 2xPhospho [S24; S26]|

So the original first row is split into two. Any help would be much appreciated, and please say if further clarification is required (I am still relatively new to Stack Overflow).

CodePudding user response:

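We can use separate_rows with regex lookarounds, i.e. split at the "; " (a semicolon followed by a space) that comes right after a closing bracket (]) and right before an uppercase letter. The lookbehind and lookahead only check the surrounding characters without consuming them, so the bracket and the first letter of the next ID are kept.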

library(tidyr)

# Split only at "; " that follows "]" and precedes an uppercase letter
separate_rows(df1, NEW.ID, sep = "(?<=\\]); (?=[A-Z])")

-output

# A tibble: 5 × 1
  NEW.ID                                     
  <chr>                                      
1 P02538 [551-559]                           
2 P04259 [551-559]                           
3 A0A0B4J2F2 1xPhospho [T473]                
4 Q8IVF2 1xPhospho [S1253]; 1xPhospho [S1748]
5 A0A1B0GX95 2xPhospho [S24; S26]
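
For anyone who prefers not to load tidyr, the same split can be sketched in base R with strsplit. This is only an illustration that reuses the regex from the answer. The name df2 is made up for the example, and the sketch assumes NEW.ID is the only column.

```r
# Same sample column as in the question
df1 <- data.frame(
  NEW.ID = c(
    "P02538 [551-559]; P04259 [551-559]",
    "A0A0B4J2F2 1xPhospho [T473]",
    "Q8IVF2 1xPhospho [S1253]; 1xPhospho [S1748]",
    "A0A1B0GX95 2xPhospho [S24; S26]"
  ),
  stringsAsFactors = FALSE
)

# perl = TRUE is required for the lookbehind/lookahead syntax
parts <- strsplit(df1$NEW.ID, "(?<=\\]); (?=[A-Z])", perl = TRUE)

# One row per piece; df2 is a hypothetical name for the result
df2 <- data.frame(NEW.ID = unlist(parts), stringsAsFactors = FALSE)
df2
```

If the data frame had further columns, they would have to be repeated to match, for example by indexing the rows with rep(seq_len(nrow(df1)), lengths(parts)) before assigning the split values.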

data

df1 <- structure(list(NEW.ID = c("P02538 [551-559]; P04259 [551-559]", 
"A0A0B4J2F2 1xPhospho [T473]", "Q8IVF2 1xPhospho [S1253]; 1xPhospho [S1748]", 
"A0A1B0GX95 2xPhospho [S24; S26]")), class = "data.frame", 
row.names = c(NA, -4L))
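
As a quick sanity check, the same pattern can be passed to grepl to see which rows contain a split point at all. This is a minimal sketch on the sample values. The names ids, pattern and has_split are made up for the illustration.

```r
# Sample values from the question
ids <- c(
  "P02538 [551-559]; P04259 [551-559]",
  "A0A0B4J2F2 1xPhospho [T473]",
  "Q8IVF2 1xPhospho [S1253]; 1xPhospho [S1748]",
  "A0A1B0GX95 2xPhospho [S24; S26]"
)

# "; " preceded by "]" and followed by an uppercase letter
pattern <- "(?<=\\]); (?=[A-Z])"

# TRUE for rows that hold more than one accession ID
has_split <- grepl(pattern, ids, perl = TRUE)

# Show only the rows that would be split
ids[has_split]
```

Only the first value matches here. In the third value the "; " is followed by a digit, and in the fourth it is not preceded by "]".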