I have a dataframe that contains genes coded as P95, P104, etc. The number reflects the gene's position in a gene name list.
Gene names list (There are 2000 of them):
How can I change the P## codes in the dataframe into gene names in this case?
UPD: here is an example dataframe and a gene names list:
gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")
support <- c(0.95, 0.94, 0.93, 0.92)
df <- data.frame(gene, support)
gene_list <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
P10 corresponds to the 10th gene "J", P2 is "B", etc.
The result I want to obtain should look like this:
CodePudding user response:
One way might be with mutate and separate from the tidyverse (dplyr and tidyr).
separate can't split a column into an unknown number of columns, so I first had to calculate the maximum number of genes in the gene column (max_genes).
Data
gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")
support <- c(0.95, 0.94, 0.93, 0.92)
df <- data.frame(gene, support)
gene_list <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
Code
library(tidyverse)

# calculate max number of genes in column gene for spreading
max_genes <- ncol(str_extract_all(df$gene, "->", simplify = TRUE))

df %>%
  # remove brackets and spaces in column gene
  mutate(gene = str_remove_all(gene, "[()\\s]")) %>%
  # separate gene into name and expression
  separate(col = gene,
           sep = "->|,",
           into = paste0(c("gene_name", "exp"),
                         rep(1:max_genes, each = 2)),
           fill = "right") %>%
  # substitute gene number with gene name
  mutate(across(starts_with("gene_name"), ~ gene_list[as.numeric(str_remove(., "P"))]))
Output
  gene_name1 exp1 gene_name2 exp2 support
1          J   UP       <NA> <NA>    0.95
2          B   UP          I   UP    0.94
3          J   UP          C   UP    0.93
4          E NORM          G   UP    0.92
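
If you would rather keep the original string layout and only swap each P## code for its gene name, a base R sketch along these lines might also work. It is not part of the approach above; it assumes every code has the form "P" followed by digits and that every number is a valid index into gene_list. The helper variables x and m are just illustrative names.

```r
gene <- c("(P10->UP)", "(P2->UP, P9->UP)", "(P10->UP, P3->UP)", "(P5->NORM, P7->UP)")
support <- c(0.95, 0.94, 0.93, 0.92)
df <- data.frame(gene, support)
gene_list <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")

# work on a plain character vector (also covers the case where gene is a factor)
x <- as.character(df$gene)

# locate every "P<digits>" code in each string; the "P" in "UP" has no digits after it, so it is not matched
m <- gregexpr("P[0-9]+", x)

# replace each matched code with the gene name at that position in gene_list
regmatches(x, m) <- lapply(
  regmatches(x, m),
  function(codes) gene_list[as.integer(sub("P", "", codes))]
)

df$gene <- x
df
```

With the example data, the gene column should then read "(J->UP)", "(B->UP, I->UP)", "(J->UP, C->UP)" and "(E->NORM, G->UP)", while support stays unchanged.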



