I am working with 16S data and am trying to format an OTU table so that I can upload it to a different tool.
The taxonomy column does not yet have the layout that the tool expects.
So I need R to count the number of semicolons ";" in each cell of the "taxonomy" column and, if there are fewer than six, append as many semicolons as needed so that every cell contains exactly six. I am new to the bioinformatics field, so any help would be much appreciated!
I tried
ifelse(str_count(ASV$taxonomy, ";") >= 6, ASV$taxonomy, paste0(ASV$taxonomy, " ;"))
but this appends only a single semicolon, and I don't know how to tell R to add as many semicolons as are needed to reach six in each cell.
Thank you in advance, Lea
CodePudding user response:
Since I don't have your dataset, I made an example data frame.
Next time you ask, please don't post images of code or data; instead, run dput(your_data) and paste the result into the question.
Input
library(tidyverse)
df <- tibble(Name = LETTERS[1:4],
             OTU = c("A;", "A; B;", "A; B; C; D; E; F;", "A; B; C; D; E;"))
df
# A tibble: 4 x 2
  Name  OTU
  <chr> <chr>
1 A     A;
2 B     A; B;
3 C     A; B; C; D; E; F;
4 D     A; B; C; D; E;
Code and output
First, count the number of ";" in each value of the OTU column. If there are fewer than 6, append the missing number of "; " to the end of the string; otherwise, keep the original OTU value.
Note that this overwrites the original OTU column.
df %>% mutate(OTU = ifelse(str_count(OTU, ";") < 6,
                           paste(OTU, str_dup("; ", 6 - str_count(OTU, ";"))),
                           OTU))
# A tibble: 4 x 2
  Name  OTU
  <chr> <chr>
1 A     "A; ; ; ; ; ; "
2 B     "A; B; ; ; ; ; "
3 C     "A; B; C; D; E; F;"
4 D     "A; B; C; D; E; ; "
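If the padding is needed in more than one place, the same idea can be wrapped in a small function so that the target count is not hard-coded. This is only a sketch: pad_semicolons() and its arguments are hypothetical names, not part of stringr. It uses pmax() so that strings which already contain six or more semicolons are left unchanged, and str_c() so that missing values stay NA.

```r
library(stringr)

# Hypothetical helper (not part of stringr): append `pad` until each string
# contains at least `n` occurrences of `sep`.
pad_semicolons <- function(x, n = 6, sep = ";", pad = " ;") {
  # number of separators still missing per element, never negative
  n_missing <- pmax(0, n - str_count(x, fixed(sep)))
  # str_dup() is vectorised over `times`; str_c() propagates NA
  str_c(x, str_dup(pad, n_missing))
}

otu <- c("A;", "A; B;", "A; B; C; D; E; F;", "A; B; C; D; E;")
padded <- pad_semicolons(otu)
padded
str_count(padded, fixed(";"))
```

Inside a pipeline this could be used as df %>% mutate(OTU = pad_semicolons(OTU)). Because the pad is " ;" and the pieces are joined without paste()'s default separator, the padded strings end in ";" rather than in a trailing space.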
CodePudding user response:
We could use separate() from the tidyr package with the fill argument to split the string into seven pieces,
then paste the pieces back together, and finally replace NA with "".
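library(tidyverse)
df %>%
  # split on ";" plus any following spaces; short rows are padded with NA on the right
  separate(col1, c("a","b","c","d","e","f","g"), fill = "right", sep = ";\\s*") %>%
  # glue the seven pieces back together, which always yields six ";"
  mutate(col1 = paste(a,b,c,d,e,f,g, sep = "; "), .keep="unused") %>%
  # drop the "NA" text that paste() produced for the padded pieces
  mutate(col1 = str_replace_all(col1, "NA", ""))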
   col1
   <chr>
 1 "Bacteria; Proteobacteria; Gammaproteobacteria; ; ; ; "
 2 "Bacteria; Proteobacteria; Gammaproteobacteria; ; ; ; "
 3 "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter; "
 4 "Bacteria; Gemmatimonadetes; Gemm-1; ; ; ; "
 5 "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; ; ; "
 6 "Bacteria; Actinobacteria; Nitriliruptoria; Nitriliruptorales; Nitriliruptoraceae; ; "
 7 "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; ; ; "
 8 "Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales; ; ; "
 9 "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; ; ; "
10 "Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Kaistobacter; "
11 "Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales; ; ; "
12 "Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Oxalobacteraceae; Ralstonia; "
13 "Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Pseudonocardiaceae; ; "
14 "Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Micrococcaceae; Arthrobacter; "
data:
structure(list(col1 = c("Bacteria; Proteobacteria; Gammaproteobacteria;",
"Bacteria; Proteobacteria; Gammaproteobacteria;", "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter;",
"Bacteria; Gemmatimonadetes; Gemm-1;", "Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;",
"Bacteria; Actinobacteria; Nitriliruptoria; Nitriliruptorales; Nitriliruptoraceae;",
"Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;",
"Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales;",
"Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales;",
"Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Kaistobacter;",
"Bacteria; Actinobacteria; Thermoleophilia; Solirubrobacterales;",
"Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Oxalobacteraceae; Ralstonia;",
"Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Pseudonocardiaceae;",
"Bacteria; Actinobacteria; Actinobacteria; Actinomycetales; Micrococcaceae; Arthrobacter;"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-14L))
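One caveat with replacing the text "NA" after pasting, as in the separate() approach above, is that it also removes those two letters wherever they occur inside a taxon name. A possible variation, sketched here under the assumption that dplyr (version 1.0 or later) and tidyr are installed, fills the missing pieces with empty strings before joining them with unite(). The column names in ranks are arbitrary placeholders, and the third input row uses a made-up name ("NAexample") purely to illustrate the point.

```r
library(dplyr)
library(tidyr)

# Small illustrative input; the third row is a made-up value, not real taxonomy
df <- tibble(col1 = c(
  "Bacteria; Proteobacteria; Gammaproteobacteria;",
  "Bacteria; Actinobacteria; Rubrobacteria; Rubrobacterales; Rubrobacteraceae; Rubrobacter;",
  "Bacteria; NAexample;"
))

ranks <- c("a", "b", "c", "d", "e", "f", "g")  # placeholder column names

out <- df %>%
  # split on ";" plus any following spaces; short rows get NA on the right
  separate(col1, ranks, sep = ";\\s*", fill = "right") %>%
  # turn the padded NA values into empty strings before joining
  mutate(across(all_of(ranks), ~ replace_na(.x, ""))) %>%
  # seven pieces joined by "; " always contain six semicolons
  unite("col1", all_of(ranks), sep = "; ")

out$col1
```

Since the NA values are replaced column by column before the pieces are joined, names such as the hypothetical "NAexample" pass through unchanged.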
