I want to clean a column of an R dataframe that contains a mixture of values:
| chromosome_count |
|---|
| 26 |
| 54 |
| c.36 |
| 28-30 |
| 12, 24 |
so that it looks like this, with comma separated values split into two rows and keeping only the minimum values where a range is recorded:
| chromosome_count |
|---|
| 26 |
| 54 |
| 36 |
| 28 |
| 12 |
| 24 |
I'm a very stumped beginner and any advice would be very appreciated.
CodePudding user response:
require(tidyverse)
df <- tibble(
chromo = c(26,
54,
"c.36",
"28-30",
"12, 24")
)
df %>%
separate_rows(chromo) %>%
mutate(chromo = chromo %>%
str_replace_all("[^0-9]", ""))
# A tibble: 7 x 1
chromo
<chr>
1 26
2 54
3 36
4 28
5 30
6 12
7 24
CodePudding user response:
You could use regular expressions. ie remove from the string the -30 ie use a look behind, and if its a number, delete the end part of the range. This solution assumes the range is ordered min-max. Also delete anything from the start of a line that is not a digit
df %>%
mutate(chromosome_count = str_remove(chromosome_count, "(?<=\d)-\\d |^\\D ")) %>%
separate_rows(chromosome_count, convert = TRUE)
# A tibble: 6 x 1
chromosome_count
<int>
1 26
2 54
3 36
4 28
5 12
6 24
