Assuming the following data, I want to count the unique characters per row.
test <- data.frame(oe = c("A-1", "111", "-", "Sie befassen sich intensiv damit"))
So I thought I would use the [:graph:] helper to capture letters, numbers, and punctuation. However, it gives the wrong result, see below:
library(tidyverse)
test %>%
  mutate(unique_chars_correct = sapply(tolower(oe), function(x) sum(str_count(x, c(letters, 0:9, "-")) > 0)),
         unique_chars_wrong = sapply(tolower(oe), function(x) sum(str_count(x, "[:graph:]") > 0)))
which gives:
                                oe unique_chars_correct unique_chars_wrong
1                              A-1                    3                  1
2                              111                    1                  1
3                                -                    1                  1
4 Sie befassen sich intensiv damit                   13                  1
I assume that using [:graph:] only checks whether any of the characters belongs to [:graph:], but what I want to do is check each element of [:graph:] separately.
CodePudding user response:
[:graph:] gives the total count of matching characters per string; it does not distinguish unique characters:
> str_count(test$oe, "[:graph:]")
[1]  3  3  1 28
Thus, when we convert that single count to a logical (> 0) and take the sum, it returns just 1 for each string,
and it doesn't differentiate between numbers/letters/punct.
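To make the collapse to 1 concrete, here is a minimal base R sketch of the same computation. It deliberately avoids stringr so it runs without extra packages; `count_graph` is a hypothetical helper name introduced only for illustration, and `[[:graph:]]` is the base R spelling of the class.

```r
# Hypothetical helper: total number of visible (non-space) characters per string.
count_graph <- function(x) {
  # gregexpr() locates every [[:graph:]] match, regmatches() extracts them
  lengths(regmatches(x, gregexpr("[[:graph:]]", x)))
}

oe <- c("A-1", "111", "-", "Sie befassen sich intensiv damit")

totals <- count_graph(oe)   # one total per string
flags  <- totals > 0        # one logical per string
# Summing a single TRUE per string can only ever give 1:
collapsed <- vapply(flags, sum, integer(1))

print(totals)
print(collapsed)
```

Because there is only one pattern, there is only one logical value per string to sum, whereas the `c(letters, 0:9, "-")` version sums one logical per candidate character.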
If we need the number of character classes (letters, digits, punctuation) present in each string:
Reduce(`+`, lapply(c("[:alpha:]", "[:digit:]", "[:punct:]"),
    function(x) str_count(tolower(test$oe), x) > 0))
[1] 3 1 1 1
Or we may split each string into characters and then apply [:graph:] to the unique values, which counts the unique characters:
sapply(strsplit(tolower(test$oe), ""), function(x)
sum(str_count(unique(x), "[:graph:]") > 0))
[1]  3  1  1 13
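The split-then-unique idea can also be sketched in base R, without stringr. This is only an illustration under stated assumptions: `count_unique_graph` is a made-up name, and `grepl("[[:graph:]]", ...)` stands in for `str_count(..., "[:graph:]") > 0`.

```r
# Hypothetical helper: number of distinct visible characters per string,
# ignoring case and whitespace.
count_unique_graph <- function(x) {
  chars <- strsplit(tolower(x), "")   # list of single-character vectors
  vapply(
    chars,
    # keep each distinct character once, then count those in [[:graph:]]
    function(ch) sum(grepl("[[:graph:]]", unique(ch))),
    integer(1)
  )
}

oe <- c("A-1", "111", "-", "Sie befassen sich intensiv damit")
result <- count_unique_graph(oe)
print(result)
```

Spaces survive the split but are dropped by the `[[:graph:]]` filter, so the fourth string yields 13, matching the question's expected column.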
CodePudding user response:
You can use a backreference and a negative lookahead for this:
Data:
test <- data.frame(oe = c("A-1", "111", "-", "Abaa", "B cbb b"))
EDITED Solution: (also accounts for whitespace, which is not counted, as well as upper- and lower-case distinctions, which are disregarded)
library(stringr)
str_count(test$oe, "(?i)([^\\s])(?!.*\\1)")
[1] 3 1 1 2 2
How this works:
(?i): case-insensitive matching
([^\\s]): a capture group matching any character that is not a whitespace character
(?!: the start of a negative lookahead, which prevents a match (and hence inclusion in the str_count operation) when the character is followed by:
.*: any character occurring zero or more times
\\1: a backreference recalling the exact match of the capture group ([^\\s]); inside the negative lookahead, it prevents the matching and counting of any character that is repeated later in the string
): the end of the negative lookahead
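To see which characters the pattern actually keeps, the same regex can be run in base R with `perl = TRUE`. This is a sketch that assumes PCRE treats the lookahead and the case-insensitive backreference the same way stringr's engine does for these inputs; only the last occurrence of each character survives.

```r
# Same pattern as above, written as an R string
pattern <- "(?i)([^\\s])(?!.*\\1)"

oe <- c("A-1", "111", "-", "Abaa", "B cbb b")

# Extract every match: a character matches only if it is not repeated later
matches <- regmatches(oe, gregexpr(pattern, oe, perl = TRUE))
counts  <- lengths(matches)

print(matches[[4]])  # surviving characters of "Abaa"
print(counts)
```

For "Abaa", the leading "A" and the first "a" are rejected because another "a" follows them, so only "b" and the final "a" match, giving a count of 2.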
EDIT:
Alternatively, you can use dplyr:
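library(dplyr)
library(stringr)  # needed for str_split()
test %>%
  mutate(
    # set to lower-case and remove whitespace:
    oe = tolower(gsub("\\s", "", oe)),
    # split the strings into separate chars:
    oe_splt = str_split(oe, ""),
    # count unique chars (lapply always returns a list, whereas sapply
    # may simplify to a vector or matrix and break lengths()):
    count_unq = lengths(lapply(oe_splt, unique)))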
