I was wondering if there is a way to convert decimal numbers to ternary, given that there is a function intToBits for converting to binary.
I actually need to convert a character string like
> S0 <- c("Hello Stac")
to base 3. I thought to first convert it to decimal with
> S01 <- utf8ToInt(S0)
> S01
## [1] 72 101 108 108 111 32 83 116 97 99
then convert the result to base 3. I want to obtain something like this:
> S1
## [1] 2200 10202 11000 11010 11022 1012 10002 11022 10121 10200
CodePudding user response:
For practice, you can write your own converter function, like the one below:
f <- function(x, base = 3) {
  q <- c()
  while (x) {
    q <- c(x %% base, q)
    x <- x %/% base
  }
  # as.numeric(paste0(q, collapse = ""))
  sum(q * 10^(rev(seq_along(q) - 1)))
}
or with recursion:
f <- function(x, base = 3) {
  # pass `base` along so the recursion also works for bases other than 3
  ifelse(x < base, x, f(x %/% base, base) * 10 + x %% base)
}
then you can run
> sapply(utf8ToInt(S0),f)
[1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
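The commented-out line in the loop version hints at a string-based variant. A minimal sketch of that idea follows; `to_base_chr` is a hypothetical name, not part of the answer above, and it assumes a single non-negative whole number as input. Returning a character string keeps the digits from being read as a decimal number.

```r
# Hypothetical helper (not from the answer above): digits of x in `base`,
# returned as one character string instead of a decimal-looking number.
to_base_chr <- function(x, base = 3) {
  if (x == 0) return("0")
  digits <- integer(0)
  while (x > 0) {
    digits <- c(x %% base, digits)  # prepend the least significant digit
    x <- x %/% base
  }
  paste(digits, collapse = "")
}

res <- vapply(utf8ToInt("Hello Stac"), to_base_chr, character(1))
res
# [1] "2200"  "10202" "11000" "11000" "11010" "1012"  "10002" "11022" "10121" "10200"
```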
CodePudding user response:
You can use cwhmisc::int2B:
library(cwhmisc)
int2B(utf8ToInt(S0), 3)[[1]] |> as.numeric()
# [1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
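One way to sanity-check a result like this is to convert the digit strings back with base R's `strtoi()` and compare with the original code points. This is only a sketch; the `digits` vector is typed in by hand from the output above.

```r
# Base-3 digit strings copied from the output above
digits <- c("2200", "10202", "11000", "11000", "11010",
            "1012", "10002", "11022", "10121", "10200")

# strtoi() parses each string as an integer in the given base
codes <- strtoi(digits, base = 3L)

# The round trip should reproduce the original string
intToUtf8(codes)
# [1] "Hello Stac"
```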
CodePudding user response:
Nice programming exercise. I have vectorized @ThomasIsCoding's answer to avoid expensive loops over strings and characters within strings. The idea is to loop over digits instead, since Unicode code points do not exceed 21 digits in any base, whereas the total number of characters in a character vector can be orders of magnitude greater.
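To illustrate the idea in isolation, here is a small sketch of a loop over digit positions that handles every character at once. It is a simplified stand-in rather than the function itself: the `width` of 5 is an assumption that holds for this input (3^5 = 243 exceeds every code point in it), and the digits go into a numeric matrix instead of strings.

```r
codes <- utf8ToInt("Hello Stac")
b <- 3
width <- 5L  # assumed: 3^5 = 243 is larger than every code point here

digits <- matrix(0, nrow = length(codes), ncol = width)
rem <- codes
for (pos in seq_len(width)) {
  pow <- b^(width - pos)        # place value of this digit position
  digits[, pos] <- rem %/% pow  # one vectorized step covers all characters
  rem <- rem %% pow
}

digits[1, ]  # 72 in base 3, zero-padded
# [1] 0 2 2 0 0
```

The loop body runs `width` times no matter how many characters there are.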
The function below takes as arguments a character vector x, a base b (from 2 to 10), and a logical flag double. It returns a list res such that res[[i]] is an nchar(x[i])-length vector giving the base-b representation of x[i]. The list elements are double vectors or character vectors depending on double.
utf8ToBase <- function(x, b = 10, double = TRUE) {
    ## Do some basic checks
    stopifnot(is.character(x), !anyNA(x),
              is.numeric(b), length(b) == 1L,
              b %% 1 == 0, b >= 2, b <= 10)
    ## Require UTF-8 encoding
    x <- enc2utf8(x)
    ## Operate on concatenation to avoid loop over strings
    xx <- paste(x, collapse = "")
    ixx <- utf8ToInt(xx)
    ## Handle trivial case early
    if (length(ixx) == 0L) {
        el <- if (double) base::double(0L) else character(0L)
        res <- rep.int(list(el), length(x))
        names(res) <- names(x)
        return(res)
    }
    ## Use common field width determined from greatest integer
    width <- as.integer(floor(1 + log(max(ixx, 1), base = b)))
    res <- rep.int(strrep("0", width), length(ixx))
    ## Loop over digits
    pos <- 1L
    pow <- b^(width - 1L)
    while (pos <= width) {
        quo <- ixx %/% pow
        substr(res, pos, pos) <- as.character(quo)
        ixx <- ixx - pow * quo
        pos <- pos + 1L
        pow <- pow %/% b
    }
    ## Discard leading zeros
    if (double) {
        res <- as.double(res)
        if (b == 2 && any(res > 0x1p+53)) {
            warning("binary result not guaranteed due to loss of precision")
        }
    } else {
        res <- sub("^0+", "", res)
    }
    ## Return list
    res <- split(res, rep.int(gl(length(x), 1L), nchar(x)))
    names(res) <- names(x)
    res
}
x <- c(foo = "Hello Stack Overflow!", bar = "Hello world!")
utf8ToBase(x, 2)
$foo
[1] 1001000 1100101 1101100 1101100 1101111 100000
[7] 1010011 1110100 1100001 1100011 1101011 100000
[13] 1001111 1110110 1100101 1110010 1100110 1101100
[19] 1101111 1110111 100001
$bar
[1] 1001000 1100101 1101100 1101100 1101111 100000
[7] 1110111 1101111 1110010 1101100 1100100 100001
utf8ToBase(x, 3)
$foo
[1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
[11] 10222 1012 2221 11101 10202 11020 10210 11000 11010 11102
[21] 1020
$bar
[1] 2200 10202 11000 11000 11010 1012 11102 11010 11020 11000
[11] 10201 1020
utf8ToBase(x, 10)
$foo
[1] 72 101 108 108 111 32 83 116 97 99 107 32 79 118 101
[16] 114 102 108 111 119 33
$bar
[1] 72 101 108 108 111 32 119 111 114 108 100 33
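The field width in the function comes from `log()`, which works in floating point. For a code point that is an exact power of the base, the logarithm could in principle land just below a whole number and give a width that is one too small; I have not checked on which platforms this actually happens. As a sketch of an alternative, the width can be counted with exact multiplication instead. `count_width` is a hypothetical helper, not part of the function above.

```r
# Hypothetical helper: number of base-b digits needed to write the
# positive whole number n, found by repeated (exact) multiplication.
count_width <- function(n, b) {
  width <- 1L
  pow <- b
  while (pow <= n) {
    pow <- pow * b
    width <- width + 1L
  }
  width
}

count_width(119, 3)       # 5 digits: 119 is 11102 in base 3
count_width(243, 3)       # 6 digits: 243 is 100000 in base 3
count_width(0x10FFFF, 2)  # 21 digits for the largest Unicode code point
```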
Some caveats:
For efficiency, the function concatenates the strings in x rather than looping over them. It throws an error if the concatenation would exceed 2^31-1 bytes, which is the maximum string size allowed by R.
x <- strrep(letters[1:2], 0x1p+30)
log2(sum(nchar(x))) # 31
utf8ToBase(x, 3)
Error in paste(x, collapse = "") : result would exceed 2^31-1 bytes
The largest Unicode code point is 0x10FFFF. The binary representation of this number exceeds 2^53 when interpreted as decimal, so it cannot be stored in a double vector without loss of precision:
x <- sub("^0+", "", paste(rev(as.integer(intToBits(0x10FFFF))), collapse = ""))
x
## [1] "100001111111111111111"
sprintf("%.0f", as.double(x))
## [1] "100001111111111114752"
As a defensive measure, the function warns if 2^53 is exceeded when b = 2 and double = TRUE.
utf8ToBase("\U10FFFF", b = 2, double = TRUE)[[1]]
[1] 1.000011e+20
Warning message:
In utf8ToBase("\U{10ffff}", b = 2, double = TRUE) :
  binary result not guaranteed due to loss of precision
utf8ToBase("\U10FFFF", b = 2, double = FALSE)[[1]]
[1] "100001111111111111111"
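The check is restricted to b = 2, and the following sketch suggests one reason that is enough (this is my reading, not something stated above). It counts how many digits the largest code point needs in each supported base. Every whole number with at most 15 decimal digits is below 2^53 (about 9.007e15), so only a base that needs more than 15 digits can lose precision. The names `max_cp`, `count_digits` and `digit_counts` are made up for this illustration.

```r
max_cp <- 0x10FFFF  # largest Unicode code point
bases <- 2:10

# Digits needed to write max_cp in base b, counted by exact multiplication
count_digits <- function(b) {
  n <- 1L
  pow <- as.numeric(b)
  while (pow <= max_cp) {
    pow <- pow * b
    n <- n + 1L
  }
  n
}

digit_counts <- vapply(bases, count_digits, integer(1))
names(digit_counts) <- bases
digit_counts

# Bases whose digit strings can have more than 15 digits
bases[digit_counts > 15L]
# [1] 2
```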
