R progressively search string of a hierarchical lookup table for matches

Time:01-12

I am working with OPS codes, which encode the type of procedure performed in a hospital. The OPS coding list has a hierarchical structure of the form X-XXX.XX, with each X being a number: the leading X- denotes a broad category, the XXX denotes a type of procedure within that category, and the trailing .XX denotes a subspecialization of the XXX.

So a code might look like X-XXX, X-XXX., X-XXX.X, or X-XXX.XX.

My problem is that a program we use collapses the structure of the code to XXXX, XXXXX, or XXXXXXX, and I would like to match the collapsed codes against the uncollapsed lookup table of definitions.

So I would like a routine that checks each digit in turn and then proceeds to the next one when performing the matching. grepl would not do, because 5381 would match 65381 (the uncollapsed codes would be 5-381 and 6-538.1), and these are totally different procedures. I need something that matches character by character (first number, second number, etc.) and respects the character positions.

When an exact match cannot be found, it should return the first match that matches the same character positions.
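
To illustrate the problem with unanchored matching, here is a small sketch using the codes from the first example below. It is only an illustration; the variable names are made up for it.

```r
codes <- c("65381", "53811", "5382")   # collapsed lookup codes

# Unanchored search: "5381" is also found inside "65381"
unanchored <- grepl("5381", codes, fixed = TRUE)
unanchored   # TRUE TRUE FALSE

# Anchored at the first character: only codes that begin with "5381" match
anchored <- grepl("^5381", codes)
anchored     # FALSE TRUE FALSE
```

Anchoring the pattern at position one is what makes the comparison respect the character positions.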

More examples in pseudocode:

which("5381" %in% c("65381","53811", "5382")) should return 2, since the second item matches all of the characters provided.

which("5381" %in% c("538110","538111", "538221")) should return 1 (because it is the first match; the lookup table within c() is sorted).

which("5381." %in% c("5381","538111", "538121")) should return 1 (because it is the first match; the lookup table within c() is sorted). Note that the period is ignored in the match.

which("5381.1" %in% c("5381","538111", "538112")) should return 2 (because it is the first entry that matches all five available digits; the first entry has no fifth digit).
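
The behaviour described in these examples could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: first_ops_match is a hypothetical helper name, and it assumes a single search key, a sorted lookup table, and codes made up of digits only. It relies on base R's startsWith(), available since R 3.3.0.

```r
# Hypothetical helper: index of the first lookup entry matching `key`
# position by position, ignoring dashes and periods.
first_ops_match <- function(key, table) {
  strip <- function(s) gsub("[^[:digit:]]", "", s)  # keep digits only
  k <- strip(key)
  tab <- strip(table)

  # Prefer an exact match when there is one
  exact <- match(k, tab)
  if (!is.na(exact)) return(exact)

  # Otherwise take the first entry that begins with all digits of the key
  hits <- which(startsWith(tab, k))
  if (length(hits) > 0) hits[1] else NA_integer_
}

first_ops_match("5381",   c("65381", "53811", "5382"))      # 2
first_ops_match("5381",   c("538110", "538111", "538221"))  # 1
first_ops_match("5381.",  c("5381", "538111", "538121"))    # 1
first_ops_match("5381.1", c("5381", "538111", "538112"))    # 2

# The lookup table can stay in its uncollapsed form
first_ops_match("5381", c("6-538.1", "5-381.1", "5-382"))   # 2
```

Because both sides are reduced to digits before comparing, the same call works whether the table is stored as XXXXX or as X-XXX.XX.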

I know this is not the best example of a question on SO, but I am open to improving it.

CodePudding user response:

This is probably more complicated than necessary, but it works.
First, define a generic that transforms the input strings to the OPS format. Then have a matching function check whether each element of x has y as a substring.

Note that the matching function does not check whether x is a substring of y; it is the other way around.

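# Generic: convert collapsed codes to the OPS format X-XXX.XX
as.ops <- function(x, ...) UseMethod("as.ops")
as.ops.default <- function(x, ...){
  warning("The default method coerces its argument to character and calls the character method")
  as.ops.character(as.character(x))
}
as.ops.character <- function(x, ...){
  # Drop everything that is not a digit (dashes, periods, spaces)
  x <- gsub("[^[:digit:]]", "", x)
  ops1 <- substr(x, 1, 1)   # first digit
  ops2 <- substr(x, 2, 4)   # digits 2 to 4
  ops3 <- substring(x, 5)   # any remaining digits
  y <- character(length(x))
  # Bucket by digit count: 1 = empty, 2 = 1-3 digits, 3 = 4-6 digits, 4 = 7 or more
  n <- findInterval(nchar(x), c(0, 1, 4, 7))
  y[n == 1] <- x[n == 1]    # empty input stays empty
  y[n != 1] <- paste(ops1[n != 1], ops2[n != 1], sep = "-")
  # Append the digits after the fourth only for codes in bucket 3
  o3 <- nchar(ops3) > 0
  y[n == 3 & o3] <- paste(y[n == 3 & o3], ops3[n == 3 & o3], sep = ".")
  y
}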
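ops_match <- function(x, y){
  # x: vector of lookup codes; y: a single code to search for
  xo <- as.ops(x)
  yo <- as.ops(y)
  # fixed = TRUE matches the "." in X-XXX.XX literally instead of as a regex wildcard
  i <- (xo %in% yo) | grepl(yo, xo, fixed = TRUE)
  which(i)
}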

x1 <- c("65381","53811", "5382")
x2 <- c("538110","538111", "538221")
x3 <- c("5381","538111", "538121")
x4 <- c("5381","538111", "538112")
y1 <- y2 <- "5381"
y3 <- "5381."
y4 <- "5381.1"

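# ops_match() returns every matching position; the first element is the first match
ops_match(x1, y1)  # 2
ops_match(x2, y2)  # 1 2
ops_match(x3, y3)  # 1 2 3
ops_match(x4, y4)  # 2 3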