In R, methods to reduce possible errors when labeling a factor with many levels

Time:01-19

I have a variable with 75 levels that I would like to label. However, I find it difficult to do so without mislabeling a level.

As you know, creating a factor with its levels is done like this:

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A','Treatment B','Treatment C'))

Is there a way to code this differently, so that each label is written next to its level? I'm looking for code with this structure:

'a' = 'Treatment A'
'b' = 'Treatment B'
'c' = 'Treatment C'

Thanks in advance.

CodePudding user response:

You could use a named vector for your level-label-pairs and convert to a factor like so:

foo <- c("a", "c", "b")

rec <- c(
  "a" = "Treatment A",
  "b" = "Treatment B",
  "c" = "Treatment C"
)

factor(foo, levels = names(rec), labels = rec)
#> [1] Treatment A Treatment C Treatment B
#> Levels: Treatment A Treatment B Treatment C
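
With 75 levels it may also be worth checking the lookup vector against the data before converting, because factor() silently turns any value that is not listed in levels into NA. The following is only a minimal sketch of such a check; the extra value "d" and the names unmapped and dup_levels are made up for illustration.

```r
foo <- c("a", "c", "b", "d")   # "d" is deliberately missing from rec

rec <- c(
  "a" = "Treatment A",
  "b" = "Treatment B",
  "c" = "Treatment C"
)

# Values present in the data that have no entry in the lookup vector
unmapped <- setdiff(unique(foo), names(rec))

# Level codes that were accidentally typed more than once
dup_levels <- names(rec)[duplicated(names(rec))]

if (length(unmapped) > 0) {
  message("No label for: ", paste(unmapped, collapse = ", "))
}

# Unmapped values become NA in the resulting factor
f <- factor(foo, levels = names(rec), labels = rec)
```

If unmapped and dup_levels are both empty, every value in the data has exactly one label.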

CodePudding user response:

If you have a long list of equivalences it's generally a good workflow to include it as a separate file, e.g. icdcodes.csv containing

code,descr
C00.0,Upper lip cancer
C00.1,Lower lip cancer
...

Then you could do:

codeinfo <- read.csv("icdcodes.csv")
factor(foo, levels = codeinfo$code, labels = codeinfo$descr)
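
To try this without a file on disk, the same idea can be sketched with the CSV contents passed as text. The two rows are the ones shown above; the vector foo and the duplicate check are assumptions added for illustration.

```r
# Stand-in for icdcodes.csv so the example runs without a file on disk
csv_text <- "code,descr
C00.0,Upper lip cancer
C00.1,Lower lip cancer"

codeinfo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Each code should appear exactly once in the lookup table
stopifnot(anyDuplicated(codeinfo$code) == 0)

foo <- c("C00.1", "C00.0", "C00.1")   # hypothetical data
f <- factor(foo, levels = codeinfo$code, labels = codeinfo$descr)
f
```

Keeping each code and its description on the same row of the file is what puts the label next to the level, which is the structure the question asks for.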

Ideally, you could even get the ICD-10 descriptions straight from the CDC, although in practice this probably doesn't work because the descriptions are longer than yours: C000, for example, is "Malignant neoplasm of external upper lip", not "Upper lip cancer". Also note that the CDC file doesn't use a dot separator in the codes (C000 rather than C00.0).

icd_url <- "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2022/icd10cm_codes_2022.txt"
codeinfo <- read.fwf(icd_url, widths = c(8,100))
names(codeinfo) <- c("code", "descr")
codeinfo$code <- trimws(codeinfo$code)
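
If your own data uses dotted codes, one possible way to reconcile the two formats is to insert the dot after the third character of each undotted code. This is only a sketch on a few hand-typed codes, not on the downloaded file, and the names cdc_codes and dotted are made up for illustration.

```r
# Hypothetical undotted codes, in the style the CDC file uses
cdc_codes <- c("C000", "C001", "C01")

# Put a dot after the third character, but only if more characters follow
dotted <- sub("^(.{3})(.+)$", "\\1.\\2", cdc_codes)
dotted
#> [1] "C00.0" "C00.1" "C01"
```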