R - NAs turn columns into character class (should be integer/numeric)

I imported a huge dataset with a lot of missing values, written as NA or N/A.

This is how I import the data:

Databsp<-read.csv("C:/Users/adminfor/Desktop/Neuer Ordner/Pseudonymized-Genet-Treatment-Summary-20220201120538.csv", na.strings=TRUE)

Next, I converted all the NA-like strings (such as "N/A") to NA using the following code:

a <- Databsp %>% replace_with_na_all(condition = ~.x %in% common_na_strings)

Now my question: why are columns that contain only numbers and NAs of class "character" rather than "integer" or "numeric"? I tried several approaches, but nothing seems to help.

CodePudding user response:

Replacing values does not change a column's class. Column classes are set when you import the data, and nothing in your code changes them afterwards. If a column in your CSV file contains only numbers and values that R recognizes as NA at import time, it will be numeric. But if it contains strings (including strings that you haven't yet told R are NA-equivalent, like "N/A"), then read.csv must read the column as character because those values are not numeric. Later, you replace the NA-equivalent strings with actual NAs, but that replaces values only; it does not change the class of the columns.
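A minimal sketch of that behaviour, using a made-up inline CSV in place of the real file (the column names and values here are hypothetical):

```r
# Hypothetical inline CSV standing in for the real file.
csv_text <- "id,dose\n1,10\n2,N/A\n3,30"

# "N/A" is not in the default na.strings ("NA"), so dose is read as character.
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(df$dose)   # "character"

# Replacing the string with a real NA changes the value, not the class.
df$dose[df$dose == "N/A"] <- NA
class(df$dose)   # still "character"
```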

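The bad solution would be to patch this after the fact: add an extra step after you replace the NA values and use the type.convert() function to re-assess the columns and convert them as necessary, e.g. a <- type.convert(a, as.is = TRUE). (Since R 4.0.0, type.convert() warns if as.is is not specified; as.is = TRUE keeps text columns as character instead of turning them into factors.)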
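As a rough illustration of that patch, here is a sketch on a small hypothetical data frame (not the original data) whose column already holds real NAs but is still character:

```r
# Hypothetical data frame: a character column holding numbers and a real NA.
a <- data.frame(dose = c("10", NA, "30"), stringsAsFactors = FALSE)
class(a$dose)    # "character"

# Re-assess each column; as.is = TRUE keeps non-numeric text as character
# instead of converting it to factor.
a <- type.convert(a, as.is = TRUE)
class(a$dose)    # "integer"
```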

The better solution is to give read.csv your list of NA-equivalent strings when you read in the data. This is what the na.strings argument is for. From ?read.csv:

na.strings
a character vector of strings which are to be interpreted as NA values.

So change your import line to

library(naniar)  # provides common_na_strings

Databsp <- read.csv(
  "C:/Users/adminfor/Desktop/Neuer Ordner/Pseudonymized-Genet-Treatment-Summary-20220201120538.csv",
  na.strings = common_na_strings
)

The columns should then be classed appropriately as they are read in, and you can skip the replace_with_na_all step because it is already taken care of. Relatedly, your current na.strings = TRUE does nothing useful: TRUE is not a character vector of NA strings, and passing it replaces the default, na.strings = "NA".
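To see the effect without the original file, here is a small sketch with an inline CSV. The na_like vector is a hypothetical, shortened stand-in for common_na_strings, and the column names are made up:

```r
# Hypothetical stand-in for common_na_strings; the real vector comes from naniar.
na_like <- c("NA", "N/A", "")

# Hypothetical inline CSV standing in for the real file.
csv_text <- "id,dose\n1,10\n2,N/A\n3,30"

# With the NA-equivalent strings supplied at import, "N/A" becomes a real NA
# and the column can be read as a number straight away.
df <- read.csv(text = csv_text, na.strings = na_like, stringsAsFactors = FALSE)

class(df$dose)   # "integer"
```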