I have a dataset consisting of 10,000 rows. I drew a random sample of 1,000 rows from it.
Name Age ...
Alice
Jasmine
Alice
Joel
Jimmy
Alice
Alex
Agar
Agar
When I count the number of occurrences of each name in the column:
name <- table(example['Name'], useNA = "ifany")
The output is strange: it includes a name, Bruce, with a count of 0. Bruce does not appear in the random sample of 1,000 rows, only in the original dataset. I only want to use the random sample. Is the 0 count normal? How can I get rid of it, or is that impossible?
Alice 3
Jasmine 1
Joel 1
Agar 2
Jimmy 1
Alex 1
Bruce 0
CodePudding user response:
You may use droplevels to drop unused factor levels.
name <- table(droplevels(example['Name']))
Consider this example -
set.seed(123)
#Sample dataframe
df <- data.frame(a = factor(sample(c('A', 'B', 'C'), 10, replace = TRUE)))
#Select only first 5 rows so we don't have any row with "A" value.
df1 <- df[1:5, , drop = FALSE]
table(df1['a'])
#A B C
#0 1 4
table(droplevels(df1['a']))
#B C
#1 4
CodePudding user response:
It sounds like your Name column is a factor, so table() reports a count for every factor level, including levels that no longer occur in the sample. Note in the help text for table(): "Only when exclude is specified (i.e., not by default) and non-empty, will table potentially drop levels of factor arguments." So you could specify exclude to drop the levels you do not want. Alternatively, re-create the factor in the random sample so that its levels are just the unique values found in that sample.
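A minimal sketch of both suggestions, assuming Name really is a factor. The data frames full and samp are hypothetical stand-ins for the original dataset and the random sample; they are not taken from the question.

```r
# Hypothetical data: "Bruce" exists in the full data but not in the subset
full <- data.frame(Name = factor(c("Alice", "Jasmine", "Alice", "Joel", "Bruce")))
samp <- full[1:4, , drop = FALSE]

# The unused level "Bruce" is still carried along and counted as 0
counts_before <- table(samp$Name, useNA = "ifany")

# Option 1: exclude the unwanted level by name
# (exclude only removes the levels you list, not every unused level)
counts_excl <- table(samp$Name, exclude = "Bruce")

# Option 2: re-create the factor so its levels are only the values present
samp$Name <- factor(samp$Name)
counts_after <- table(samp$Name, useNA = "ifany")
```

Option 2 is usually the more convenient one for a random sample, because you do not need to know in advance which names are missing from it.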
