As the title suggests, I want to compute a cumulative negativity index based on the following three levels:
head(data$sentiment)
Levels: negative neutral positive
sentiment : Factor w/ 4 levels "","negative",..: 3 3 3 3 3
Say negative is equivalent to 3, neutral to 2 and positive to 1, so the higher the score, the more negative. I intend to build an index from 0 to 100, with 100 being the most negative. The levels carry equal weight, and the index should aggregate the several sentiments recorded on a particular day. What would be the best approach?
CodePudding user response:
Base R has a function scale for this purpose. If x is a numeric vector listing scores from a to b and you want scores from 0 to M, then you would do:
scale(x, center = a, scale = (b - a) / M)
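As a quick sanity check, that call computes (x - a) / ((b - a) / M), which is the same as M * (x - a) / (b - a). Here is a minimal sketch with illustrative values a = 1, b = 3 and M = 100, chosen to match the question; the vector `x` is made up:

```r
# Illustrative values only: scores run from a = 1 to b = 3, target range is 0 to M = 100.
a <- 1
b <- 3
M <- 100
x <- c(1, 2, 3, 2.5)

# scale() subtracts `center` and then divides by `scale`.
via_scale  <- as.double(scale(x, center = a, scale = (b - a) / M))
via_manual <- M * (x - a) / (b - a)

via_scale
all.equal(via_scale, via_manual)
```

Both versions should agree up to floating-point tolerance, so which one you use is mostly a matter of taste.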
Before you can use scale, you need to coerce your factor sentiment to a numeric vector listing the equivalent scores, like so:
set.seed(1L)
sentiment <- gl(4L, 1L, labels = c("", "negative", "neutral", "positive"))[sample(4L, size = 12L, replace = TRUE)]
sentiment
##  [1]          positive neutral           negative
##  [7] neutral  neutral  negative negative neutral  neutral
## Levels:  negative neutral positive
str(sentiment)
## Factor w/ 4 levels "","negative",..: 1 4 3 1 2 1 3 3 2 2 ...
scores <- c(NA, 3, 2, 1)[as.integer(sentiment)]
scores
## [1] NA 1 2 NA 3 NA 2 2 3 3 2 2
Note that we have assigned a missing value NA to the sentiment "" appearing in your factor. Now you can do:
as.double(scale(scores, center = 1, scale = (3 - 1) / 100))
## [1] NA 0 50 NA 100 NA 50 50 100 100 50 50
Here, as.double is used only to coerce the result of scale (a 1-column matrix) to a vector.
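The positional lookup `c(NA, 3, 2, 1)[as.integer(sentiment)]` relies on the levels being stored in exactly that order. If you are not sure about the order, one possible alternative is to look the scores up by level name. This is only a sketch: the vector `lookup` is a name introduced here, and the toy factor is built by hand rather than sampled.

```r
# Toy factor with the same four levels as in the question, including the empty level "".
sentiment <- factor(c("", "positive", "neutral", "negative"),
                    levels = c("", "negative", "neutral", "positive"))

# Named lookup: the score attached to each meaningful level.
lookup <- c(negative = 3, neutral = 2, positive = 1)

# Index by level name; the empty string matches no name, so it becomes NA.
scores <- unname(lookup[as.character(sentiment)])
scores

# Rescale from [1, 3] to [0, 100]; NA stays NA.
index <- (scores - 1) / (3 - 1) * 100
index
```

Because the lookup goes by name, reordering the factor levels would not change the resulting scores.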
CodePudding user response:
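Assuming you have missing values in your factor variable, as pointed out by Mikael Jagan, we first need to recode the factor and drop the missing values.
x <- 5 - as.numeric(data$sentiment)  # negative = 3, neutral = 2, positive = 1; "" becomes 4
x <- x[x != 4]                       # drop the empty level ""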
One option could be:
(mean(x, na.rm = TRUE) - 1) * 50
Or, if you don't want to aggregate and would rather rescale each value individually:
(x - 1) * 50
This ensures that your new score ranges from 0 to 100.
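Since the question mentions several sentiments per day, the mean-based index could also be computed per day. The following is only a sketch: the `day` column is hypothetical (the original post does not show one), and the toy data frame is made up for illustration.

```r
# Toy data; the `day` column is hypothetical and not part of the original post.
data <- data.frame(
  day = c("2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"),
  sentiment = factor(c("negative", "neutral", "", "positive", "positive"),
                     levels = c("", "negative", "neutral", "positive"))
)

# Recode so that negative = 3, neutral = 2, positive = 1; the empty level becomes NA.
score <- c(NA, 3, 2, 1)[as.integer(data$sentiment)]

# Mean score per day, rescaled from [1, 3] to [0, 100]; NAs are ignored within each day.
daily_index <- (tapply(score, data$day, mean, na.rm = TRUE) - 1) * 50
daily_index
```

In this toy example the first day averages a negative and a neutral entry, and the second day contains only positive entries, so the first day ends up with the higher (more negative) index.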
Generally, what you might look for is min/max normalization:
https://en.m.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization)
So in your case you could start from any aggregate, such as the mean suggested above, or again from your raw values.
new_x <- 0 + (x - min(x)) * (100 - 0) / (max(x) - min(x))
Example:
set.seed(1)
x <- sample(1:3, 20, replace = TRUE)
new_x <- 0 + (x - min(x)) * (100 - 0) / (max(x) - min(x))
x
[1] 1 3 1 2 1 3 3 2 2 3 3 1 1 1 2 2 2 2 3 1
new_x
[1] 0 100 0 50 0 100 100 50 50 100 100 0 0 0 50 50 50 50 100
[20] 0
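One caveat with taking `min(x)` and `max(x)` from the data: if a sample happens not to contain every score, the observed range is stretched to 0 to 100 anyway, and if all values are equal you divide by zero. Since the theoretical bounds are known here (1 and 3), it may be safer to pass them explicitly. A minimal sketch, where `rescale_fixed` is a hypothetical helper name introduced for illustration:

```r
# Hypothetical helper: min-max rescaling with the theoretical bounds supplied explicitly.
rescale_fixed <- function(x, lo = 1, hi = 3, to = 100) {
  (x - lo) * to / (hi - lo)
}

x <- c(2, 2, 3)  # no score of 1 was observed in this sample

# Fixed bounds: a neutral score of 2 lands in the middle of the range.
fixed <- rescale_fixed(x)
fixed

# Data-driven bounds: the smallest observed value is mapped to 0 instead.
data_driven <- (x - min(x)) * 100 / (max(x) - min(x))
data_driven
```

With fixed bounds, the same raw score always maps to the same index value, which makes indices from different days comparable.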
