How do you make a cumulative index based on 3 factor levels

As the title already suggests, I want to compute a (negativity) cumulative index based on the following 3 levels:

head(data$sentiment)
Levels:  negative neutral positive
sentiment             : Factor w/ 4 levels "","negative",..: 3 3 3 3 3

Say negative is equivalent to 3, neutral to 2, and positive to 1. The higher the score, the more negative. I intend to make an index from 0 to 100, with 100 being the most negative. The levels carry equal weight, and the index should be a cumulation of the several sentiments recorded on a particular day. What would be the best approach?

CodePudding user response：

Base R has a function scale for this purpose. If x is a numeric vector listing scores from a to b and you want scores from 0 to M, then you would do:

scale(x, center = a, scale = (b - a) / M)
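To see what that call does, here is a minimal sketch comparing it with the plain arithmetic (x - a) / (b - a) * M. The 1-to-5 scores and the target maximum of 10 are made-up values for illustration only.

```r
# Hypothetical scores running from a = 1 to b = 5, rescaled to 0..M with M = 10
x <- c(1, 2, 3, 4, 5)
a <- 1
b <- 5
M <- 10

# scale() subtracts `center` and then divides by `scale`
by_scale <- as.double(scale(x, center = a, scale = (b - a) / M))

# The same rescaling written out by hand
by_hand <- (x - a) / (b - a) * M

by_scale
by_hand
```

Both vectors should agree, so scale is just a convenient way of writing the linear rescaling.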

Before you can use scale, you need to coerce your factor sentiment to a numeric vector listing the equivalent scores, like so:

set.seed(1L)
sentiment <- gl(4L, 1L, labels = c("", "negative", "neutral", "positive"))[sample(4L, size = 12L, replace = TRUE)]
sentiment
##  [1]          positive neutral           negative         
##  [7] neutral  neutral  negative negative neutral  neutral 
## Levels:  negative neutral positive
str(sentiment)
## Factor w/ 4 levels "","negative",..: 1 4 3 1 2 1 3 3 2 2 ...

scores <- c(NA, 3, 2, 1)[as.integer(sentiment)]
scores
## [1] NA  1  2 NA  3 NA  2  2  3  3  2  2

Note that we have assigned a missing value NA to the sentiment "" appearing in your factor. Now you can do:

as.double(scale(scores, center = 1, scale = (3 - 1) / 100))
## [1]  NA   0  50  NA 100  NA  50  50 100 100  50  50

Here, as.double is used only to coerce the result of scale (a 1-column matrix) to a vector.
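Since you want one index per day, one possible way to aggregate is to average the rescaled scores within each day. This is only a sketch: the data frame and its day column are assumptions made up for illustration, since your question does not show how dates are stored, and the mean is just one choice of aggregate.

```r
# Hypothetical data: several sentiment labels per day (the `day` column is assumed)
data <- data.frame(
  day = c("2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"),
  sentiment = factor(
    c("negative", "neutral", "", "positive", "positive"),
    levels = c("", "negative", "neutral", "positive")
  )
)

# Map levels to scores: "" -> NA, negative -> 3, neutral -> 2, positive -> 1
scores <- c(NA, 3, 2, 1)[as.integer(data$sentiment)]

# Rescale from 1..3 to 0..100, so that 100 is the most negative
index <- (scores - 1) / (3 - 1) * 100

# Average within each day, ignoring the missing sentiments
daily <- tapply(index, data$day, mean, na.rm = TRUE)
daily
```

Here the first day averages 100 and 50 (the empty label is dropped), and the second day contains only positive labels.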

CodePudding user response：

Assuming you have missing values in your factor variable, as pointed out by Mikael Jagan, you need to recode it first and drop the missing values.

x <- as.numeric(data$sentiment) - 1  # level codes 1..4 become 0..3; 0 is the empty level ""
x <- x[x != 0]                       # drop the empty level

One option could be:

(mean(x, na.rm = TRUE) - 1) * 50

Or, if you don't want to aggregate all values and would rather rescale each value you currently have:

(x - 1) * 50

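This ensures that your new score ranges from 0 to 100. Note that with this recoding negative is 1 and positive is 3, so 100 is the most positive; use (3 - x) * 50 instead if 100 should be the most negative.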

Generally, what you might look for is min/max normalization:

https://en.m.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization)

So in your case you could start from any aggregation, such as the mean I suggested, or again from your raw values.

new_x <- 0 + (x - min(x)) * (100 - 0) / (max(x) - min(x))

Example:

set.seed(1)
x <- sample(1:3, 20, replace = TRUE)
new_x <- 0 + (x - min(x)) * (100 - 0) / (max(x) - min(x))

x
 [1] 1 3 1 2 1 3 3 2 2 3 3 1 1 1 2 2 2 2 3 1

new_x
 [1]   0 100   0  50   0 100 100  50  50 100 100   0   0   0  50  50  50  50 100
[20]   0
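One thing to keep in mind with min-max normalization is that it uses the observed minimum and maximum, so the result shifts if a level happens not to occur in the data. A sketch of the difference, using a hypothetical helper rescale_fixed (not a base R function) that rescales against the known bounds 1 and 3 instead:

```r
# Hypothetical helper: rescale against fixed, known bounds rather than observed ones
rescale_fixed <- function(x, lo = 1, hi = 3, M = 100) {
  (x - lo) / (hi - lo) * M
}

# Made-up scores in which the value 3 never occurs, plus one missing value
x <- c(1, 2, 2, NA)

# Fixed bounds: a score of 2 stays in the middle of the scale
fixed <- rescale_fixed(x)

# Observed bounds: the largest observed score (2) is stretched to 100
rng <- range(x, na.rm = TRUE)
observed <- (x - rng[1]) * 100 / (rng[2] - rng[1])

fixed
observed
```

With fixed bounds the score 2 maps to 50, whereas with observed bounds it maps to 100 because 2 happens to be the maximum in this sample. Which behaviour you want depends on whether the index should be comparable across days.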