I have a dataframe that currently looks like this:
| subjectID | Trial | Score |
|---|---|---|
| 1 | 1 | 16 |
| 1 | 1 | 16 |
| 1 | 1 | 16 |
| 1 | 2 | 8 |
| 1 | 2 | 8 |
| 1 | 2 | 8 |
| 1 | 3 | 12 |
| 1 | 3 | 12 |
| 1 | 3 | 12 |
| 2 | 1 | 9 |
| 2 | 1 | 9 |
| 2 | 1 | 9 |
| 2 | 2 | 10 |
| 2 | 2 | 10 |
| 2 | 2 | 10 |
I need to create a new column, Previous_Trial_Score, that is simply the score on the last trial for each person. For example:
| subjectID | Trial | Score | Previous_Trial_Score |
|---|---|---|---|
| 1 | 1 | 16 | NA |
| 1 | 1 | 16 | NA |
| 1 | 1 | 16 | NA |
| 1 | 2 | 8 | 16 |
| 1 | 2 | 8 | 16 |
| 1 | 2 | 8 | 16 |
| 1 | 3 | 12 | 8 |
| 1 | 3 | 12 | 8 |
| 1 | 3 | 12 | 8 |
| 2 | 1 | 9 | NA |
| 2 | 1 | 9 | NA |
| 2 | 1 | 9 | NA |
| 2 | 2 | 10 | 9 |
| 2 | 2 | 10 | 9 |
| 2 | 2 | 10 | 9 |
And so on. Trial 1 for each subject will always be NA, as there is no previous trial for that person. I am writing a for-loop to accomplish this, below:
for (myperson in unique(data$subjectID)){
for (mytrial in unique(data$Trial[data$Trial>1])){
#Specify the trial and person
Prev_Score=as.numeric(unique(data[data$subjectID==myperson & data$Trial==mytrial-1, "Score"]))
#Save it to the dataframe
data[data$subjectID==myperson & data$Trial==mytrial,"Prev_Score"]=Prev_Score
}
}
In the above loop, I had to specify as.numeric and unique to get R to return a single value properly. However, when I run the loop, I get this error:
Error: Assigned data `value` must be compatible with existing data.
i Error occurred for column `Prev_Score`.
x Can't convert from <double> to <logical> due to loss of precision.
* Locations: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 3...
Does anyone have a solution? I am open to tidyverse/dplyr work-arounds.
CodePudding user response:
We could do a group by summarise to get the lag and then do a join
library(dplyr)
df1 %>%
group_by(subjectID, Trial) %>%
summarise(Previous_Trial_Score = last(Score), .groups= 'drop_last') %>%
mutate(Previous_Trial_Score = lag(Previous_Trial_Score)) %>%
left_join(df1, .)
-output
subjectID Trial Score Previous_Trial_Score
1 1 1 16 NA
2 1 1 16 NA
3 1 1 16 NA
4 1 2 8 16
5 1 2 8 16
6 1 2 8 16
7 1 3 12 8
8 1 3 12 8
9 1 3 12 8
10 2 1 9 NA
11 2 1 9 NA
12 2 1 9 NA
13 2 2 10 9
14 2 2 10 9
15 2 2 10 9
Or slightly compact option with data.table
library(data.table)
setDT(df1)[, Previous_Trial_Score := shift(.SD[, last(Score),
Trial]$V1)[Trial], subjectID]
-output
> df1
subjectID Trial Score Previous_Trial_Score
<int> <int> <int> <int>
1: 1 1 16 NA
2: 1 1 16 NA
3: 1 1 16 NA
4: 1 2 8 16
5: 1 2 8 16
6: 1 2 8 16
7: 1 3 12 8
8: 1 3 12 8
9: 1 3 12 8
10: 2 1 9 NA
11: 2 1 9 NA
12: 2 1 9 NA
13: 2 2 10 9
14: 2 2 10 9
15: 2 2 10 9
data
df1 <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), Trial = c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L), Score = c(16L, 16L, 16L,
8L, 8L, 8L, 12L, 12L, 12L, 9L, 9L, 9L, 10L, 10L, 10L)),
class = "data.frame", row.names = c(NA,
-15L))
CodePudding user response:
Here is another dplyr option, where we just keep the unique rows with distinct, then get the lag value of Score, and join back to the original dataframe. However, if you have more columns in your dataframe, then @akrun provides a cleaner approach of handling the additional columns.
library(dplyr)
df %>%
distinct() %>%
group_by(subjectID) %>%
mutate(Previous_Trial_Score = lag(Score)) %>%
left_join(df, .)
Output
subjectID Trial Score Previous_Trial_Score
1 1 1 16 NA
2 1 1 16 NA
3 1 1 16 NA
4 1 2 8 16
5 1 2 8 16
6 1 2 8 16
7 1 3 12 8
8 1 3 12 8
9 1 3 12 8
10 2 1 9 NA
11 2 1 9 NA
12 2 1 9 NA
13 2 2 10 9
14 2 2 10 9
15 2 2 10 9
Data
df <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L), Trial = c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 2L), Score = c(16L, 16L, 16L,
8L, 8L, 8L, 12L, 12L, 12L, 9L, 9L, 9L, 10L, 10L, 10L)), class = "data.frame", row.names = c(NA,
-15L))
