Distributing value in a column to values in other columns given certain rules-CodePudding

I am currently working on a programming puzzle that sounds straightforward, but apparently it is pretty difficult if I want to do this efficiently in R without having to use for loop to go through a column with 100k rows within a data-frame. I am trying to apply dplyr (particularly group_by and mutate) or data.table, and -apply family, but it's quite tough. Could anyone give some help?

The problem is as follows: given a data-frame df with columns key ("string" data type), x, y, z (all are "numeric" data type). Some elements within column key might get repeated. The rule set is as follows: for every rows with the same value in key column, we determine whether their values in column x are smaller than the sum of elements in column y (for example, with key = aa_bb_1, there are 6 rows with this key, and all of these rows always have the same value in column x. Please see the Sample Output to see how the rule works). If it is, then keep that value in column x , while distributing the element in column x to elements in column y in a decreasing order based on corresponding values in column z. How do we effectively do this given that we need to go through all distinct elements in column key?

Sample Input

df <- data.frame(key = c('aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1', 
                         'aa_bb_1', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1', 
                         'aa_bb_0', 'aa_bb_1', 'ab_ca_0', 'abc_bbb_0', 'abbbc_aa_0', 'aaa_ccc_1', 
                         'aa_bb_0', 'aa_bb_1', 'ab_ca_1', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1', 
                         'aa_bb_1', 'aa_bb_0', 'ab_ca_0', 'abc_bbb_1', 'abbbc_aa_1', 'aaa_ccc_1'),
                     x = c(20, 19, 30, 25, 37, 13, 20, 20, 30, 25, 37, 13, 19, 20, 30, 43, 
                           71, 13, 19, 20, 10, 25, 37, 13, 20, 19, 30, 25, 37,13),
                     y = c(3, 10, 18, 15, 32, 4, 12, 29, 71, 92, 11, 7, 21, 19, 13, 
                           26,28,11,8, 8, 5, 23, 3, 12, 19, 7, 9, 11, 7, 12),
                     z = c(8,13,15,16,10,10,25,21,32,15,45,8,10,50,12,10,35,
                           23,10,12,2,40,45,57,66,49,100,5,11,30))

          key  x  y   z
1     aa_bb_1 20  3   8
2     aa_bb_0 19 10  13
3     ab_ca_0 30 18  15
4   abc_bbb_1 25 15  16
5  abbbc_aa_1 37 32  10
6   aaa_ccc_1 13  4  10
7     aa_bb_1 20 12  25
8     aa_bb_1 20 29  21
9     ab_ca_0 30 71  32
10  abc_bbb_1 25 92  15
11 abbbc_aa_1 37 11  45
12  aaa_ccc_1 13  7   8
13    aa_bb_0 19 21  10
14    aa_bb_1 20 19  50
15    ab_ca_0 30 13  12
16  abc_bbb_0 43 26  10
17 abbbc_aa_0 71 28  35
18  aaa_ccc_1 13 11  23
19    aa_bb_0 19  8  10
20    aa_bb_1 20  8  12
21    ab_ca_1 10  5   2
22  abc_bbb_1 25 23  40
23 abbbc_aa_1 37  3  45
24  aaa_ccc_1 13 12  57
25    aa_bb_1 20 19  66
26    aa_bb_0 19  7  49
27    ab_ca_0 30  9 100
28  abc_bbb_1 25 11   5
29 abbbc_aa_1 37  7  11
30  aaa_ccc_1 13 12  30

Sample Output for aa_bb_1 and aa_bb_0

          key  x  y   z
1     aa_bb_1 20  0   8
2     aa_bb_0 19 10  13 -- Second largest value of z among rows with same key aa_bb_0. Get second distribution equal to min(10,19-7)=min(10,12)=10.
7     aa_bb_1 20  0  25
8     aa_bb_1 20  0  21 -- Nothing left to be distributed => 0 in column y.
13    aa_bb_0 19  0  10 --- Nothing left so distribute 0
14    aa_bb_1 20  1  50 --- Second largest value of z among rows with same key aa_bb_1. So distribute min(19,20-19)=1 to column y. 
19    aa_bb_0 19  2  10 --- Tie as third largest value of z among rows with same key aa_bb_0. Pick *randomly* for now (in reality, I would have another column to decide on which row would get distributed first). Since min(8,19-7-10)=min(8,2)=2, only 2 is distributed.
20    aa_bb_1 20  0  12
25    aa_bb_1 20 19  66 --- Largest value of z among rows with same key aa_bb_1. Get first distribution = min(20, 19)=19.
26    aa_bb_0 19  7  49 --- Largest value of z among rows with same key aa_bb_0. Get first distribution equal to min(7,19)=7.

Caveat. Only perform the above operations if the sum of all the elements in column z with the same key is greater than the value in column x of that key. Example includes aa_bb_1 where x = 20 < 3 19 8 19

CodePudding user response：

Pretty much anything you can do with a for loop.

Here I apply a function to the data.frame split by key, that function being a for loop. Then I assign the output to the ordered df, because the split data frame loop output is ordered by key.



df <- dplyr::arrange(df, key, desc(z))
  
 df$y <- lapply(split(df, df$key), \(x) {
  
  ndf <- x
  
  base <- min(ndf$x)
  
  #out values for y
  yout = list()
  
  for (i in seq(nrow(x))) {
    ##get the max
    maxz <- which.max(ndf$z)
    
    ##get the minimum
    minv <- min(base, ndf$y[maxz])
    
    #add to yout
    yout[[i]] <- minv
    
    #new base
    base <- base - minv
    
    ##update dataframe
    ndf <- ndf[-maxz, ]
  }
  return(yout)
}) |> unlist()

          key  x  y   z
1     aa_bb_0 19  7  49
2     aa_bb_0 19 10  13
3     aa_bb_0 19  2  10
4     aa_bb_0 19  0  10
5     aa_bb_1 20 19  66
6     aa_bb_1 20  1  50
7     aa_bb_1 20  0  25
8     aa_bb_1 20  0  21
9     aa_bb_1 20  0  12
10    aa_bb_1 20  0   8
11  aaa_ccc_1 13 12  57
12  aaa_ccc_1 13  1  30
13  aaa_ccc_1 13  0  23
14  aaa_ccc_1 13  0  10