Duplicate values in R and Python

Time:01-24

I just have a quick question here.

I am trying to find if there are duplicate values in a vector in R. For example, given the vector below:

numbers <- c(10, 45, 32, 10, 56, 43, 32, 9)

I want to create a for-loop nested within a for-loop to find if any values occur more than once (i.e. 10 & 32).

I want to do it with nested for loops, since I want to practice doing this with R.

I have written a working script in Python that can successfully find duplicate values:

numbers = [10, 45, 32, 10, 56, 43, 32, 9]

def similars(ourlist, container):
    for i in range(len(ourlist)):
        k = i + 1
        for j in range(k, len(ourlist)):
            if ourlist[i] == ourlist[j] and ourlist[i] not in container:
                container.append(ourlist[i])
    return container

container1 = []
similars(numbers, container1)
print(container1)

The above is the Python code, and it finds the duplicate values 10 & 32 when I print (and excuse the possible indentation errors, the indentations sort of messed up when transporting it to StackOverflow :-) ).

I have some R code below that attempts to do the same:

numbers <- c(10, 45, 32, 10, 56, 43, 32, 9)

similars <- function(ourlist, container){
  for (i in 1:length(ourlist)){
    k <- i + 1
    for (j in k:length(ourlist)){
      if (ourlist[i] == ourlist[j] & !(ourlist[i] %in% container)){
        container[i] <- ourlist[i]
      }
    }
  }
  return (container)
}

container1 <- c()

similars(numbers, container1)

print(container1)

However, I get the following error message when I attempt to run it:

Error in if (ourlist[i] == ourlist[j] & !(ourlist[i] %in% container)) { : 
  missing value where TRUE/FALSE needed
Calls: similars
Execution halted

I feel there is a simple answer to it, but, frustratingly, I do not seem to be able to intuit it. Do any of you know why it says "missing value where TRUE/FALSE needed" when the same error does not occur in Python, and perhaps also how to fix the error in R?

In advance, thank you.

Best regards

CodePudding user response:

  1. Your inner loop is extending beyond the length of ourlist. With this example, i iterates from 1 to 8 (length(ourlist)); on the last iteration, when i is 8, you call k <- i + 1, making it 9. You then iterate j from k to length(ourlist), which evaluates to 9:8 (a decreasing sequence of length 2). Since ourlist[9] is NA, the comparison ourlist[i] == ourlist[j] is NA, and if() stops with "missing value where TRUE/FALSE needed".

    The answer, knowing that you want to compare an element with the element(s) after it, is that your i must iterate up to but not including length(ourlist). That way, your k <- i + 1 will never exceed the length of ourlist.

    A literal fix for that:

    similars <- function(ourlist, container){
      for (i in 1:(length(ourlist)-1)) {
        k <- i + 1
        for (j in k:length(ourlist)){
          if (ourlist[i] == ourlist[j] & !(ourlist[i] %in% container)){
            if (is.na(ourlist[i])) browser()
            container[i] <- ourlist[i]
          }
        }
      }
      return (container)
    }
    similars(numbers, container1)
    # [1] 10 NA 32
    
    Next issue: why the NA? That's because you are assigning to the output at index i rather than appending one element to the output. Let's do the append:

    similars <- function(ourlist, container){
      for (i in 1:(length(ourlist)-1)) {
        k <- i + 1
        for (j in k:length(ourlist)){
          if (ourlist[i] == ourlist[j] & !(ourlist[i] %in% container)){
            if (is.na(ourlist[i])) browser()
            container <- c(container, ourlist[i])
            # container[i] <- ourlist[i]
          }
        }
      }
      return (container)
    }
    similars(numbers, container1)
    # [1] 10 32
    
  2. (Minor.) Inside an if clause, the conditional must always be length-1. Use && instead of &.

    if (ourlist[i] == ourlist[j] && !(ourlist[i] %in% container)){
    

    Why? Primarily for short-circuiting. & and | are vectorized, which means they accept something like c(TRUE,FALSE) | c(FALSE, TRUE) and always evaluate both sides. && and || work on single values only, but they short-circuit: if the first operand already determines the result, the second is not evaluated at all. Examples:

    TRUE || stop("oops")
    # [1] TRUE
    FALSE && stop("oops")
    # [1] FALSE
    TRUE && stop("oops")
    # Error: oops
    
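  3. (Minor.) Passing container seems unnecessary here. R does not pass by reference: changes made to container inside the function are not visible to the caller (the print(container1) in the question would still show NULL), so it is not as if you are pre-allocating memory here. I suggest you remove it from the argument list, and pre-define it in the function.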

    similars <- function(ourlist) {
      container <- c()
      for (i in 1:(length(ourlist)-1)) {
        k <- i + 1
        for (j in k:length(ourlist)){
          if (ourlist[i] == ourlist[j] && !(ourlist[i] %in% container)){
            if (is.na(ourlist[i])) browser()
            container <- c(container, ourlist[i])
            # container[i] <- ourlist[i]
          }
        }
      }
      return (container)
    }
    
  4. (More minor.) Let's think along the computer-science-y (CS) lines of "allow 0 or more". In this sense, is it "reasonable" to pass an empty vector? If that is given as the argument, then one might expect an empty vector be returned as well. However ... 1:length(.) will not work here. Demo:

    vec <- 2:4
    1:length(vec)
    # [1] 1 2 3
    seq_along(vec)
    # [1] 1 2 3
    seq_len(length(vec))
    # [1] 1 2 3
    
    vec <- c()
    1:length(vec)
    # [1] 1 0             # this is broken
    seq_along(vec)
    # integer(0)
    seq_len(length(vec))
    # integer(0)
    

    I suggest you use seq_len(length(ourlist)) (or length(.)-1), making the final version in this answer:

    similars <- function(ourlist) {
      container <- c()
      for (i in seq_len(max(0, length(ourlist)-1))) {
        k <- i + 1
        for (j in (k-1) + seq_len(max(0, length(ourlist)-(k-1)))) {
          if (ourlist[i] == ourlist[j] && !(ourlist[i] %in% container)){
            if (is.na(ourlist[i])) browser()
            container <- c(container, ourlist[i])
            # container[i] <- ourlist[i]
          }
        }
      }
      return (container)
    }
    similars(numbers)#, container1)
    # [1] 10 32
    similars(c())
    # NULL
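
For comparison with the point above about passing container, here is a minimal Python sketch of why the question's Python version works without using the return value while the R version does not. The function names append_in_place and rebind_copy are hypothetical, chosen only for this illustration. Python's list.append mutates the caller's list, whereas building a new list and rebinding the local name (which is roughly how container <- c(container, x) behaves in R) leaves the caller's object untouched.

```python
def append_in_place(container, value):
    # Mutates the very list object the caller passed in.
    container.append(value)


def rebind_copy(container, value):
    # Builds a new list and rebinds only the local name;
    # the caller's list is left unchanged.
    container = container + [value]
    return container


container1 = []
append_in_place(container1, 10)
print(container1)   # [10]

container2 = []
result = rebind_copy(container2, 10)
print(container2)   # []
print(result)       # [10]
```

In R, the practical consequence is that the caller has to keep the return value, for example container1 <- similars(numbers, container1), or simply use the one-argument version shown above.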
    

CodePudding user response:

The loop can be a single loop instead of a nested one. Loop over the sequence from the 2nd element to the last (length); if the current element ourlist[i] is present (%in%) among the previous elements and not (!) present in the storage container, concatenate (c) it onto container and update by assignment (<-).

similars <- function(ourlist, container){
  for(i in 2:length(ourlist)) {  
      if(ourlist[i] %in% ourlist[seq(i-1)] & !(ourlist[i] %in% container)) {
         container <- c(container, ourlist[i])      
        }
     }
  
     container   
   }

-testing

> container1 <- c()
> similars(numbers, container1)
[1] 10 32

Here, we don't want to use a nested loop because %in% is vectorized and thus saves a lot of unnecessary iterations.
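
Since the question also has a Python version, the same single-loop idea might be sketched there as follows. This is only an illustrative translation, not code from the question: the one-argument signature and the name similars_single are assumptions, and the slice ourlist[:i] plays the role of ourlist[seq(i-1)].

```python
def similars_single(ourlist):
    # Collect each value that also appears earlier in the list, once.
    container = []
    for i in range(1, len(ourlist)):
        # ourlist[:i] holds every element before position i.
        if ourlist[i] in ourlist[:i] and ourlist[i] not in container:
            container.append(ourlist[i])
    return container


numbers = [10, 45, 32, 10, 56, 43, 32, 9]
print(similars_single(numbers))   # [10, 32]
```

Unlike 2:length(ourlist) in R, range(1, len(ourlist)) is simply empty for lists with fewer than two elements, so the sketch returns an empty list in that case.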


It can be done more easily with duplicated in R:

> numbers[duplicated(numbers)]
[1] 10 32 
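
For readers coming from the Python side, here is a rough sketch of what duplicated does, written as a hypothetical helper of the same name (Python has no built-in with that name). It flags every element that has already been seen earlier in the sequence, and the flags are then used to pick out those elements.

```python
def duplicated(values):
    # Return one boolean per element: True if the value occurred earlier.
    seen = set()
    flags = []
    for v in values:
        flags.append(v in seen)
        seen.add(v)
    return flags


numbers = [10, 45, 32, 10, 56, 43, 32, 9]
flags = duplicated(numbers)
print([v for v, dup in zip(numbers, flags) if dup])   # [10, 32]
```

Note that, as in R, a value that occurs three times is flagged twice, so it would appear twice in the result unless the result is de-duplicated afterwards.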

As for why there is an error, it is already explained in the comments: your outer loop runs up to the last element, and you then assign k <- i + 1, which is outside the index range.
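
As to why the Python version does not fail in the same way, the following sketch may help. The helper r_colon is a hypothetical stand-in that imitates R's a:b for integers (inclusive, counting down when b < a); it is not part of any library. Python's range with a start beyond its stop is simply empty, so the inner loop body never runs on the last outer iteration.

```python
def r_colon(start, stop):
    # Imitate R's start:stop for integers: inclusive, counts up or down.
    step = 1 if stop >= start else -1
    return list(range(start, stop + step, step))


numbers = [10, 45, 32, 10, 56, 43, 32, 9]
n = len(numbers)

# R (1-based): on the last outer iteration i = 8, so k = 9 and k:length is 9:8.
print(r_colon(9, 8))        # [9, 8]

# Python (0-based): on the last outer iteration i = 7, so k = 8.
print(list(range(8, n)))    # []
```

In R, index 9 is then read from an 8-element vector, which yields NA and triggers the error; in Python, the empty range means no out-of-range index is ever touched.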
