R multiple regular expressions, dataframe column names

Time:01-22

I have a dataframe data with a lot of columns in the form of

  ...v1...min ...v1...max ...v2...min ...v2...max
1       a           a           a           a
2       b           b           b           b
3       c           c           c           c

where each ... can stand for any string.

I would like to create a function createData that takes three arguments:

  • X: a dataframe,

  • cols: a vector containing the first part of the column name, e.g. c("v1", "v2")

  • fun: a vector containing the second part of the column name, e.g. c("min") or c("max", "min")

and returns the filtered dataframe. For example:

createData(X, c("v1"), NULL) would return this kind of dataframe:

  ...v1...min ...v1...max 
1       a           a    
2       b           b  
3       c           c 

while createData(X, c("v1", "v2"), c("min")) would give me

  ...v1...min ...v2...min 
1       a           a    
2       b           b  
3       c           c 

At this point I decided I need to use something like select(contains()) from the dplyr package.

createData <- function(X, cols, fun)
{
  # incomplete attempt: this only filters on cols, fun is not used yet
  X <- X %>% select(contains(cols))
  return(X)
}

What I struggle with is:

  • how to filter columns whose names contain two (or maybe more?) strings, e.g. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)], but it doesn't seem to work, and my expressions aren't fixed anyway: they depend on the vector I pass,

  • how to filter multiple columns with different names, e.g. c("v1", "v2"), passed in a vector? And how to combine that with the first question?

I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!

EDIT:

A reproducible example:

data = data.frame(AXv1c2min = c(1,2,3),
           subv1trwmax = c(4,5,6),
           ss25v2xxmin = c(7,8,9),
           cwfv2urttmmax = c(10,11,12))

CodePudding user response:

If you pass a vector to contains(), it acts like an OR: a column is kept if its name contains any of the strings. Chaining several select() calls acts like an AND, because each call filters the result of the previous one. So for your example data:

We can filter for (v1 OR v2) AND min like this:

library(tidyverse)

data %>%
    select(contains(c('v1','v2'))) %>%
    select(contains('min'))

  AXv1c2min ss25v2xxmin
1         1           7
2         2           8
3         3           9

So as a function where either argument is optional:

createData <- function(data, fun=NULL, cols=NULL) {
    if (!is.null(fun)) data <- select(data, contains(fun))
    if (!is.null(cols)) data <- select(data, contains(cols))
    return(data)
}
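
Since the question says dplyr is not required, the same OR-within-a-vector, AND-between-vectors logic can be sketched in base R with grepl(). This is only an illustration: the names createDataBase and has_any are made up here and are not part of the answer above. It also assumes literal, case-sensitive substring matching (fixed = TRUE), whereas contains() ignores case by default.

```r
# Reproducible data from the question
data <- data.frame(AXv1c2min = c(1, 2, 3),
                   subv1trwmax = c(4, 5, 6),
                   ss25v2xxmin = c(7, 8, 9),
                   cwfv2urttmmax = c(10, 11, 12))

# Hypothetical base-R counterpart of createData (name chosen for illustration)
createDataBase <- function(data, fun = NULL, cols = NULL) {
  # TRUE for each column name containing at least one of the given substrings;
  # a NULL or empty vector means "no restriction"
  has_any <- function(patterns, nms) {
    if (length(patterns) == 0) return(rep(TRUE, length(nms)))
    Reduce(`|`, lapply(patterns, grepl, x = nms, fixed = TRUE))
  }
  nms <- names(data)
  keep <- has_any(cols, nms) & has_any(fun, nms)  # AND between the two vectors
  data[, keep, drop = FALSE]                      # drop = FALSE keeps a data frame
}

names(createDataBase(data, cols = c("v1", "v2"), fun = "min"))
```

On the example data this should keep AXv1c2min and ss25v2xxmin, the same columns as the dplyr version.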

A series of examples:

createData(data, cols=c('v1', 'v2'), fun='min')
  AXv1c2min ss25v2xxmin
1         1           7
2         2           8
3         3           9

createData(data, cols=c('v1'))
  AXv1c2min subv1trwmax
1         1           4
2         2           5
3         3           6

createData(data, fun=c('min'))
  AXv1c2min ss25v2xxmin
1         1           7
2         2           8
3         3           9

createData(data, cols=c('v1'), fun=c('min', 'max'))
  AXv1c2min subv1trwmax
1         1           4
2         2           5
3         3           6

createData(data, cols=c('v1'), fun=c('max'))
  subv1trwmax
1           4
2           5
3           6
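
As for why the grepl() attempt in the question did not work: in a regular expression, 1* means "zero or more 1 characters", so v1*min only matches a v, optional 1s, and then min immediately after. To allow arbitrary text in between, the pattern needs .* (v1.*min). A pattern can also be built from the vectors with paste(). The sketch below assumes the cols part always appears before the fun part in the column name, and that the strings contain no regex metacharacters; selectOrdered is a hypothetical helper name, not something from the answer above.

```r
# Reproducible data from the question
data <- data.frame(AXv1c2min = c(1, 2, 3),
                   subv1trwmax = c(4, 5, 6),
                   ss25v2xxmin = c(7, 8, 9),
                   cwfv2urttmmax = c(10, 11, 12))

# The original pattern fails because "1*" repeats the character 1
grepl("v1*min", "AXv1c2min")   # no match: "c2" sits between "v1" and "min"
grepl("v1.*min", "AXv1c2min")  # matches: ".*" allows any text in between

# Hypothetical helper: keep columns where one of `cols` is followed by one of `fun`
selectOrdered <- function(data, cols, fun) {
  # e.g. cols = c("v1", "v2"), fun = "min" gives "(v1|v2).*(min)"
  pattern <- paste0("(", paste(cols, collapse = "|"), ").*(",
                    paste(fun, collapse = "|"), ")")
  data[, grepl(pattern, names(data), ignore.case = TRUE), drop = FALSE]
}

names(selectOrdered(data, cols = c("v1", "v2"), fun = "min"))
```

Unlike the contains() approach, this variant enforces the order of the two parts, which may or may not be what you want.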