How to use regexp in MATLAB to strictly match a substring and not a larger string containing that su

Time:01-14

I would like to find which cells in a cell array contain exactly the string foo (nothing before, nothing after), given that other cells may contain longer strings such as foobar.

I am currently using regexp in MATLAB and would like to tweak the search pattern so that it excludes cells whose string merely contains the substring I defined.

I know it kind of goes against the very idea of regexp, but I am fairly certain there is a way to do what I want.

As a MWE, here is a snippet of the data I have (in cell format), called potentialfields:

'horaracha'
'sol'
'presmax'
'horapresmax'
'presmin'
'horapresmin'

and the regexp call that I am currently using:

selected_fields={'sol','presmin'};
diffset=setdiff(potentialfields,selected_fields);
pattern=strjoin(diffset,'|');
idx_to_delete=~cellfun(@isempty,regexp(potentialfields,pattern));

The expected output of idx_to_delete is the following:

1 0 1 1 0 1

At the moment, the output is 1 0 1 1 1 1 because horapresmin contains presmin.

Thank you very much in advance.

CodePudding user response:

regexp is overkill here. ismember is a built-in function that tests for exact matches against a cell array of strings. Since idx_to_delete should flag the fields that are not in selected_fields, negate the result:

idx_to_delete = ~ismember( potentialfields, selected_fields );

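If you're really set on regexp, you can use the start anchor (^) and end anchor ($) so that the whole string has to match one of the selected fields:

pattern = ['^(', strjoin( selected_fields, '|' ), ')$'];
% Fields with no match are not in selected_fields, so they are the ones to delete
idx_to_delete2 = cellfun( @isempty, regexp( potentialfields, pattern ) );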
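To check that the two approaches agree, here is a minimal sketch using the sample data from the question. It assumes potentialfields is a row cell array of char vectors (the question only shows the values, not the shape), and the names idx_ismember and idx_regexp are made up for this comparison.

```matlab
% Sample data from the question (row orientation is an assumption)
potentialfields = {'horaracha','sol','presmax','horapresmax','presmin','horapresmin'};
selected_fields = {'sol','presmin'};

% Exact comparison: true where a field is NOT one of the selected fields
idx_ismember = ~ismember(potentialfields, selected_fields);

% Anchored regexp: ^ and $ force the whole string to equal one alternative
pattern = ['^(', strjoin(selected_fields, '|'), ')$'];
% An empty result means no match, i.e. the field is not selected
idx_regexp = cellfun(@isempty, regexp(potentialfields, pattern, 'once'));

disp(idx_ismember)                        % 1 0 1 1 0 1
disp(isequal(idx_ismember, idx_regexp))   % 1
```

If the field names could contain regexp metacharacters such as . or (, they would need escaping before being joined into a pattern, which is one more reason to prefer ismember for exact matches.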

CodePudding user response:

You can build the word boundary based regex dynamically:

pattern = strcat('\<(', strjoin(diffset,'|'), ')\>')
idx_to_delete=~cellfun(@isempty,regexp(potentialfields,pattern))

strjoin(diffset,'|') builds the alternation, and \<(...)\> wraps that group in word boundaries. The boundaries apply to the start and end of every alternative, so each one matches only as a whole word and not as part of a longer word.
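Word boundaries and anchors are not quite the same thing, which the following rough sketch tries to show. The test strings in names are hypothetical (they are not from the question's data), and the variable names are made up for illustration. Note that in a single-quoted MATLAB char vector the backslash is taken literally, so a single backslash is enough in \< and \>.

```matlab
% Hypothetical test strings: an exact name, a longer name,
% and the name followed by a second word
names = {'presmin', 'horapresmin', 'presmin extra'};

word_pattern   = '\<(presmin)\>';   % must match as a whole word
anchor_pattern = '^(presmin)$';     % must match the whole string

word_hits   = ~cellfun(@isempty, regexp(names, word_pattern, 'once'));
anchor_hits = ~cellfun(@isempty, regexp(names, anchor_pattern, 'once'));

disp(word_hits)     % 1 0 1
disp(anchor_hits)   % 1 0 0
```

Both patterns reject horapresmin, but only the anchored one rejects a string that contains presmin as a separate word. For single-word field names like those in the question, the two behave the same.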
