MySQL REGEXP to check a string contains all these characters in any order-CodePudding

I'm trying to write a query that searches a list of words. One of the conditions is to check words that contain (in any order) a any of the characters in string.

For example the word must contain 'o' and 'd' in any order, so 'ABDOMEN' and 'ABOUND'are both correct.

My query is :

SELECT word
FROM words
WHERE lower(word) like 'ab%'                   /* Words starts with AB               */
AND   REGEXP_INSTR(lower(word), '[str]') = 0   /* does not contain any of r, s or t  */
AND   REGEXP_INSTR(lower(word), '[od]') > 0    /* must  contain both o and d         */

The problem is the 'must contain' condition, specifically getting it to check for both 'O' and 'D', whereas above it is more like 'O' or 'D'.

Experimenting, I found this works:

AND   REGEXP_INSTR(lower(word), '(o.*d|d.*o)' ) > 0    /* must  contain both o and d         */

The problem with this is I have to generate (in PHP) the (o.*d|d.*o) from the original od. When the list gets to over 3 chars generating this becomes arduous.

The other alternative is to add a separate condition for each character in the 'must contain' list:

AND   INSTR(lower(word), 'o' ) > 0    /* must  contain o          */
AND   INSTR(lower(word), 'd' ) > 0    /* must  contain d         */

However when using the bind_param call in PHP passing these in it makes the code messy.

Is there a 'one-liner' in MySQL that will achieve these above?

CodePudding user response：

The series of letters can be handled as in this example, and order of their appearance will be ignored:

WHERE REGEXP_INSTR(lower(word), '(?=.*O)(?=.*D)')

CodePudding user response：

The case sensitivity is determined by the collation on the column. Unless you have a specific reason for using a case sensitive collation, I suggest changing it to case insensitive to do away with the need to explicitly force the case. Instead of applying another function to every word you can just set the match type to case-insensitive for REGEXP_INSTR. You can also move the prefix check into the regexp -

SELECT word
FROM words
WHERE REGEXP_INSTR(word, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');

Of course, the above query cannot use any available index to filter so moving the prefix to the regex is not such a great idea. This lead me to do some tests. I took a simplified copy of my dictionary to create the following table (111,745 rows) -

CREATE TABLE `words` (
  `word_cs` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
  `word_ci` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
  KEY `idx_word_ci` (`word_ci`),
  KEY `idx_word_cs` (`word_cs`)
) ENGINE=InnoDB;

The first batch of queries I ran against the case-sensitive word_cs and I ran each query five times and the timings are the averages -

SELECT word_cs
FROM words
WHERE REGEXP_INSTR(word_cs, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.166s */

SELECT word_cs
FROM words
WHERE REGEXP_INSTR(lower(word_cs), '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.193s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(word_cs, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.067s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(lower(word_cs), '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.065s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND INSTR(lower(word_cs), 'o' ) > 0
AND INSTR(lower(word_cs), 'd' ) > 0;
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.064s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND lower(word_cs) LIKE '%o%'
AND lower(word_cs) LIKE '%d%';
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.063s */

I then ran a similar (slightly modified due to lack of case sensitivity) batch of queries against the case-insensitive word_ci -

SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.147s */

SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.157s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND INSTR(word_ci, 'o' ) > 0
AND INSTR(word_ci, 'd' ) > 0;
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND word_ci LIKE '%o%'
AND word_ci LIKE '%d%';
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */

	word_cs	word_ci
Query 1	0.166	0.147
Query 2	0.193	0.157
Query 3	0.067	0.003
Query 4	0.065	0.003
Query 5	0.064	0.001
Query 6	0.063	0.001

The real difference in performance is not down to the case-sensitivity of the collation, or lack thereof, but whether the query is sargable or not. The use of LOWER() on the column value negates index use.

/* Full table scan */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'

/* Uses index if available */
SELECT word_cs
FROM words
WHERE word_cs LIKE 'ab%'
OR word_cs LIKE 'AB%'
OR word_cs LIKE 'Ab%'
OR word_cs LIKE 'aB%';