2 years ago

#41817


Markus AO

Regex: Case-insensitive Matching with Case-sensitive Exceptions

Use case: I have a large bulk of text I need to match, line by line, against a large list of terms to audit for consistency etc.

What I do: Take the term list, order it by length (descending) so that longer terms precede any shorter terms they contain, and join them into a monster \b(capture|these|words|list)\b. This is all working basically fine. Generating case variants for each term would bloat the regex beyond reason, so the terms are matched case-insensitively with the i flag.
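For illustration, the build step described above might look roughly like the sketch below. The `$terms` list is a made-up stand-in for the real term list, and the `~` delimiter is simply carried over from the sample code further down; treat this as one possible way to assemble the pattern, not the exact code in use.

```php
<?php
// Hypothetical term list; the real one is much larger.
$terms = ['random', 'coffee', 'iterations', 'iteration'];

// Longest terms first, so "iterations" is tried before "iteration".
usort($terms, fn($a, $b) => strlen($b) <=> strlen($a));

// Escape each term (including the "~" delimiter), then join into one alternation.
$quoted = array_map(fn($t) => preg_quote($t, '~'), $terms);
$rx = '~\b(' . implode('|', $quoted) . ')\b~i';

echo $rx, PHP_EOL;
```

Escaping with `preg_quote()` matters once the list contains terms with regex metacharacters such as `.` or `+`.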

Problem: There are some terms that should only ever be matched in uppercase or title case, e.g. IT or WHO, Will or Sandy. If every it and will is also matched, it generates a ridiculous volume of noise matches to review. Having "Name" terms confused at the start of a sentence is not really an issue here; and there's no technical way to tell them apart anyway.

Required: I need to come up with a case-insensitive regex with a large capture group, inside which indicated options are handled as case-sensitive instead. As far as I can tell, there are no such modifiers that could be attached to individual capture groups or parts thereof. What other approaches do we have in the toolbox?

Sample Code:

$rx = '~\b(iterations|coffee|random|Will|IT)\b~i';

$texts = [
    'Complex iterations, coffee and IT infrastructure', // should match "iterations", "coffee", "IT"
    'We will call them Iterations later on', // should match "Iterations"
    'On matters of IT, we have pondered much', // should match "IT"
    'No matched bits and it is for the great good.' // should not match
];

$result = array_map(function($text) use ($rx) {
    preg_match_all($rx, $text, $match);
    return $match;
}, $texts);

var_dump($result);

Any solutions? Aside from a second round of matching with a case-sensitive regex to filter out the unwanted cases? That would mean checking for lines that match an "unwanted" case of certain terms but do not also contain other "wanted" terms. Doable, but I would prefer to keep the business logic in a single regex, to keep this more portable.


P.S. I've read Combine case sensitive and insensitive regex... and it doesn't cover this.

Update: It is possible to use inline modifiers inside groups! Thanks, Wiktor. For my case, (?-i:Will|IT) would do it. The global i flag is canceled with -i for that specific group. There's an excellent answer on Can you make just part of a regex case-insensitive? with more details and implementation in various languages.
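A minimal sketch of how this could be applied to the sample above, assuming the term list is split into two hypothetical arrays, `$insensitive` and `$sensitive` (that split and the `$quote` helper are illustrative names, not part of the original code):

```php
<?php
// Hypothetical split of the term list into two groups.
$insensitive = ['iterations', 'coffee', 'random'];
$sensitive   = ['Will', 'IT'];

// Helper: escape each term and join the list into an alternation.
$quote = fn(array $terms) => implode('|', array_map(fn($t) => preg_quote($t, '~'), $terms));

// (?-i:...) turns the i flag off for the case-sensitive alternatives only.
$rx = '~\b(' . $quote($insensitive) . '|(?-i:' . $quote($sensitive) . '))\b~i';

$texts = [
    'Complex iterations, coffee and IT infrastructure', // iterations, coffee, IT
    'We will call them Iterations later on',            // Iterations
    'On matters of IT, we have pondered much',          // IT
    'No matched bits and it is for the great good.',    // no match
];

$results = [];
foreach ($texts as $text) {
    preg_match_all($rx, $text, $m);
    $results[] = $m[1]; // captured terms for this line
    echo implode(', ', $m[1]), PHP_EOL;
}
```

The lowercase "will" and "it" no longer match, while "Iterations" is still caught by the case-insensitive part of the group.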

php

regex

regex-group

string-matching

case-sensitive

0 Answers
