Regex for attribute value having quotes in between same as the enclosing quotes

Time:01-19

The string contains multiple occurrences of the alt attribute. Each alt value itself contains a double quote ("), so the match stops at the first double quote inside the value instead of capturing the full value. How can the regex be modified to capture the full alt value?

$text = 'advcd<img loading="lazy"  alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2" attr="val"><img loading="lazy"  alt="abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3">sdfs';

preg_match_all('/(alt)=(["\'][^"\']*["\'])/i', $text, $matches);

if (count($matches) > 1) {
    print_r($matches);
}

Current Output:

Array
(
    [0] => Array
        (
            [0] => alt="chi-phi-sinh-o-benh-v"
            [1] => alt="abcd-sinh-o-benh-"
        )

    [1] => Array
        (
            [0] => alt
            [1] => alt
        )

    [2] => Array
        (
            [0] => "chi-phi-sinh-o-benh-v"
            [1] => "abcd-sinh-o-benh-"
        )

)

Expected Output:

Array
(
    [0] => Array
        (
            [0] => alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2"
            [1] => alt="abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3"
        )

    [1] => Array
        (
            [0] => alt
            [1] => alt
        )

    [2] => Array
        (
            [0] => "chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2"
            [1] => "abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3"
        )

)

CodePudding user response:

It seems the input is malformed: a " inside the value should be escaped with a preceding \. Still, the following regex works for the given input.

(alt)=((["\']).*?[^\\]\3)(?:\s|>)

\3: a backreference to capture group 3. It ensures the value ends with the same quote character it started with (" or ').

[^\\]\3: the closing quote must not be preceded by a \, so an escaped quote does not end the value.

(?:\s|>): the closing quote must be followed by whitespace or >.

https://www.phpliveregex.com/p/DmU
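
As a minimal sketch, the pattern above can be run against the question's string like this. The /i flag and the / delimiters are carried over from the question's own call and are an assumption here. Inside a single-quoted PHP string the class [^\\] has to be written with four backslashes. Note that $matches[0] also includes the trailing whitespace or > consumed by the last group, so the quoted value is read from $matches[2].

```php
<?php
// Sample input from the question.
$text = 'advcd<img loading="lazy"  alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2" attr="val"><img loading="lazy"  alt="abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3">sdfs';

// Group 3 captures the opening quote; \3 requires the same quote to close the value.
// [^\\\\] in PHP source becomes [^\\] in the regex: "any character except a backslash".
$pattern = '/(alt)=((["\']).*?[^\\\\]\3)(?:\s|>)/i';

preg_match_all($pattern, $text, $matches);

// Group 2 holds the quoted value, without the trailing space or ">".
print_r($matches[2]);
```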

CodePudding user response:

If you convert the " inside the attribute value to &quot;, the markup becomes valid and you can use a DOM parser to get the alt values:

$text = 'advcd<img loading="lazy"  alt="chi-phi-sinh-o-benh-v&quot;ien-dai-hoc-y-duoc-co-so-2" attr="val"><img loading="lazy"  alt="abcd-sinh-o-benh-&quot;ien-dai-hoc-y-duoc-co-so-3">sdfs';
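$dom = new DOMDocument();
// Parse the fragment without adding implied <html>/<body> wrappers or a doctype.
$dom->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Select the alt attribute of every img element.
foreach ($xpath->evaluate("//img/@alt") as $i) {
    echo $i->nodeValue . PHP_EOL;
}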

Output

chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2
abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3
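
The conversion step itself is not shown above. One possible sketch, assuming a quote only closes the alt value when it is followed by another attribute or by >, uses preg_replace_callback to encode the inner quotes. The variable name $fixed is hypothetical, and this heuristic may not hold for arbitrary markup.

```php
<?php
// Sample input from the question, with raw " inside the alt values.
$text = 'advcd<img loading="lazy"  alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2" attr="val"><img loading="lazy"  alt="abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3">sdfs';

// Capture everything between alt=" and the quote that is followed by
// either the next attribute (name=") or the end of the tag (>),
// then encode any " left inside the captured value as &quot;.
$fixed = preg_replace_callback(
    '/\balt="(.*?)"(?=\s*(?:[^\s=]+="|>))/i',
    function (array $m): string {
        return 'alt="' . str_replace('"', '&quot;', $m[1]) . '"';
    },
    $text
);

echo $fixed . PHP_EOL;
```

The resulting string can then be passed to loadHTML in place of the hand-edited $text.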

Using a regex for your example strings:

  • (alt)= Capture group 1, match alt followed by =
  • ( Capture group 2
    • ".*?" match from " and then the least amount of characters till the next "
    • (?= Positive lookahead
      • \s* Match optional whitespace chars
      • (?:[^\s=]+="|>) Match either one or more non-whitespace chars other than =, followed by = and ", OR match >
    • ) Close lookahead
  • ) Close group 2


$text = 'advcd<img loading="lazy"  alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2" attr="val"><img loading="lazy"  alt="abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3">sdfs';

preg_match_all('/(alt)=(".*?"(?=\s*(?:[^\s=]+="|>)))/i', $text, $matches);

if (count($matches) > 1) {
    print_r($matches);
}

Output

Array
(
    [0] => Array
        (
            [0] => alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2"
            [1] => alt="abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3"
        )

    [1] => Array
        (
            [0] => alt
            [1] => alt
        )

    [2] => Array
        (
            [0] => "chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2"
            [1] => "abcd-sinh-o-benh-"ien-dai-hoc-y-duoc-co-so-3"
        )

)
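
The pattern in this answer only handles double-quoted values, while the question's original regex accepted both quote styles. A hedged variant, combining the lookahead with a backreference to the opening quote, might look like the sketch below. The single-quoted img tag is a hypothetical input added for illustration, and the values are captured without their enclosing quotes.

```php
<?php
// First tag is from the question; the second, single-quoted tag is hypothetical.
$text = 'advcd<img loading="lazy"  alt="chi-phi-sinh-o-benh-v"ien-dai-hoc-y-duoc-co-so-2" attr="val">'
      . "<img alt='it's here' src='x.png'>";

// Group 1: opening quote; group 2: the value; \1: the same quote closes it.
// The lookahead accepts the closing quote only before another attribute or ">".
$pattern = '/\balt=(["\'])(.*?)\1(?=\s*(?:[^\s=]+=["\']|>))/i';

preg_match_all($pattern, $text, $matches);

// Group 2 holds the values without their enclosing quotes.
print_r($matches[2]);
```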