How would I match all "quote blocks" in plaintext e-mail in PHP PCRE?

Time:01-30

I'm trying to match all the quotes in the following example e-mail message:

> Don't forget to buy eggiweggs on the way home.

I shall not.

> Also remember to brush your shoes.

Will do.

> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.

I see what you mean. They sure make a mess.

That means I want to match these three strings:

> Don't forget to buy eggiweggs on the way home.

And:

> Also remember to brush your shoes.

And:

> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.

I don't understand how I can do this, since if I use the s flag to span multiple lines, which is required for this, I cannot refer to ^ and $ to mean "beginning of line" and "end of line" -- instead, they mean "beginning of string" and "end of string".

So if I do: #^(> .+?)$#us, it will match everything after/with the first quote.

And if I do: #^(> .+?)$#um, it will match only the first quote's first line and nothing else.

This is frustrating. I really have no idea how to solve it. I've searched online before asking and found zero even remotely relevant pages as usual.

CodePudding user response:

With preg_match_all:

preg_match_all('~^> .*(?:\R> .*)*~m', $txt, $matches);
$result = $matches[0];

(where \R is an alias for several newline sequences)
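
For reference, here is a small self-contained sketch that runs the pattern above against the message from the question. The variable names and the "\n" line endings are assumptions made for the demo.

```php
<?php
// Sample message from the question, joined with "\n" (an assumption for this demo).
$txt = implode("\n", [
    "> Don't forget to buy eggiweggs on the way home.",
    "",
    "I shall not.",
    "",
    "> Also remember to brush your shoes.",
    "",
    "Will do.",
    "",
    "> > > And clean up after the pigs.",
    "> > But I have no pigs.",
    "> Yes, you do. Your kids.",
    "",
    "I see what you mean. They sure make a mess.",
]);

// m modifier: ^ matches at the start of every line; "." still stops at "\n".
// Each match is one "> " line plus any directly following "> " lines.
preg_match_all('~^> .*(?:\R> .*)*~m', $txt, $matches);
$blocks = $matches[0];

print_r($blocks);
```

$blocks should hold one string per quote block, with multi-line blocks kept together.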


With preg_split:

$result = preg_split('~^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

This splits the string on each line that doesn't start with "> ". To trim the newline at the end of each block, you can start the pattern with an optional \R?: ~\R?^(?!> ).*\R?~m. Alternatively, use ~(?:\R?^(?!> ).*)+\R?~m to consume several consecutive lines at a time.
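
To compare the three split patterns side by side, here is a rough sketch on the message from the question. The variable names and the "\n" line endings are assumptions for the demo.

```php
<?php
// Sample message from the question, joined with "\n" (an assumption for this demo).
$txt = implode("\n", [
    "> Don't forget to buy eggiweggs on the way home.",
    "",
    "I shall not.",
    "",
    "> Also remember to brush your shoes.",
    "",
    "Will do.",
    "",
    "> > > And clean up after the pigs.",
    "> > But I have no pigs.",
    "> Yes, you do. Your kids.",
    "",
    "I see what you mean. They sure make a mess.",
]);

// Basic pattern: every non-quote line (with its newline) is a delimiter.
// The newline that ends each quote block is left on the block.
$untrimmed = preg_split('~^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

// Leading \R? also swallows the newline that ends the preceding quote block.
$trimmed = preg_split('~\R?^(?!> ).*\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

// Grouped form: one delimiter match covers a whole run of non-quote lines.
$grouped = preg_split('~(?:\R?^(?!> ).*)+\R?~m', $txt, -1, PREG_SPLIT_NO_EMPTY);

var_dump($untrimmed[0], $trimmed[0], $trimmed === $grouped);
```

The last two calls should return the same three blocks; the grouped form only reduces the number of delimiter matches.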


About \R:
\R is by default an alias for (?>\r\n|\n|\x0b|\f|\r|\x85) (any newline sequence made of 8-bit, non-UTF-8 characters). In UTF-8 mode, with the u modifier or with (*UTF8)(*BSR_UNICODE) at the start of the pattern, two more characters outside the ASCII range are added to the list: the line separator (U+2028) and the paragraph separator (U+2029).
It's handy when you don't know which newline sequence is used in the string, but slower than writing the exact newline sequence if you know it. You can restrict \R to (?>\r\n|\n|\r) with the directive (*BSR_ANYCRLF) at the start of the pattern.
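
A quick way to see these three behaviours is to split a few made-up strings on \R. This sketch assumes PHP 7 or later (for the "\u{2028}" escape) and a PCRE build where \R defaults to the full list, as described above.

```php
<?php
// Made-up test string containing \r\n, \n, \r and a form feed (\f).
$sample = "a\r\nb\nc\rd\fe";

// Default \R: every sequence in the list is a separator, \r\n counts once.
$default = preg_split('~\R~', $sample);

// (*BSR_ANYCRLF): only \r\n, \n and \r count, so the form feed stays in the text.
$anycrlf = preg_split('~(*BSR_ANYCRLF)\R~', $sample);

// U+2028 (line separator) is only treated as a newline in UTF-8 mode.
$withU    = preg_split('~\R~u', "x\u{2028}y");
$withoutU = preg_split('~\R~', "x\u{2028}y");

echo count($default), ' ', count($anycrlf), ' ', count($withU), ' ', count($withoutU), PHP_EOL;
```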

CodePudding user response:

My idea is to split the string on line breaks. Maybe this will help you?

foreach(explode("\n", $string) as $key=>$val) {
  if(preg_match('/^(>.*)$/', $val, $match))
    echo $match[1] . PHP_EOL;
}

output:

> Don't forget to buy eggiweggs on the way home.
> Also remember to brush your shoes.
> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.

Edit: I tried something else, but it is not perfect:

preg_match_all("/(>[^\n]+)/sm", $string, $match);
print_r($match);

output

Array
(
    [0] => > Don't forget to buy eggiweggs on the way home.
    [1] => > Also remember to brush your shoes.
    [2] => > > > And clean up after the pigs.
    [3] => > > But I have no pigs.
    [4] => > Yes, you do. Your kids.
)
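
Both attempts above return the quoted lines one by one. If whole blocks are wanted, the line-splitting idea can be extended by collecting consecutive quoted lines. This is only a sketch: $blocks and $current are made-up names, and it assumes "\n" line endings and that any line starting with > is a quote.

```php
<?php
// Sample message from the question, joined with "\n" (an assumption for this demo).
$string = implode("\n", [
    "> Don't forget to buy eggiweggs on the way home.",
    "",
    "I shall not.",
    "",
    "> Also remember to brush your shoes.",
    "",
    "Will do.",
    "",
    "> > > And clean up after the pigs.",
    "> > But I have no pigs.",
    "> Yes, you do. Your kids.",
    "",
    "I see what you mean. They sure make a mess.",
]);

$blocks  = [];  // finished quote blocks
$current = [];  // lines of the block being collected

foreach (explode("\n", $string) as $line) {
    if (strncmp($line, '>', 1) === 0) {
        // Quoted line: add it to the block in progress.
        $current[] = $line;
    } elseif ($current) {
        // First non-quoted line after a block: close the block.
        $blocks[] = implode("\n", $current);
        $current = [];
    }
}
if ($current) {
    // The message ended while still inside a quote block.
    $blocks[] = implode("\n", $current);
}

print_r($blocks);
```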

CodePudding user response:

Explicitly match the end of the quote

When a quoted block continues onto another line, the line break is followed by >: [newline] >. When a quoted block ends, the line break is followed by anything else: [newline] [not >].

This logic/pattern allows finding the whole quoted blocks. Using this code:

<?php

$input = <<<STUFF
> Don't forget to buy eggiweggs on the way home.

I shall not.

> Also remember to brush your shoes.

Will do.

> > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.

I see what you mean. They sure make a mess.
STUFF;

$regex = "/(>.*?)\n[^>]/s";
preg_match_all($regex, $input, $matches);

print_r($matches);

Results in:

Array
(
    [0] => Array
        (
            [0] => > Don't forget to buy eggiweggs on the way home.


            [1] => > Also remember to brush your shoes.

            [2] => > > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.


        )

    [1] => Array
        (
            [0] => > Don't forget to buy eggiweggs on the way home.
            [1] => > Also remember to brush your shoes.
            [2] => > > > And clean up after the pigs.
> > But I have no pigs.
> Yes, you do. Your kids.
        )

)

A note about the regex

/(>.*?)\n[^>]/s is:

/           # Regex start delimiter
  (         # Start of capturing group
    >       # Literal >
    .       # Any character, including newlines
    *?      # Any number of times, non-greedy match
  )         # End of Capturing group
  \n        # Newline
  [^>]      # anything _except_ a literal >
/           # Regex end delimiter
s           # PCRE_DOTALL flag (makes . also match newlines)

Choosing between (or combining) PCRE_DOTALL and PCRE_MULTILINE depends on the strategy employed - here the intent is only to modify the behavior of the dot (.). More info in the docs.

If the source text is coming from Windows, you may wish to use \R (as noted in a different answer).
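
As a rough illustration of that point, here is the same regex with \n swapped for \R, run on a shortened sample that uses "\r\n" line endings. The sample text and the variable names are assumptions for the demo.

```php
<?php
// Part of the sample message, joined with Windows-style "\r\n" line endings.
$input = implode("\r\n", [
    "> Also remember to brush your shoes.",
    "",
    "Will do.",
    "",
    "> > > And clean up after the pigs.",
    "> > But I have no pigs.",
    "> Yes, you do. Your kids.",
    "",
    "I see what you mean.",
]);

// Original pattern: the "\r" before each "\n" ends up inside the capture.
preg_match_all('/(>.*?)\n[^>]/s', $input, $withLf);

// \R consumes "\r\n" as one unit, so the capture stops before the line break.
preg_match_all('/(>.*?)\R[^>]/s', $input, $withR);

var_dump($withLf[1][0], $withR[1][0]);
```

With the \n version the first capture should end in a stray "\r"; with \R it should not. As with the original regex, a quote block at the very end of the text, with nothing after it, would not be matched.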

Why don't the attempts in the question work?

So if I do: #^(> .+?)$#us

The s modifier only affects the dot (.).

It has no effect on ^ or $, which still anchor to the start and end of the whole string. Because . now also matches newlines, the match runs from the first quote to the end of the string, so as noted in the question this can produce at most one match.

And if I do: #^(> .+?)$#um

The m modifier only affects ^ and $.

This regex is anchored to the start and end of each line, but the . will not match a newline - hence it matches each quoted line individually.

Flags are not mutually exclusive, and can be used in combination.
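
To make the difference concrete, the sketch below counts matches for the two patterns from the question and for one combined variant. The combined pattern is a hypothetical illustration written for this demo, not something taken from the question or the other answers, and "\n" line endings are assumed.

```php
<?php
// Sample message from the question, joined with "\n" (an assumption for this demo).
$txt = implode("\n", [
    "> Don't forget to buy eggiweggs on the way home.",
    "",
    "I shall not.",
    "",
    "> Also remember to brush your shoes.",
    "",
    "Will do.",
    "",
    "> > > And clean up after the pigs.",
    "> > But I have no pigs.",
    "> Yes, you do. Your kids.",
    "",
    "I see what you mean. They sure make a mess.",
]);

// s only: ^ and $ refer to the whole string, "." crosses newlines.
$countS = preg_match_all('#^(> .+?)$#us', $txt, $s);

// m only: ^ and $ refer to each line, "." stops at newlines.
$countM = preg_match_all('#^(> .+?)$#um', $txt, $m);

// Both (hypothetical variant): start at a quoted line, let "." cross newlines,
// and stop at the first line end that is not followed by another "> " line.
$countSM = preg_match_all('#^> .*?$(?!\R> )#usm', $txt, $sm);

echo "s: $countS, m: $countM, s+m: $countSM", PHP_EOL;
```

The s-only pattern should give a single match covering the whole message, the m-only pattern one match per quoted line, and the combined variant one match per quote block.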
