Regex on ByteArray that Starts With Either Or

I'm trying to build a regex for byte arrays. I have two kinds of byte arrays:

data1 = b'\xa0\xa0\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'
data2 = b'\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'

The difference between data1 and data2 is the prefix: data1 starts with a triple 0xA0 (\xa0\xa0\xa0), while data2 starts with a single 0xA0 (\xa0).

What I need is to get the data as is (starting with \xa0 to the end \xa0) and a way to distinguish the data to see if it starts with triple 0xA0 or a single 0xA0.

When I build the regex as

matches = re.search(b'\xa0(.+?)\xa0', data2, re.IGNORECASE)

It works with data2, but I can't tell whether the data is single or triple. It also doesn't work with data1 (it returns \xa0\xa0\xa0).

What doesn't work:

matches = re.search(b'\xa0\xa0\xa0(.+?)\xa0', data2, re.IGNORECASE)
matches = re.search(b'\xa0((\xa0\xa0))?(.+?)\xa0', data1, re.IGNORECASE)

How can I get the whole data with a regex and also check if it starts with triple or single 0xA0?

Thank you for your help,

CodePudding user response：

You can use an additional, optional capturing group to capture two more \xa0 bytes and, once there is a match, check that group. If it is None, the data is Type 2 (single 0xA0, like data2); otherwise it is Type 1 (triple 0xA0, like data1):

b'^\xa0(\xa0\xa0)?(.+?)\xa0'

In Python:

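import re

# data1 from the question (starts with a triple 0xA0)
data1 = b'\xa0\xa0\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'
rx = b'^\xa0(\xa0\xa0)?(.+?)\xa0'
m = re.search(rx, data1, re.IGNORECASE)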
if m:
    if m.group(1):
        print("This is data of Type 1")
    else:
        print("This is data of Type 2")

# => This is data of Type 1

I assume your matches happen at the start of string. If it is not always the case, you will need to replace ^ with a negative lookbehind:

b'(?<!\xa0)\xa0(\xa0\xa0)?(.+?)\xa0'

The (?<!\xa0) pattern is a negative lookbehind that fails the match if the current location is immediately preceded by the lookbehind pattern, here a \xa0 byte (in Latin-1, 0xA0 is the non-breaking space).
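To illustrate the lookbehind variant, here is a minimal sketch that searches for the frame inside a larger buffer. The helper name classify, the two leading junk bytes, and the re.DOTALL flag are assumptions added for the example (re.DOTALL lets . also match a 0x0A byte, which the sample data does not contain); they are not part of the pattern above.

```python
import re

# Lookbehind pattern from the answer; re.DOTALL is an added assumption so that
# "." also matches a newline byte (0x0A) if one ever shows up in the payload.
PATTERN = re.compile(b'(?<!\xa0)\xa0(\xa0\xa0)?(.+?)\xa0', re.DOTALL)

def classify(buf):
    """Hypothetical helper: return (kind, payload), or None if nothing matches."""
    m = PATTERN.search(buf)
    if m is None:
        return None
    # Group 1 is None when only a single 0xA0 starts the frame.
    kind = "triple" if m.group(1) else "single"
    return kind, m.group(2)

payload = b'\x81\x01\x04\x07\x00\x00\x0f2\x8e'
# Two made-up leading bytes so the frame does not sit at the start of the buffer.
stream1 = b'\x00\x01' + b'\xa0\xa0\xa0' + payload + b'\xa0\xa0'
stream2 = b'\x00\x01' + b'\xa0' + payload + b'\xa0\xa0'

print(classify(stream1))
print(classify(stream2))
```

With these inputs the first call reports "triple" and the second "single", and both return the same payload bytes.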

CodePudding user response：

Here's a modification of your first regex that:

uses a greedy match (.+) instead of the non-greedy (.+?); and
looks for one to three \xa0 bytes, as many as are present, at the start of the match.

Your first regex doesn't work because, being non-greedy, it captures the shortest string between \xa0 and the next \xa0, which for data1 is just \xa0\xa0\xa0. Afterwards you can use startswith to see which kind of data it is:

# get contents
matches = re.search(b'\xa0{1,3}(.+)\xa0', data1, re.IGNORECASE)

# check type
is_like_data1 = data1.startswith(b'\xa0'*3)
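
Putting the two steps together, a small sketch run against both samples from the question might look like this. The function name split_frame is made up for the example, and the re.IGNORECASE flag is dropped because the pattern contains no letters.

```python
import re

data1 = b'\xa0\xa0\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'
data2 = b'\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'

def split_frame(data):
    """Hypothetical helper: return (contents, starts_with_triple)."""
    # Greedy: consume up to three leading 0xA0 bytes, then everything up to
    # the last 0xA0 in the data.
    m = re.search(b'\xa0{1,3}(.+)\xa0', data)
    contents = m.group(1) if m else None
    # The type check is a plain prefix test, independent of the regex.
    return contents, data.startswith(b'\xa0' * 3)

# For comparison: the non-greedy pattern stops at the first possible closing
# 0xA0, so on data1 the whole match is only the three leading 0xA0 bytes.
short_match = re.search(b'\xa0(.+?)\xa0', data1).group(0)

print(split_frame(data1))
print(split_frame(data2))
print(short_match)
```

Because both samples end with two \xa0 bytes, the greedy group keeps one trailing \xa0 in the captured contents; strip it afterwards if that is not wanted.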

CodePudding user response：

Your solution of

matches = re.search(b'\xa0\xa0\xa0(.+?)\xa0', data2, re.IGNORECASE)

does work; however, you've applied it to data2 rather than data1.

That regex will find a match for data1 but not for data2, as intended. You can use this regex first: if it matches, the data is a "triple". Then apply the more general regex to the remaining byte arrays; those that match are the "singles".
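
A rough sketch of this two-step check, assuming the two patterns from the question and a made-up helper name classify:

```python
import re

data1 = b'\xa0\xa0\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'
data2 = b'\xa0\x81\x01\x04\x07\x00\x00\x0f2\x8e\xa0\xa0'

# The specific pattern must be tried first; the general one also matches data1,
# but only captures a lone 0xA0 there.
TRIPLE = re.compile(b'\xa0\xa0\xa0(.+?)\xa0')
SINGLE = re.compile(b'\xa0(.+?)\xa0')

def classify(data):
    """Hypothetical helper: return (kind, payload), or None if neither matches."""
    m = TRIPLE.search(data)
    if m:
        return "triple", m.group(1)
    m = SINGLE.search(data)
    if m:
        return "single", m.group(1)
    return None

print(classify(data1))
print(classify(data2))
```

The order of the two searches is the whole point of the design: swapping them would classify every input as a "single".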