Convert strings with an unknown number of hex strings embedded in them to strings using regex

Time:01-27

So I have a list of strings (content from Snort rules), and I am trying to convert the hex portions of them to UTF-8/ASCII, so I can send the content over netcat.

The method I have now works fine for strings with a single hex byte (e.g. 3A), but breaks when there's a series of hex bytes (e.g. 3A 4B 00 FF).

My current solution is:

import re
import codecs

def convert_hex(match):
  string = match.group(1)
  string = string.replace(" ", "")
  decode_hex = codecs.getdecoder("hex_codec")
  try:
    result = decode_hex(string)[0]
  except:
    result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
  return result.decode("utf-8")


strings = ['|0A|Referer|3A| res|3A|/C|3A|', 'RemoteNC Control Password|3A|', '/bbs/search.asp', 'User-Agent|3A| Mozilla/4.0 |28|compatible|3B| MSIE 5.0|3B| Windows NT 5.0|29|']

converted_strings = []

for string in strings:
    for i in range(len(string)):
        string = re.sub(r"\|(.{2})\|", convert_hex, string)
    converted_strings.append(string)

For the strings in strings, this works, but for a string like:

|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|

it breaks.

I tried changing the regex to:

re.sub(r"\|.*([A-Fa-f0-9]{2}).*\|")

but that only converts the last hex.

I need this solution to work for strings like Hello|3A|World, |3A 00 FF|, and Hello|3A 00|World

I know it's an issue with the regexp, but I'm not sure what exactly.

Any help would be much appreciated.

Answer:

It looks like the text between | symbols is either entirely hex, i.e. it matches (?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2}, or not hex at all?

This works:

for string in strings:
    # re.sub already replaces every match, so one call per string is enough
    string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
    converted_strings.append(string)

(extra parentheses for a capturing group 1 - you could leave out one pair of parentheses and change your function to act on group(0) instead)
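As a quick check of what the pattern picks up, here is a minimal sketch that runs it over the three inputs from the question, next to the question's original pattern. The names HEX_RUN, OLD and samples are mine, not from the original post. Note that the lookarounds match only the hex run itself, so a substitution based on this pattern leaves the surrounding | characters in the output.

```python
import re

# The pattern from this answer: one or more hex pairs separated by a single
# whitespace character, sitting between two pipes. The pipes are checked by
# lookarounds, so they are not part of the match.
HEX_RUN = re.compile(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)")

# The pattern from the question: exactly two characters between pipes.
OLD = re.compile(r"\|(.{2})\|")

samples = ["Hello|3A|World", "|3A 00 FF|", "Hello|3A 00|World"]

# findall returns the contents of capturing group 1 for each match
new_hits = [HEX_RUN.findall(s) for s in samples]
old_hits = [OLD.findall(s) for s in samples]

print(new_hits)  # [['3A'], ['3A 00 FF'], ['3A 00']]
print(old_hits)  # [['3A'], [], []]
```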

The conversion still fails on your example |08 00 00 00 27 C7 CC 6B C2 FD 13 0E|, though, because those bytes are not valid UTF-8 (0xC7 starts a two-byte sequence, but the following 0xCC is not a continuation byte). The resulting error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 5: invalid continuation byte

However, a valid UTF-8 encoded multi-byte string like '|74 65 73 74 20 f0 9f 98 80|' works just fine:

import re

def convert_hex(match):
  # group(1) holds the hex pairs, e.g. "74 65 73 74"
  hex_string = match.group(1).replace(" ", "")
  # turn the hex pairs into bytes, then interpret those bytes as UTF-8
  return bytes.fromhex(hex_string).decode("utf-8")


strings = ['|74 65 73 74 20 f0 9f 98 80|']

converted_strings = []

for string in strings:
    # re.sub already replaces every match, so one call per string is enough
    string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
    converted_strings.append(string)

print(converted_strings)

Result:

['|test 😀|']
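For byte sequences that are not valid UTF-8, such as the |08 00 00 00 27 C7 CC 6B C2 FD 13 0E| example, one option is to skip text decoding altogether and build bytes, which is what ends up on the wire anyway. The following is only a sketch under a few assumptions: the literal (non-hex) parts are treated as UTF-8 text, the pipes are meant to be dropped from the output, and snort_content_to_bytes and HEX_BLOCK are hypothetical names, not something from the original post.

```python
import re

# Unlike the lookaround version, this pattern consumes the pipes, so they are
# removed from the output and a pipe is never shared between two matches.
HEX_BLOCK = re.compile(r"\|((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})\|")

def snort_content_to_bytes(content: str) -> bytes:
    """Hypothetical helper: turn |hex| runs into raw bytes, keep the rest as text."""
    out = bytearray()
    pos = 0
    for m in HEX_BLOCK.finditer(content):
        # literal text between the previous hex run and this one
        out += content[pos:m.start()].encode("utf-8")
        # hex pairs with all whitespace removed, converted to raw bytes
        out += bytes.fromhex("".join(m.group(1).split()))
        pos = m.end()
    # literal text after the last hex run
    out += content[pos:].encode("utf-8")
    return bytes(out)

print(snort_content_to_bytes("Hello|3A 00|World"))
# b'Hello:\x00World'
print(snort_content_to_bytes("|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|"))
```

If a str is needed after all, calling .decode("latin-1") on the result maps every byte to exactly one character and cannot fail, at the cost of not interpreting multi-byte UTF-8 sequences.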