Extracting code from a string using regex in Python

Time:01-15

I'm trying to extract the assembly instructions from a string, but my regex isn't right: I can only extract the opcode bytes, not the instruction text.

import re
text = """
┌ 38: fcn.00014840 ();
│           ; var int64_t var_38h @ rsp 0xffffffd0
│           0x00014840      53             push rbx
│           0x00014841      31f6           xor esi, esi
│           0x00014843      31ff           xor edi, edi
│           0x00014845      e846f2feff     call sym.imp.getcwd
│           0x0001484a      4885c0         test rax, rax
│           0x0001484d      4889c3         mov rbx, rax
│       ┌─< 0x00014850      740e           je 0x14860
│       │   ; CODE XREF from fcn.00014840 @ 0x14868
│      ┌──> 0x00014852      4889d8         mov rax, rbx
│      ╎│   0x00014855      5b             pop rbx
│      ╎│   0x00014856      c3             ret
..
│      ╎│   ; CODE XREF from fcn.00014840 @ 0x14850
│      ╎└─> 0x00014860      e88beffeff     call sym.imp.__errno_location
│      ╎    0x00014865      83380c         cmp dword [rax], 0xc
│      └──< 0x00014868      75e8           jne 0x14852
└           0x0001486a      e861feffff     call fcn.000146d0
            ; CALL XREFS from fcn.00013d00 @ 0x13d9d, 0x13da8
"""

print("\n".join(re.findall('0x[0-9a-fA-F]{8}[0-9a-fA-F](.*?)',text)))

I want the output to look like this:

push rbx
xor esi, esi
xor edi, edi
call sym.imp.getcwd
test rax, rax
mov rbx, rax
je 0x14860
mov rax, rbx
pop rbx
ret
call sym.imp.__errno_location
cmp dword [rax], 0xc
jne 0x14852
call fcn.000146d0

CodePudding user response:

You can try this:

out = "\n".join(re.findall(r"0x[0-9a-fA-F]{8} +[^ ]+ +([a-z].*)", text))
print(out)

It gives:

push rbx
xor esi, esi
xor edi, edi
call sym.imp.getcwd
test rax, rax
mov rbx, rax
je 0x14860
mov rax, rbx
pop rbx
ret
call sym.imp.__errno_location
cmp dword [rax], 0xc
jne 0x14852
call fcn.000146d0
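
For reference, here is a minimal self-contained sketch of the same idea on a shortened copy of the listing. It is a variation, not the pattern above verbatim: the middle field is restricted to hex digits, and the capture takes the rest of the line. The names INSTRUCTION and extract_instructions are hypothetical, chosen only for this illustration.

```python
import re

# Shortened sample of the disassembly from the question.
sample = """\
│           ; var int64_t var_38h @ rsp 0xffffffd0
│           0x00014840      53             push rbx
│       ┌─< 0x00014850      740e           je 0x14860
│      ╎    0x00014865      83380c         cmp dword [rax], 0xc
            ; CALL XREFS from fcn.00013d00 @ 0x13d9d, 0x13da8
"""

# An 8-digit address, one or more spaces, the opcode bytes, one or more
# spaces, then the instruction text up to the end of the line.
# (Hypothetical name; a variation on the pattern in the answer.)
INSTRUCTION = re.compile(r"0x[0-9a-fA-F]{8} +[0-9a-fA-F]+ +(.+)")


def extract_instructions(listing):
    """Return the instruction text of every address line, in order."""
    return [m.group(1).rstrip() for m in INSTRUCTION.finditer(listing)]


print("\n".join(extract_instructions(sample)))
```

Comment lines are skipped because the addresses they mention are either shorter than eight hex digits or not followed by an opcode column.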

CodePudding user response:

You can use re.sub to replace matches of the following regular expression with empty strings:

(?m)^(?:(?!.{12}0x[\da-fA-F]{8}).*\r?\n|.{43})

The regular expression can be broken down as follows.

(?m)              # set multiline flag causing '^' and '$' to match
                  # the beginning and end of each line respectively
^                 # match beginning of line
(?:               # begin non-capture group
  (?!             # begin negative lookahead
    .{12}0x       # match 12 characters followed by '0x'
    [\da-fA-F]{8} # match 8 characters contained in the character class
  )               # end negative lookahead
  .*\r?\n         # match the entire line including the terminator
|                 # or
  .{43}           # match 43 characters
)                 # end non-capture group
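
A minimal sketch of applying this expression with re.sub, assuming the fixed column layout of the listing in the question: a 12-character gutter, a 10-character address, 6 spaces and a 15-character opcode column, which add up to the 43 characters stripped from each address line. The row helper and the shortened listing are hypothetical, used only to make those widths explicit.

```python
import re

# The pattern from the answer: remove every line that has no address at
# offset 12, and remove the first 43 characters of the lines that do.
PATTERN = re.compile(r"(?m)^(?:(?!.{12}0x[\da-fA-F]{8}).*\r?\n|.{43})")


def row(gutter, address, opcodes, instruction):
    # Hypothetical helper that rebuilds one listing line:
    # 12-character gutter + 10-character address + 6 spaces
    # + 15-character opcode column = 43 characters before the instruction.
    return f"{gutter:<12}{address}      {opcodes:<15}{instruction}\n"


listing = (
    "│           ; var int64_t var_38h @ rsp 0xffffffd0\n"
    + row("│", "0x00014840", "53", "push rbx")
    + row("│       ┌─<", "0x00014850", "740e", "je 0x14860")
    + "..\n"
    + row("│      ╎", "0x00014865", "83380c", "cmp dword [rax], 0xc")
)

result = PATTERN.sub("", listing)
print(result, end="")
```

Because the first alternative requires a line terminator, a final non-address line with no trailing newline would not be removed as a whole, so it is safest to make sure the text ends with a newline.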