I'm trying to extract assembly code from a string, but my regex is not right: I can only extract the opcodes, not the instruction text.
import re
text = """
┌ 38: fcn.00014840 ();
│ ; var int64_t var_38h @ rsp 0xffffffd0
│ 0x00014840 53 push rbx
│ 0x00014841 31f6 xor esi, esi
│ 0x00014843 31ff xor edi, edi
│ 0x00014845 e846f2feff call sym.imp.getcwd
│ 0x0001484a 4885c0 test rax, rax
│ 0x0001484d 4889c3 mov rbx, rax
│ ┌─< 0x00014850 740e je 0x14860
│ │ ; CODE XREF from fcn.00014840 @ 0x14868
│ ┌──> 0x00014852 4889d8 mov rax, rbx
│ ╎│ 0x00014855 5b pop rbx
│ ╎│ 0x00014856 c3 ret
..
│ ╎│ ; CODE XREF from fcn.00014840 @ 0x14850
│ ╎└─> 0x00014860 e88beffeff call sym.imp.__errno_location
│ ╎ 0x00014865 83380c cmp dword [rax], 0xc
│ └──< 0x00014868 75e8 jne 0x14852
└ 0x0001486a e861feffff call fcn.000146d0
; CALL XREFS from fcn.00013d00 @ 0x13d9d, 0x13da8
"""
print("\n".join(re.findall('0x[0-9a-fA-F]{8}[0-9a-fA-F](.*?)',text)))
I want the output to look like this:
push rbx
xor esi, esi
xor edi, edi
call sym.imp.getcwd
test rax, rax
mov rbx, rax
je 0x14860
mov rax, rbx
pop rbx
ret
call sym.imp.__errno_location
cmp dword [rax], 0xc
jne 0x14852
call fcn.000146d0
CodePudding user response:
You can try this:
out = "\n".join(re.findall(r"0x[0-9a-fA-F]{8} +[^ ]+ +([a-z].*)", text))
print(out)
It gives:
push rbx
xor esi, esi
xor edi, edi
call sym.imp.getcwd
test rax, rax
mov rbx, rax
je 0x14860
mov rax, rbx
pop rbx
ret
call sym.imp.__errno_location
cmp dword [rax], 0xc
jne 0x14852
call fcn.000146d0
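A line-by-line variant of the same idea, as a minimal sketch rather than a definitive solution. It assumes every instruction line has the form address (0x plus 8 hex digits), raw bytes, instruction text, separated by whitespace. The names INSTR_RE and extract_instructions and the shortened sample are made up for illustration. Matching one line at a time keeps \s+ from running across a line break.

```python
import re

# Address (0x + 8 hex digits), raw byte column (a run of hex digits),
# then the instruction text, which is captured.
INSTR_RE = re.compile(r"0x[0-9a-fA-F]{8}\s+[0-9a-fA-F]+\s+(\S.*)")

def extract_instructions(listing):
    """Return the instruction text of every disassembly line in `listing`."""
    instructions = []
    for line in listing.splitlines():
        match = INSTR_RE.search(line)
        if match:  # comment and XREF lines do not match and are skipped
            instructions.append(match.group(1).rstrip())
    return instructions

# Shortened sample built from lines of the listing in the question.
sample = "\n".join([
    "│ ; var int64_t var_38h @ rsp 0xffffffd0",
    "│ 0x00014840 53 push rbx",
    "│ ┌─< 0x00014850 740e je 0x14860",
    "│ ╎ 0x00014865 83380c cmp dword [rax], 0xc",
    "; CALL XREFS from fcn.00013d00 @ 0x13d9d, 0x13da8",
])

print("\n".join(extract_instructions(sample)))
```

The short addresses in the XREF comments (such as 0x13d9d) have fewer than 8 hex digits, so those lines are ignored.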
CodePudding user response:
You can use re.sub to convert matches of the following regular expression to empty strings:
(?m)^(?:(?!.{12}0x[\da-fA-F]{8}).*\r?\n|.{43})
The regular expression can be broken down as follows.
(?m)                 # set multiline flag causing '^' and '$' to match
                     # the beginning and end of each line respectively
^                    # match beginning of line
(?:                  # begin non-capture group
  (?!                # begin negative lookahead
    .{12}0x          # match 12 characters followed by '0x'
    [\da-fA-F]{8}    # match 8 characters contained in the character class
  )                  # end negative lookahead
  .*\r?\n            # match the entire line including the terminator
|                    # or
  .{43}              # match 43 characters
)                    # end non-capture group
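Here is a rough sketch of how the substitution could be applied. The pattern depends on fixed column widths, so the sample is built with an assumed layout: a 12-character margin, the 10-character address, 6 spaces, and a byte column padded to 15 characters, which puts the instruction at column 43. The helper make_row and that layout are assumptions for illustration only; adjust the numbers if your listing is spaced differently.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = r"(?m)^(?:(?!.{12}0x[\da-fA-F]{8}).*\r?\n|.{43})"

def make_row(addr, raw, instr):
    # Hypothetical fixed-width layout (an assumption, not taken from the post):
    # 12-char margin + 10-char address + 6 spaces + 15-char byte column = 43.
    return "│" + " " * 11 + addr + " " * 6 + raw.ljust(15) + instr

listing = "\n".join([
    "│" + " " * 11 + "; var int64_t var_38h @ rsp",  # comment line, dropped whole
    make_row("0x00014840", "53", "push rbx"),
    make_row("0x00014841", "31f6", "xor esi, esi"),
]) + "\n"

# Lines without an address at column 12 are removed entirely; on the other
# lines only the first 43 characters are removed.
cleaned = re.sub(PATTERN, "", listing)
print(cleaned, end="")
```

Note that a non-instruction line is only removed when it ends with a line terminator, so the sketch makes sure the listing ends with a newline.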
