I have a file that contains HTML code in which the tags are encoded like this:
\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
The decoded HTML should be:
<div data-name="region-name" class="main-id">UK</div>
In Ruby, I tried CGI.unescapeHTML from the cgi library, but it does not work: when the content is read, the encoded tags are not recognized. Here is an example:
require 'cgi'
single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
double_quoted_string = "\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e"
puts 'unescape single_quoted_string ' + CGI.unescapeHTML(single_quoted_string)
puts 'unescape double_quoted_string ' + CGI.unescapeHTML(double_quoted_string)
The output of the previous code is:
unescape single_quoted_string \x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e
unescape double_quoted_string <div data-name="region-name" class="main-id">UK</div>
My question is, how can I make the single_quoted_string act as if its content is double-quoted to make the function understand the encoded tags?
Thanks
CodePudding user response:
Ruby's parser allows certain escape sequences in string literals.
The double-quoted string literal "\x3c" contains the hexadecimal escape \xnn, which represents the single character < (0x3C in ASCII).
The single-quoted string literal '\x3c' however is treated literally, i.e. it represents four characters: \, x, 3, and c.
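The difference is easy to confirm in a quick script. This is only a small sketch; the variable names are made up for illustration:

```ruby
double_quoted = "\x3c"   # the parser interprets the escape
single_quoted = '\x3c'   # the backslash is kept literally

puts double_quoted.length    # 1
puts single_quoted.length    # 4
p single_quoted.chars        # ["\\", "x", "3", "c"]
puts double_quoted == "<"    # true
```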
"how can I make the single_quoted_string act as if its content is double-quoted"
You can't. In order to turn these four characters into < you have to parse the string yourself:
str = '\x3c'
str[2, 2] #=> "3c" take hex part
str[2, 2].hex #=> 60 convert to number
str[2, 2].hex.chr #=> "<" convert to character
You can apply this to gsub:
str = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
str.gsub(/\\x\h{2}/) { |m| m[2, 2].hex.chr }
#=> "<div data-name=\"region-name\" class=\"main-id\">UK</div>"
/\\x\h{2}/ matches a literal backslash (\\) followed by x and two ({2}) hex characters (\h).
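If the escaped data may contain bytes above 0x7F (for example UTF-8 text written as \xc3\xa9), the same idea can be wrapped in a method that works on the raw bytes. This is a rough sketch, not a definitive implementation: the method name decode_hex_escapes is hypothetical, and it assumes the decoded bytes are valid UTF-8.

```ruby
# Hypothetical helper: decodes literal \xNN sequences, assuming the
# decoded bytes form UTF-8 text.
def decode_hex_escapes(str)
  # Work on a binary copy so single bytes >= 0x80 can be inserted safely.
  decoded = str.b.gsub(/\\x(\h{2})/) { Regexp.last_match(1).hex.chr }
  # Reinterpret the resulting bytes as UTF-8.
  decoded.force_encoding(Encoding::UTF_8)
end

puts decode_hex_escapes('\x3cb\x3eUK\x3c/b\x3e')   # <b>UK</b>
puts decode_hex_escapes('caf\xc3\xa9')              # café
```

If the file uses a different encoding, the argument to force_encoding would need to change accordingly.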
Just for reference, a CGI encoded string would look like this:
str = "<div data-name=\"region-name\" class=\"main-id\">UK</div>"
CGI.escapeHTML(str)
#=> "&lt;div data-name=&quot;region-name&quot; class=&quot;main-id&quot;&gt;UK&lt;/div&gt;"
It uses &...; style character references.
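To see both directions side by side, here is a short sketch using only CGI.escapeHTML and CGI.unescapeHTML (the sample markup and variable names are just for illustration). It also shows why the \x input is left alone: it contains no character references.

```ruby
require 'cgi'

html = '<div class="main-id">UK</div>'
escaped = CGI.escapeHTML(html)

puts escaped                              # &lt;div class=&quot;main-id&quot;&gt;UK&lt;/div&gt;
puts CGI.unescapeHTML(escaped) == html    # true
# A literal \x3c contains no &...; reference, so it passes through unchanged.
puts CGI.unescapeHTML('\x3cdiv\x3e')      # \x3cdiv\x3e
```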
CodePudding user response:
Your problem has nothing to do with HTML: \x3c is an escape for the character with hex value 3c in the ASCII table, which is <.
Double-quoted strings look for these patterns and convert them to the corresponding characters; single-quoted strings keep them as literal text.
You can check for yourself that CGI is not doing anything.
CGI.unescapeHTML(double_quoted_string) == double_quoted_string
The easiest way I know to solve your problem is with gsub:
def convert(str)
  # \h matches a single hex digit, so only valid \xNN escapes are replaced
  str.gsub(/\\x(\h\h)/) do
    [Regexp.last_match(1)].pack("H*")
  end
end
single_quoted_string = '\x3cdiv data-name\x3d\x22region-name\x22 class\x3d\x22main-id\x22\x3eUK\x3c/div\x3e'
puts convert(single_quoted_string)
What convert does is find every \xNN escape and pack its two hex digits into the corresponding character.
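To look at the pack step in isolation, here is a tiny sketch (hex_to_bytes is a hypothetical helper name used only for this illustration). The "H*" directive reads a string of hex digits, high nibble first, and emits one byte per pair of digits; unpack1 with the same directive goes the other way.

```ruby
# Hypothetical helper: turns a string of hex digits into the bytes it encodes.
def hex_to_bytes(hex)
  [hex].pack("H*")
end

p hex_to_bytes("3c")        # "<"
p hex_to_bytes("3c2f3e")    # "</>"
p "<".unpack1("H*")         # "3c"
```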
