How to convert UTF-8 notation to python unicode notation

Time:01-16

Using Python 3.8, I would like to convert Unicode notation to Python escape notation:

s = 'U 00A0'
result = s.lower() # output  'u 00a0'

I want to replace u with \u:

result = s.lower().replace('u ','\u') 

But I get the error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

How can I convert the notation U 00A0 to \u00a0 ?

EDIT:

The reason I wanted to get \u00a0 is to further use encode method to get b'\xc2\xa0'.

My question: given a string in the notation U 00A0, I would like to convert it to the bytes object b'\xc2\xa0'.

CodePudding user response:

You are confusing the representation of a value with the value itself.

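import re
# raw string avoids the invalid "\ " escape; lower() lets 'U 00A0' match the lowercase pattern
re.sub(r"u ([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s.lower())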

For u 00a0 the result is displayed as '\xa0', which is just how Python shows that one-character string.

The literal "\u00a0" is displayed the same way:

s = "\u00a0"
print(repr(s))

Once you have the proper value as a Unicode string, you can encode it to UTF-8:

s = "\xa0"
print(s.encode('utf8'))
# b'\xc2\xa0'

So the complete answer is:

import re
s = "u 00a0"
s2 = re.sub(r"u ([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)
s_bytes = s2.encode('utf8') # b'\xc2\xa0'
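
The same steps can be wrapped in a small function that also accepts the uppercase form from the question. This is only a sketch: the name notation_to_utf8 is made up for illustration, and the pattern assumes the exact "U XXXX" shape shown above (the letter U, one space, four hex digits).

```python
import re

# Assumed input shape: "U" or "u", one space, exactly four hex digits.
_CODEPOINT = re.compile(r"u ([0-9a-f]{4})", re.IGNORECASE)

def notation_to_utf8(text):
    """Replace each 'U XXXX' token with its character, then encode as UTF-8."""
    decoded = _CODEPOINT.sub(lambda m: chr(int(m.group(1), 16)), text)
    return decoded.encode("utf-8")

print(notation_to_utf8("U 00A0"))  # b'\xc2\xa0'
```

Using re.IGNORECASE avoids the separate lower() call, and any text around the token is passed through unchanged.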

CodePudding user response:

You can also use this:

>>> s = 'U 00A0'
>>> s = s.replace('U ', '\\u').encode().decode('unicode_escape').encode()
>>> s
b'\xc2\xa0'

CodePudding user response:

You need to escape the \ in replace with a second \:

result = s.lower().replace('u ','\\u') 
print(result)

This prints \u00a0 as six literal characters of text.
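
Note that the replaced string is still escape text rather than the character it names, so encoding it directly does not produce b'\xc2\xa0'. A short sketch of the difference, reusing the s from the question (the variable names escaped and char are illustrative):

```python
s = 'U 00A0'
escaped = s.lower().replace('u ', '\\u')  # text: backslash, u, 0, 0, a, 0

print(len(escaped))      # 6
print(escaped.encode())  # b'\\u00a0'

# Interpreting the escape text yields the actual one-character string.
char = escaped.encode('ascii').decode('unicode_escape')
print(len(char))             # 1
print(char.encode('utf-8'))  # b'\xc2\xa0'
```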
