unicodedata.name('\x00') raises a ValueError exception:
Python 3.8.10 (default, Sep 28 2021, 16:10:42)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information
>>> import unicodedata
>>> unicodedata.name('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
\x00 is the NUL character. Why does unicodedata.name('\x00') raise an exception? I am getting the same error for other non-printable ASCII characters (\x00 to \x1F, and \x7F). Is unicodedata.name() only for printable characters? If so, where is it mentioned in the Python documentation?
CodePudding user response:
The name of a Unicode character refers to the entries in this list: https://www.unicode.org/Public/13.0.0/ucd/NamesList.txt
As you can see there, every non-printable ASCII control character is listed as "<control>". "NULL" is not the name of U+0000; it is an alias.
Why Python raises an exception instead of returning "<control>" is another question, which I can't answer.
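If the goal is just to avoid the exception, unicodedata.name() accepts an optional second argument that is returned when the character has no name. A minimal sketch; the helper name safe_name and the "<control>" fallback string are my own choices for illustration, not something the module defines:

```python
import unicodedata

def safe_name(ch, fallback="<control>"):
    # Hypothetical helper: unicodedata.name(chr, default) returns `default`
    # instead of raising ValueError when `chr` has no name.
    return unicodedata.name(ch, fallback)

print(safe_name("\x00"))  # <control>
print(safe_name("A"))     # LATIN CAPITAL LETTER A
```

Note that the fallback is returned for any unnamed code point, not only for control characters, so the string "<control>" is only accurate if you know the input is in the ranges from the question.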
CodePudding user response:
According to this Wikipedia article, Cc control characters have no name in Unicode. All the characters you mentioned are in the Cc category (you can confirm this with unicodedata.category()):
>>> import unicodedata
>>> unicodedata.category('\x00')
'Cc'
>>> unicodedata.category('\x1F')
'Cc'
>>> unicodedata.category('\x7F')
'Cc'
In Unicode, "Control-characters" are U+0000–U+001F (C0 controls), U+007F (delete), and U+0080–U+009F (C1 controls). Their General Category is "Cc". Formatting codes are distinct, in General Category "Cf". The Cc control characters have no Name in Unicode, but are given labels such as "<control-001A>" instead.
You can also see that control characters are handled explicitly in the CPython source code.
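If you want the labels from the quote above rather than an exception, you can build them yourself from the category check. This is only a sketch: char_label is a hypothetical helper of mine, and returning None for other unnamed code points (unassigned, private use, surrogates) is an assumption made to keep it short.

```python
import unicodedata

def char_label(ch):
    """Return the Unicode name, or a <control-XXXX> label for Cc characters."""
    # Passing a default avoids the ValueError for unnamed characters.
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    # Control characters have no name; build the label from the code point.
    if unicodedata.category(ch) == "Cc":
        return "<control-{:04X}>".format(ord(ch))
    # Other unnamed code points are not handled in this sketch.
    return None

print(char_label("\x1a"))  # <control-001A>
print(char_label("\x00"))  # <control-0000>
print(char_label("A"))     # LATIN CAPITAL LETTER A
```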
