I have a dataframe that contains a text column. Some of the text in that column is struck through using the Unicode combining character '\u0336'. When that happens, the column alignment is messed up. Why, and how can I fix it?
if __name__ == '__main__':
    import pandas as pd

    phenotype = {0: 0, 1: 0}
    # Append the combining stroke to every character to strike it through
    text = ''.join(x + '\u0336' for x in str(phenotype))
    data = {"phenotype": [f"{phenotype}", text]}
    print(pd.DataFrame(data=data).to_string(justify="right"))
Result:
                  phenotype
0              {0: 0, 1: 0}
1  {̶0̶:̶ ̶0̶,̶ ̶1̶:̶ ̶0̶}̶
Expected:
      phenotype
0  {0: 0, 1: 0}
1  {̶0̶:̶ ̶0̶,̶ ̶1̶:̶ ̶0̶}̶
CodePudding user response:
Using combining characters is brave, and you have been bitten.
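Most display code handles the common Unicode characters correctly, but as soon as the number of code points in a string differs from the number of display positions it occupies, weird things are to be expected. A combining character such as U+0336 is a code point of its own yet takes no column on screen, so padding computed from the string length comes out wrong.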
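To make the mismatch concrete, here is a minimal sketch that pads by display width instead of by len(). The helpers display_width and rjust_display are hypothetical names, not pandas functions, and the width rule is a simplifying assumption: it only skips combining marks (Unicode categories Mn and Me) and does not handle wide East Asian characters or other zero-width code points.

```python
import unicodedata


def display_width(s: str) -> int:
    """Count code points that occupy a column, skipping combining marks.

    Assumption: every code point that is not a combining mark is one column wide.
    """
    return sum(1 for ch in s if unicodedata.category(ch) not in ("Mn", "Me"))


def rjust_display(s: str, width: int) -> str:
    """Right-justify by display width rather than by len()."""
    return " " * max(width - display_width(s), 0) + s


plain = "{0: 0, 1: 0}"
struck = "".join(ch + "\u0336" for ch in plain)

# len() counts code points, so the struck string looks twice as long
print(len(plain), len(struck))
print(display_width(plain), display_width(struck))

# Build the table by hand so that padding uses the display width
width = max(display_width(v) for v in ("phenotype", plain, struck))
for label, value in (("", "phenotype"), ("0", plain), ("1", struck)):
    print(f"{label:>1}  {rjust_display(value, width)}")
```

This bypasses to_string() entirely. Pre-padding the values and handing them back to pandas would not help, since pandas would pad them again based on their length.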
Despite having decent formatting features, pandas is mainly a computing tool. Furthermore, its underlying storage is NumPy, which means it is great at processing numeric data and less efficient with strings. What you are trying to do is not what pandas is meant for. IMHO it is indeed a bug, and you can file a bug report about it. I am unsure whether or when it will be fixed, because it is outside the library's core goal.
The usual way to mark something as deleted is to add a boolean column, or to replace the value with NaN or an empty string, or... but please do not use the COMBINING LONG STROKE OVERLAY (U+0336) Unicode character for this. I can confirm that Tk tools like IDLE do not process it correctly either.
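As a rough sketch of the boolean-column approach, the deleted state can live in its own column while the text stays plain. The column name "deleted" is only an illustrative choice.

```python
import pandas as pd

phenotype = {0: 0, 1: 0}
df = pd.DataFrame(
    {
        "phenotype": [str(phenotype), str(phenotype)],
        "deleted": [False, True],  # hypothetical flag: row 1 is the deleted one
    }
)

# No combining characters are involved, so the usual padding lines up
table = df.to_string(justify="right")
print(table)
```

The flag is also easier to filter on later, for example with df[~df["deleted"]].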
If you use Jupyter, one option is to apply the strikethrough with HTML/CSS styling instead. But that only works where the HTML is actually rendered, such as in a Jupyter notebook...
