I have observed that when a uint8_t buffer (not guaranteed to be null-terminated) is written into a stringstream with the << operator using ss << buff.data, and the resulting std::string is returned to Python, Python throws an error:
UnicodeDecodeError: 'utf-8' codec can't decode byte
But, if I use ss.write(buff.data, buff.size), the issue is not there.
I assume this happens because << overruns the buffer, so the data in ss may no longer be valid UTF-8. With write(), I pass the size explicitly, so no garbage data can be picked up.
What is surprising is that if I do ss.write(buff.data, buff.size + 1), I always observe a segfault. So I can't figure out how << can overrun the buffer without crashing. Is there a fundamental difference in how the two work, such that one triggers a segfault on an illegal buffer access and the other does not? Or is << just getting lucky?
Answer:
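On virtually every platform, uint8_t is an alias for unsigned char. When operator<< is given an unsigned char* pointer, it is treated as a null-terminated string, same as a char* pointer: it ignores your size and keeps reading until it finds a zero byte. So, if your data is not actually a null-terminated character string, writing it to the stream using operator<< reads past the end of the buffer, which is undefined behavior. The code may crash. It may write garbage to the stream. There is no way to know. In your case it presumably ran into a zero byte before reaching inaccessible memory, after copying a few stray bytes that are not valid UTF-8. That is luck, not a guarantee.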
write() doesn't care about a null terminator. It writes exactly as many bytes as you specify. That is why you don't have any trouble when using write() instead of operator<<.
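To make the difference concrete, here is a minimal sketch. The Buffer struct, the two helper functions, and kStorage are hypothetical names made up for illustration, not your actual types. The storage deliberately contains a zero byte a little past the logical end of the data, so operator<< stays inside the array and the example itself is well-defined. Without that zero byte, the operator<< call would be the undefined behavior described above. The reinterpret_cast is needed because write() on a char-based stream takes a const char*.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the question's buffer: raw bytes plus a length.
struct Buffer {
    const std::uint8_t* data;
    std::size_t size;
};

// Backing storage: 3 payload bytes, 2 stray bytes that are not valid UTF-8,
// then a 0 byte. The logical size used below is only 3.
static const std::uint8_t kStorage[] = {'a', 'b', 'c', 0xFF, 0xFE, 0x00};

// Copies exactly buff.size bytes; no null terminator is needed or looked for.
std::string copy_with_write(const Buffer& buff) {
    assert(buff.data != nullptr || buff.size == 0);
    std::ostringstream ss;
    ss.write(reinterpret_cast<const char*>(buff.data),
             static_cast<std::streamsize>(buff.size));
    return ss.str();
}

// Treats buff.data as a C string: copies up to the first 0 byte and ignores
// buff.size. Only well-defined if a 0 byte exists inside the storage.
std::string copy_with_insertion(const Buffer& buff) {
    std::ostringstream ss;
    ss << buff.data;  // picks the const unsigned char* overload
    return ss.str();
}
```

With Buffer{kStorage, 3}, copy_with_write returns the 3 bytes "abc", while copy_with_insertion returns 5 bytes: "abc" followed by 0xFF and 0xFE. Those two extra bytes are the kind of data that would make Python's UTF-8 decoding fail.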
