C++ test for UTF-8 validation

I need to write unit tests for UTF-8 validation, but I don't know how to write invalid UTF-8 test cases in C++:

TEST(validation, Tests)
{
    std::string str = "hello";
    EXPECT_TRUE(validate_utf8(str));

    // I need incorrect UTF-8 cases
}

How can I write invalid UTF-8 test cases in C++?

CodePudding user response：

You can specify individual bytes in a string literal with the \x escape sequence (hexadecimal) or the \000 escape sequence (octal).

For example:

std::string str = "\xD0";

which is an incomplete UTF-8 sequence: 0xD0 is the first byte of a two-byte sequence, but no continuation byte follows.
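
A few more cases in the same style might look like the sketch below. The helper name invalid_utf8_samples is made up for illustration, and the "..."s literal assumes C++14 or later. Note that a \x escape consumes every hex digit that follows it, so a literal has to be split whenever the next character is itself a hex digit.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the original question): byte strings
// that a conforming UTF-8 validator is expected to reject.
std::vector<std::string> invalid_utf8_samples()
{
    using namespace std::string_literals; // enables the "..."s literal (C++14)
    return {
        "\xD0"s,              // lead byte of a 2-byte sequence, nothing follows
        "\x80"s,              // continuation byte without a lead byte
        "\xE2\x82"s,          // 3-byte sequence cut off after two bytes
        "\xC0\x80"s,          // overlong encoding of U+0000
        "\xED\xA0\x80"s,      // encodes the surrogate U+D800
        "\xF4\x90\x80\x80"s,  // encodes 0x110000, above U+10FFFF
        "\xFF"s,              // byte value that never occurs in UTF-8
        "ab\xD0" "cd"s,       // literal split so 'c' is not read as a hex digit
    };
}
```

In the test from the question, each entry could then be checked with something like EXPECT_FALSE(validate_utf8(s)) inside a loop.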

Have a look at https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt for valid and malformed UTF-8 test cases.

CodePudding user response：

In UTF-8, any byte whose most significant bit is 0 is an ordinary ASCII character; any other byte is part of a multi-byte sequence (MBS).

If the second most significant bit is also 1, the byte is the first byte of an MBS; otherwise (pattern 0b10xxxxxx) it is one of the follow-up bytes.

In the first byte of an MBS, the number of leading one-bits gives the number of bytes in the entire sequence, e.g. 0b110xxxxx with arbitrary values for x is the start byte of a two-byte sequence.
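
As a rough sketch of that rule (the function name utf8_sequence_length is my own, and it only looks at the bit pattern, nothing else):

```cpp
// Hypothetical helper: total sequence length announced by a first byte,
// judged purely by its leading one-bits. Returns 0 if the byte cannot
// start a sequence of one to four bytes.
// Caveat: 0xC0, 0xC1 and 0xF5..0xF7 match a lead-byte pattern here but
// can never appear in valid UTF-8; a real validator must reject them too.
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: plain ASCII
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // 10xxxxxx (follow-up byte) or 11111xxx
}
```

For instance, utf8_sequence_length(0xD0) gives 2, which is why a lone "\xD0" is incomplete.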

Theoretically you could produce sequences of up to seven bytes this way, but current UTF-8 (RFC 3629) limits them to four bytes, which is enough to reach the highest code point, U+10FFFF.

You can now produce arbitrary code points by defining appropriate sequences, e.g. "\xc8\x85" represents the bytes 0b11001000 0b10000101, which is a legal pattern and encodes code point 0b01000 000101 (note how the leading bits forming the UTF-8 headers are cut away), corresponding to a value of 0x205 or 517. Whether that is an assigned code point you would need to look up; I just formed an arbitrary bit pattern as an example.
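
To double-check the arithmetic, here is a small sketch that strips the header bits of a two-byte sequence and joins the payloads. The name decode_two_bytes is hypothetical, and the function assumes the caller has already verified the 110xxxxx / 10xxxxxx patterns.

```cpp
#include <cstdint>

// Hypothetical helper: combines the 5 payload bits of the first byte with
// the 6 payload bits of the follow-up byte. Performs no validation itself.
constexpr std::uint32_t decode_two_bytes(std::uint8_t lead, std::uint8_t cont)
{
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}

// 0xC8 -> payload 01000, 0x85 -> payload 000101, joined: 0b01000000101
static_assert(decode_two_bytes(0xC8, 0x85) == 0x205, "\\xc8\\x85 decodes to 517");
```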

In the same way you can represent longer valid sequences by increasing the number of leading one-bits and adding the appropriate number of follow-up bytes (note again: the number of leading one-bits is the total number of bytes, including the first byte of the MBS).

Similarly, you can produce invalid sequences by making the total number of bytes in the sequence differ (too many or too few) from the number of leading one-bits.
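
One way to derive such mismatched cases from a known-good sequence is sketched below. The helper length_mismatch_variants is an illustrative name of mine, and it is meant for multi-byte input (truncating a single ASCII byte just yields an empty, still valid, string).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: given one well-formed multi-byte sequence, returns
// two malformed variants whose byte count contradicts the first byte.
std::vector<std::string> length_mismatch_variants(const std::string& valid_seq)
{
    std::string too_short = valid_seq;
    if (!too_short.empty())
        too_short.pop_back();      // first byte now announces one byte too many

    std::string too_long = valid_seq;
    too_long.push_back('\x80');    // extra follow-up byte that belongs to nothing

    return {too_short, too_long};
}
```

Calling it with "\xE2\x82\xAC" (a valid three-byte sequence) would give "\xE2\x82" and "\xE2\x82\xAC\x80", both of which a validator should reject.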

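So far you can produce arbitrary valid or invalid sequences, where the valid ones represent arbitrary code points. You might now need to look up which of these code points are actually valid; in particular, the surrogate range U+D800 to U+DFFF and anything above U+10FFFF are not, and overlong encodings (more bytes than the code point needs) are invalid as well.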
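
The code-point level checks could be sketched roughly like this. Both function names are my own, and the limits are the ones from RFC 3629 as far as I know: no surrogates, nothing above U+10FFFF, and no encoding longer than necessary.

```cpp
#include <cstdint>

// Hypothetical helper: true if cp may legally be encoded in UTF-8,
// i.e. it is at most U+10FFFF and not a surrogate (U+D800..U+DFFF).
constexpr bool is_unicode_scalar_value(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical helper: smallest code point that genuinely needs `length`
// bytes. A decoded value below this limit means the sequence is overlong.
constexpr std::uint32_t min_code_point_for_length(int length)
{
    return length == 2 ? 0x80
         : length == 3 ? 0x800
         : length == 4 ? 0x10000
         : 0;
}
```

For example, "\xC0\x80" decodes to 0, which is below min_code_point_for_length(2), so it is overlong even though its bit patterns look fine.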

Finally, you might additionally consider composed characters (with diacritics): they can be represented as a sequence of characters (not bytes!) or as a single normalised character. If you want to go that far, you'd need to look up in the standard which combinations are legal and which normalised code points they correspond to.