Why is the printed string shorter when a locale-specific character is used?

I wrote the following code. I'm trying to print a string of specified length with a non-ASCII character.

#include <locale.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    setlocale(LC_ALL,"pl_PL");
    printf("%-10sx\n","ą");
    printf("%-10sx\n","a");
    return 0;
}

The output is as follows:

ą        x
a         x

There is one (white)space less when a non-ASCII character is used. Why does it behave like this?

CodePudding user response：

Why is the printed string shorter when a locale-specific character is used?

Because printf measures the field width of %s in bytes, and the number of bytes in a multi-byte string is not equal to the number of columns needed to display it.

Why does it behave like this?

The string "ą" takes 2 bytes in UTF-8 (plus 1 byte for the terminating null character, which printf does not count), but is displayed in 1 column. printf pads to 10 bytes, so there will be 10 - 2 = 8 spaces of padding.

The string "a" has a length of 1 byte, so there will be 10 - 1 = 9 spaces of padding.
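The byte-based padding can be checked without a terminal. The following is a minimal sketch, not part of the original question: padding_for_width_10 is a hypothetical helper name, and "\xc4\x85" is assumed to be the UTF-8 encoding of "ą" (which it is when the source file is saved as UTF-8).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: number of padding spaces printf emits for "%-10s".
 * The result depends only on strlen(s), i.e. on bytes, not on columns. */
int padding_for_width_10(const char *s)
{
    char buf[64];
    int n = snprintf(buf, sizeof buf, "%-10s", s);
    assert(n >= 0 && (size_t)n < sizeof buf);   /* sketch: short inputs only */
    return n - (int)strlen(s);                  /* total bytes minus the string's bytes */
}
```

Calling padding_for_width_10("\xc4\x85") gives 8 and padding_for_width_10("a") gives 9, which matches the output in the question.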

Is there any way to overcome this issue without manually changing the desired length when the string contains a non-ASCII character?

Use a library that has a database of mappings between characters and their display width for the encoding you are using. Iterate over the string and get the number of columns needed to display it. Then widen the printf field by the difference between the string's length in bytes and its width in columns. Overall, getting the displayed width of characters is a non-trivial task with many problems and edge cases.
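The width adjustment itself is plain arithmetic: the field width handed to printf is the wanted number of columns plus the bytes that do not show up as columns. A minimal sketch, assuming the display width has already been obtained from some library; field_width_for is a hypothetical name, and the column counts passed in are the caller's responsibility.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: value to pass as the `*` in printf("%-*s", ...) so
 * that s ends up occupying `columns` terminal columns.
 * display_cols is the on-screen width of s, obtained elsewhere. */
int field_width_for(int columns, const char *s, int display_cols)
{
    assert(s != NULL && display_cols >= 0);
    /* printf counts bytes, so add the bytes that take up no column */
    return columns + (int)strlen(s) - display_cols;
}
```

For "ą" in UTF-8 (2 bytes, 1 column) and 10 columns this gives 11. For a 3-byte character displayed in 2 columns it also gives 11, and for plain "a" it stays at 10.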

Dive into the Unicode world with https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ . Read about normal, multi-byte and wide strings and about char8_t, char16_t and char32_t strings - https://en.cppreference.com/w/c/string/wide and https://en.cppreference.com/w/c/string/multibyte .

On Linux, your locale most probably uses UTF-8 by default, your terminal most probably uses UTF-8, and your compiler uses UTF-8 encoding for string literals (these are all separate properties and can differ from one another). On Linux, you can convert the string to a wide string (which is hard on its own) and use wcswidth to get the number of columns. There are also libraries: libunistring has the u8_width function, and ICU has u_countChar32 and similar.
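The wide-string route could look roughly like this. It is a sketch under assumptions: display_columns is a hypothetical name, the caller has already selected a suitable locale with setlocale, and wcswidth is available (it is an XSI function, hence the feature-test macro).

```c
#define _XOPEN_SOURCE 700   /* exposes wcswidth in <wchar.h> on glibc */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: display columns of a multibyte string in the current
 * LC_CTYPE locale. Returns -1 if the string cannot be decoded, memory runs
 * out, or the string contains a non-printable character. */
int display_columns(const char *s)
{
    assert(s != NULL);
    /* A multibyte string never has more characters than bytes. */
    size_t cap = strlen(s) + 1;
    wchar_t *ws = malloc(cap * sizeof *ws);
    if (ws == NULL)
        return -1;

    size_t len = mbstowcs(ws, s, cap);   /* number of wide characters written */
    if (len == (size_t)-1) {             /* invalid sequence for this locale */
        free(ws);
        return -1;
    }

    int cols = wcswidth(ws, len);        /* -1 if any character is non-printable */
    free(ws);
    return cols;
}
```

After something like setlocale(LC_ALL, "pl_PL.UTF-8") succeeds, the result can be combined with strlen to compute the printf field width. In a locale that cannot decode the bytes, the function reports -1 instead of guessing.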

I could see something along these lines:

#define _XOPEN_SOURCE   // for wcwidth on Linux
#include <wchar.h>
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <limits.h>
#include <stdint.h>
#include <locale.h>

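size_t mbswidth(const char *s, size_t n) { // similar to wcswidth
   mbstate_t ps;
   memset(&ps, 0, sizeof(ps));
   size_t ret = 0;
   while (n != 0 && *s != 0) {
      wchar_t wc;
      const size_t rr = mbrtowc(&wc, s, n, &ps);
      if (rr == (size_t)-1 || rr == (size_t)-2) {
          return rr; // invalid or incomplete multibyte sequence
      }
      assert(rr != 0); // the loop condition already stops at the terminating null
      n -= rr; // rr is the number of bytes consumed by this character
      s += rr;
      const int w = wcwidth(wc); // -1 for non-printable characters
      if (w > 0) {
         ret += (size_t)w;
      }
   }
   return ret;
}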

int main() {
   setlocale(LC_ALL, "pl_PL.UTF-8"); // see https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html
   const char *s = "ą";  // I think I would `= u8"ą";` on newer compilers
   // printf counts bytes, so widen the field by (bytes - columns)
   printf("%-*sx\n", 10 + (int)strlen(s) - (int)mbswidth(s, SIZE_MAX), s);
   printf("%-*sx\n", 10, "a");
}