┌──────────────────────────────────────────────────────────────────────┐
│One or more bytes are used to store a character.                      │
└──────────────────────────────────────────────────────────────────────┘
In the C language, the char data type is, in reality, a byte. One char is not always equivalent to one character: in a DBCS environment, two char objects are used to store one DBCS character. C calls a character stored this way, in one or more bytes, a multibyte character.
Although C provides many standard functions for handling character strings, functions such as strncpy() and strchr() make no special provision for multibyte characters. For example, strncpy() can bisect a DBCS character, and strchr() can mistake either byte of a DBCS character for the target SBCS character.
If application programmers wanted to use these functions on multibyte text, they would have to add special logic for handling multibyte characters. That logic would differ from one encoding scheme to another, which would prevent the application from being code page independent.
To avoid this problem, an application should generally process text in a wide character (wchar_t) form.
┌──────────────────────────────────────────────────────────────────────┐
│Type wchar_t is an integral type defined in the <stddef.h> header.    │
│It is capable of representing any member of the largest character     │
│set specified among the supported locales.                            │
└──────────────────────────────────────────────────────────────────────┘
The wchar_t data type can store every character of the supported character sets. Whether a character occupies one byte or more in multibyte format is transparent to the application. The internal representation of the wchar_t data type is implementation-defined, and relying on a particular representation decreases the program's portability. For instance, if an application assumes that a wide character has the same integral value as the corresponding character code in its code page, the program may not work in an environment whose wide character representation is Unicode.
The following figure shows how a string of four characters, two SBCS characters (s) followed by two DBCS characters (D1 and D2), is represented in multibyte and wide character formats. In multibyte format the characters take up six bytes, and each byte may be treated as an individual component. Wide character format is more intuitive: four wide characters represent the four characters.
A string represented in multibyte format and wide character format
multibyte:
  1   2   3   4   5   6   ... bytes
┌───┬───┬───┬───┬───┬───┐
│ s │ s │ D │ 1 │ D │ 2 │
└───┴───┴───┴───┴───┴───┘
wchar_t:
  1    2    3    4    ... units (=wchar_t)
┌────┬────┬────┬────┐
│ s  │ s  │ D1 │ D2 │
└────┴────┴────┴────┘