How To Use Both DBCS and SBCS in a String

In host codes, the shift-out/shift-in control characters (SO/SI) are used as delimiters to distinguish DBCS characters from SBCS characters when both appear in the same string.

               ┌─ Control Character ──┐
               │                      │
               V                      V
┌───────────┬────┬─────────────────┬────┬───────────┐
│ SBCS Data │ SO │    DBCS Data    │ SI │ SBCS Data │
└───────────┴────┴─────────────────┴────┴───────────┘

SO (0EH): Shift out from SBCS (Start of DBCS characters)
SI (0FH): Shift in to SBCS (End of DBCS characters)

Note: Some host products support pure DBCS strings or fields, which contain DBCS data without SO/SI.
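
As a minimal sketch of the SO/SI convention described above, the following Python function splits a mixed host string into SBCS and DBCS runs. The name split_host_string and the (kind, bytes) tuple format are illustrative assumptions, not part of any host product's interface. The sketch also assumes that DBCS data always occupies an even number of bytes between SO and SI.

```python
SO = 0x0E  # shift out: start of DBCS data
SI = 0x0F  # shift in: end of DBCS data

def split_host_string(data: bytes):
    """Split a mixed host string into ("SBCS", run) and ("DBCS", run) tuples.

    Hypothetical helper for illustration only.
    Raises ValueError if SO and SI are not properly paired.
    """
    runs = []
    pos = 0
    n = len(data)
    while pos < n:
        if data[pos] == SO:
            start = pos + 1
            pos = start
            # Consume DBCS data two bytes at a time until SI is reached.
            while pos < n and data[pos] != SI:
                pos += 2
            if pos >= n:
                raise ValueError("SO without matching SI")
            runs.append(("DBCS", data[start:pos]))
            pos += 1  # skip the SI byte
        elif data[pos] == SI:
            raise ValueError("SI without preceding SO")
        else:
            start = pos
            while pos < n and data[pos] not in (SO, SI):
                pos += 1
            runs.append(("SBCS", data[start:pos]))
    return runs
```

The control characters themselves are dropped from the result, so a pure DBCS field, as mentioned in the note above, corresponds to the contents of a single DBCS run.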

PC code pages, however, need no delimiters such as the SO/SI control characters, because the code ranges of the first bytes of DBCS characters do not overlap the SBCS code points. The table "Code Ranges of DBCS First Bytes" below lists these ranges for each language.

┌──────────────────────────────┬────────────────┬────────────────┐
│ Language                     │ Code Ranges of │ Code Ranges of │
│       (Combined Code page)   │ DBCS 1st Byte  │ DBCS 2nd Byte  │
├──────────────────────────────┼────────────────┼────────────────┤
│ Japanese    (932/942)        │  81-9F, E0-FC  │  40-7E, 80-FC  │
│ Korean      (949)            │  8F-FE         │  A1-FE         │
│ S. Chinese  (1381)           │  8C-FE         │  A1-FE         │
│ T. Chinese  (938/948)        │  81-FC         │  40-7E, 80-FC  │
│ T. Chinese  (950)            │  81-FE         │  40-7E, 81-FE  │
└──────────────────────────────┴────────────────┴────────────────┘
Code Ranges of DBCS First Bytes

Note: Once a byte is identified as the first byte of a DBCS character by testing it against the ranges above, it must be handled together with the byte that follows it as one DBCS character.
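
The first-byte test can be sketched as follows. The ranges are copied from the table "Code Ranges of DBCS First Bytes"; the names DBCS_FIRST_BYTE_RANGES, is_dbcs_first_byte, and split_characters are hypothetical and chosen only for this illustration. A real program would normally obtain the ranges for the current code page from the operating system rather than hard-coding them.

```python
# First-byte ranges copied from the table "Code Ranges of DBCS First Bytes".
# Keys are code page numbers; 932 also stands for 942, and 948 for 938.
DBCS_FIRST_BYTE_RANGES = {
    932: [(0x81, 0x9F), (0xE0, 0xFC)],  # Japanese
    949: [(0x8F, 0xFE)],                # Korean
    1381: [(0x8C, 0xFE)],               # S. Chinese
    948: [(0x81, 0xFC)],                # T. Chinese
    950: [(0x81, 0xFE)],                # T. Chinese
}

def is_dbcs_first_byte(byte: int, code_page: int) -> bool:
    """Return True if `byte` falls in a DBCS first-byte range of `code_page`."""
    return any(lo <= byte <= hi for lo, hi in DBCS_FIRST_BYTE_RANGES[code_page])

def split_characters(data: bytes, code_page: int):
    """Split a mixed string into characters of one byte (SBCS) or two bytes (DBCS).

    A first byte at the very end of the data has no partner; it is returned
    as a one-byte item so that the caller can detect the damaged character.
    """
    chars = []
    pos = 0
    while pos < len(data):
        width = 2 if is_dbcs_first_byte(data[pos], code_page) else 1
        chars.append(data[pos:pos + width])
        pos += width
    return chars
```

Because the scan always starts at the beginning of the string and advances by whole characters, a second byte is never mistaken for a first byte, even when its value also lies in a first-byte range.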

Note: It is important to preserve the integrity of DBCS characters, because the system interprets DBCS data by the same logic. For example, suppose a string of DBCS characters is displayed in a DBCS environment and the second byte of one character is deleted. The system then treats the first byte of the next character as the second byte of the damaged one, and every character after that point may be misread. The end user sees not a string with one character missing but a different or meaningless string. The following figure shows how this happens.

       ┌─── Delete byte 4.
       │
       V
┌───┬───┬───┬───┐
│1│2│3│4│5│6│7│8│       Bytes (1), (3), (5), and (7) are DBCS
└───┴───┴───┴───┘       first bytes, and the pairs (1,2) (3,4)
       X                (5,6) (7,8) form a DBCS string.
       │
       V

┌───┬───┬───┬─┐
│1│2│3│5│6│7│8│         After byte (4) is deleted, if byte (6)
└───┴───┴───┴─┘         is in the DBCS first-byte code range
                        and byte (8) is not, the system displays
                        (1,2) (3,5) (6,7) as DBCS characters
      or                and (8) as an SBCS character.

┌───┬───┬───┬───┐       If byte (8) is also in the DBCS
│1│2│3│5│6│7│8│?│       first-byte code range, a system
└───┴───┴───┴───┘       error may occur.
             XXX

An error case in DBCS handling
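
The first case in the figure can be reproduced with a short sketch. It assumes code page 932 and an arbitrary eight-byte string chosen so that byte (6) lies in the first-byte range and byte (8) does not. The helpers split_characters and delete_character are hypothetical names; delete_character shows one way to keep characters intact by always removing both bytes together.

```python
# First-byte ranges for code page 932, from "Code Ranges of DBCS First Bytes".
FIRST_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]

def is_first_byte(byte):
    return any(lo <= byte <= hi for lo, hi in FIRST_BYTE_RANGES)

def split_characters(data):
    """Group bytes into characters: two bytes after a DBCS first byte, else one."""
    chars, pos = [], 0
    while pos < len(data):
        width = 2 if is_first_byte(data[pos]) else 1
        chars.append(data[pos:pos + width])
        pos += width
    return chars

def delete_character(data, index):
    """Delete the whole character at character position `index` (0-based)."""
    chars = split_characters(data)
    del chars[index]
    return b"".join(chars)

# Bytes (1) to (8) of the figure; the values are assumptions for illustration.
original = bytes([0x88, 0x9F, 0x89, 0x40, 0x8A, 0x8B, 0x8C, 0x41])

broken = original[:3] + original[4:]   # byte (4) deleted on its own
safe = delete_character(original, 1)   # bytes (3,4) deleted together

print([c.hex() for c in split_characters(original)])
print([c.hex() for c in split_characters(broken)])
print([c.hex() for c in split_characters(safe)])
```

In the broken string, every character after the deletion point is paired differently from the original, while the safe deletion leaves the remaining three characters unchanged.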

Suppose your program needs to search for a character such as '%' (25H), whose code point lies outside the current DBCS second-byte code ranges. You could simply search for 25H without checking whether each scanned byte is an SBCS character or part of a DBCS character. This logic works with the current code pages but may fail in the future if the second-byte ranges change, so avoid this kind of optimization.
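
A search that does not depend on the second-byte ranges can be sketched as follows, again assuming code page 932. The function name find_sbcs_byte is hypothetical. It walks the data character by character and skips both bytes of every DBCS character, so a matching value inside a DBCS character is never reported.

```python
# First-byte ranges for code page 932, from "Code Ranges of DBCS First Bytes".
FIRST_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]

def find_sbcs_byte(data: bytes, target: int) -> int:
    """Return the offset of the first SBCS byte equal to `target`, or -1."""
    pos = 0
    while pos < len(data):
        byte = data[pos]
        if any(lo <= byte <= hi for lo, hi in FIRST_BYTE_RANGES):
            pos += 2          # skip the whole DBCS character
        elif byte == target:
            return pos
        else:
            pos += 1
    return -1

# One DBCS character whose second byte is 41H, then SBCS 'A' and '%'.
sample = bytes([0x88, 0x41, 0x41, 0x25])
naive_offset = sample.find(b"A")            # matches inside the DBCS character
aware_offset = find_sbcs_byte(sample, 0x41) # matches the real SBCS 'A'
```

For '%' (25H) both approaches give the same answer today, but only the character-by-character scan stays correct for values such as 41H that can also occur as a second byte.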

