Normalization - Wide Character

When your program manipulates mixed SBCS/DBCS strings, it is not easy to recognize DBCS characters, as described in How To Use Both DBCS and SBCS in a String. To avoid this difficulty, you can use a technique called normalization: represent every character, including SBCS characters, in a two-byte unit. For example, prefix each SBCS character with a null byte (00H) to distinguish it from a real double-byte character. After all necessary string manipulation, you must de-normalize the results back into mixed SBCS/DBCS strings, which the system and other programs can accept.

    Example

  Input Data Stream
                         ┌─────┐
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐    │     │  ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│s│s│D 1│D 2│D 3│s│s│  ─>│     │─>│0│s│0│s│D 1│D 2│D 3│0│s│0│s│
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘    │  C  │  └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
                         │  O  │                │
                         │  N  │                V
                         │  V  │  ┌───────────────────────────┐
                         │  E  │  │      Processing Data      │
                         │  R  │  └───────────────────────────┘
  Output Data Stream     │  T  │                │
                         │  E  │                V
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐  │  R  │  ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│s│D 1│s│D 2│D 3│D 4│s│<─│     │<─│0│s│D 1│0│s│D 2│D 3│D 4│0│s│
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘  │     │  └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
                         └─────┘

      where '0' denotes a null byte, whose value is 00H.
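
The converter in the diagram might be sketched as follows. This is a minimal illustration, not the system's own converter: the function names normalize, denormalize, and is_lead_byte are hypothetical, and the lead-byte ranges are assumed to be those of Shift-JIS (code page 932). A real program would obtain the lead-byte ranges from the system rather than hard-coding them.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: DBCS lead-byte ranges of Shift-JIS (code page 932).
   A real program would query the ranges from the system instead. */
static int is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Normalize n bytes of a mixed SBCS/DBCS string into two-byte units.
   dst must have room for 2 * n bytes. Returns the number of units. */
size_t normalize(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t i = 0, units = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        if (is_lead_byte(src[i]) && i + 1 < n) {
            dst[2 * units]     = src[i];      /* DBCS: copy both bytes */
            dst[2 * units + 1] = src[i + 1];
            i += 2;
        } else {
            dst[2 * units]     = 0x00;        /* SBCS: prefix with 00H */
            dst[2 * units + 1] = src[i];
            i += 1;
        }
        units++;
    }
    return units;
}

/* De-normalize: drop the 00H prefix from each SBCS unit.
   dst must have room for 2 * units bytes. Returns the bytes written. */
size_t denormalize(const unsigned char *src, size_t units, unsigned char *dst)
{
    size_t u, out = 0;

    assert(src != NULL && dst != NULL);
    for (u = 0; u < units; u++) {
        if (src[2 * u] != 0x00)
            dst[out++] = src[2 * u];          /* DBCS lead byte */
        dst[out++] = src[2 * u + 1];          /* SBCS byte or DBCS trail byte */
    }
    return out;
}
```

With the input stream of the diagram (two SBCS, three DBCS, two SBCS characters, 10 bytes in all), normalize returns 7 units occupying 14 bytes, and denormalize turns them back into the original 10 bytes. Once the data is normalized, the n-th character is simply the n-th two-byte unit, so no scan for lead bytes is needed during processing.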

Note that predicting the final byte length (that is, the length after de-normalization) during manipulation is somewhat complex, because each two-byte unit de-normalizes to either one byte or two.
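
One way to obtain the final byte length is to count it from the normalized data before de-normalizing. The sketch below assumes the layout shown in the diagram, where an SBCS character is a unit whose first byte is 00H; the function name denormalized_length is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte length that `units` normalized two-byte units will
   occupy after de-normalization. Assumes an SBCS unit starts with 00H. */
size_t denormalized_length(const unsigned char *norm, size_t units)
{
    size_t u, bytes = 0;

    assert(norm != NULL);
    for (u = 0; u < units; u++)
        bytes += (norm[2 * u] == 0x00) ? 1 : 2;   /* SBCS: 1, DBCS: 2 */
    return bytes;
}
```

For the processed data in the diagram (three SBCS and four DBCS units), the result is 3 + 8 = 11 bytes, which matches the output data stream. A program that must respect a field-length limit can call such a function after each manipulation step.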

This technique is available in the IBM C language, which supports a wide-character data type.
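
In standard C, the wide-character type is wchar_t, and the library functions mbstowcs and wcstombs perform the conversion in each direction. The following sketch shows the normalize, process, de-normalize cycle with those functions. The function name replace_char and the MAX_CHARS limit are hypothetical. Note also that the internal value of a wchar_t is implementation-defined and need not match the 00H-prefixed layout of the diagram, and that DBCS input is recognized only if the program has selected a suitable locale with setlocale.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CHARS 256   /* hypothetical limit on characters per string */

/* Copy the multibyte string `in` to `out` (capacity `cap` bytes, including
   the terminator), replacing each wide character `from` with `to`.
   Returns 0 on success, -1 if a conversion fails or the result does not
   fit; on failure the contents of `out` are unspecified. */
int replace_char(const char *in, wchar_t from, wchar_t to,
                 char *out, size_t cap)
{
    wchar_t wide[MAX_CHARS];
    size_t n, i, bytes;

    assert(in != NULL && out != NULL);

    n = mbstowcs(wide, in, MAX_CHARS);        /* normalize */
    if (n == (size_t)-1 || n >= MAX_CHARS)
        return -1;                            /* invalid input or too long */

    for (i = 0; i < n; i++)                   /* process one unit per character */
        if (wide[i] == from)
            wide[i] = to;

    bytes = wcstombs(out, wide, cap);         /* de-normalize */
    if (bytes == (size_t)-1 || bytes >= cap)
        return -1;                            /* unconvertible or no room */
    return 0;
}
```

For example, replace_char("a-b-c", L'-', L'+', out, sizeof out) stores "a+b+c" in out. The loop never has to test for lead bytes, because every character occupies exactly one wchar_t element.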
