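People who first learn programming through C/C++ tend to assume, on x86 / ARM, that a char is 1 byte. In higher-level languages, however, char is not the same thing as the C char: each language has its own width and conversion rules. I picked a few languages I write often (Java, C#, Golang, and Python 3). How does each of them handle character encoding?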

Java

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities.

The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Character.html#unicode

A char is always 2 bytes, and String represents its text as UTF-16.
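The surrogate-pair rule quoted above is plain arithmetic, so it can be checked by hand. The following is a minimal sketch in Python rather than Java; the function name to_surrogates is hypothetical and only illustrates the standard UTF-16 calculation.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogates(ord("😀"))   # U+1F600
print(f"{high:04X} {low:04X}")         # D83D DE00
```

Both values fall inside the ranges given in the quote, which is why one emoji occupies two char slots.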

String.length()

The length is equal to the number of Unicode code units in the string.

One code unit corresponds to one char, so an emoji outside the Basic Multilingual Plane, such as 😀, counts as 2 chars.

To count Unicode characters (code points) instead, use String.codePoints().count().
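The gap between the two counts can be reproduced without a JVM. This is a rough Python sketch, assuming that encoding to UTF-16 and halving the byte length gives the same number that String.length() would report; utf16_code_units is a hypothetical helper name.

```python
def utf16_code_units(text):
    """Count 16-bit UTF-16 code units (2 bytes each, no byte-order mark)."""
    return len(text.encode("utf-16-le")) // 2


s = "😀一2三"
print(utf16_code_units(s))  # 5: the emoji takes two code units
print(len(s))               # 4: Python's len() counts code points
```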

String.charAt()

If the char value specified by the index is a surrogate, the surrogate value is returned.

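This means that charAt treats a supplementary character as 2 separate chars. In other words, you get back a lone surrogate, which can lead to unexpected behavior.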

var s = "😀"; // one code point, stored as two chars
for (int i = 0; i < s.length(); i++) {
  // Each iteration prints a lone surrogate, not the emoji.
  System.out.println(s.charAt(i));
}

To get individual Unicode characters, use String.codePoints().toArray() and index into the resulting int[].
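What a code-point view has to do with the underlying code units can be sketched as follows. This is an illustrative Python version, not Java's actual implementation; code_points is a hypothetical function, and passing unpaired surrogates through unchanged is a design choice made for this sketch.

```python
def code_points(units):
    """Combine UTF-16 code units into code points, pairing up surrogates."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            result.append(unit)  # BMP character or unpaired surrogate
            i += 1
    return result


units = [0xD83D, 0xDE00, 0x4E00]  # "😀一" as UTF-16 code units
print([hex(cp) for cp in code_points(units)])  # ['0x1f600', '0x4e00']
```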

C#

https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/char

C# behaves the same as Java: char is fixed at 2 bytes and String is stored as UTF-16. String.Length and the String[] indexer also behave like their Java counterparts.

Golang

Golang has no char type. byte is an alias for uint8, rune is an alias for int32, and a string is essentially a read-only []byte.

Range loops

Iterating over a string with range decodes the UTF-8 and yields one rune per iteration, but the index is the byte offset at which that rune starts:

const s = "😀一2三"
for i, v := range s {
    fmt.Printf("%d: %#U\n", i, v)
}

Output:

0: U+1F600 '😀'
4: U+4E00 '一'
7: U+0032 '2'
8: U+4E09 '三'
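The byte offsets above follow directly from the UTF-8 length of each character. As a rough cross-check, this Python sketch recomputes them; rune_offsets is a hypothetical helper, not a Go or Python library function.

```python
def rune_offsets(text):
    """Yield (byte_offset, character) pairs for the UTF-8 encoding of text."""
    offset = 0
    for ch in text:
        yield offset, ch
        offset += len(ch.encode("utf-8"))  # 1 to 4 bytes per code point


for offset, ch in rune_offsets("😀一2三"):
    print(f"{offset}: U+{ord(ch):04X} '{ch}'")
```

The emoji takes 4 bytes, each CJK character takes 3, and the ASCII digit takes 1, which gives the indices 0, 4, 7 and 8.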

Python 3

Python has no char type, but ord() and chr() convert between a one-character string and an int.

Given a string representing one Unicode character, return an integer representing the Unicode code point of that character

https://docs.python.org/3/library/functions.html#ord

The int is a Unicode code point, so the values returned by ord() range from 0 to 0x10FFFF (1,114,111).
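A short sketch of the round trip and of the upper bound; MAX_CODE_POINT is a name introduced here for illustration only.

```python
MAX_CODE_POINT = 0x10FFFF  # highest valid Unicode code point

print(hex(ord("😀")))                  # 0x1f600
print(chr(0x4E00))                     # 一
print(len("😀"))                       # 1: a str is a sequence of code points
print(hex(ord(chr(MAX_CODE_POINT))))   # 0x10ffff

try:
    chr(MAX_CODE_POINT + 1)
except ValueError as exc:
    print("rejected:", exc)
```

Unlike Java and C#, a supplementary character is a single element of the string here, so no surrogate handling is needed at this level.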

Summary

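Each language defines a character (char) differently. Keep this in mind whenever you handle characters and strings, so that you avoid injection vulnerabilities such as Unicode smuggling.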