Character Encoding in Programming Languages

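People who first learn programming through C/C++ tend to assume, on x86 and ARM, that a char is 1 byte. The char of higher-level languages is not the same as C's, though: each language gives char its own size and conversion rules. This post picks four languages I write often, Java, C#, Golang, and Python 3, and looks at how each one handles character encoding.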

Java

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities.

The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/lang/Character.html#unicode

A char is always 2 bytes, and a String is stored using the UTF-16 encoding.

`String.length()`

The length is equal to the number of Unicode code units in the string.

One code unit corresponds to one char, so a supplementary character such as the emoji 😀 (U+1F600) counts as 2 chars.

To count Unicode characters (code points) instead, use String.codePoints().count().

`String.charAt()`

If the char value specified by the index is a surrogate, the surrogate value is returned.

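This means that charAt sees a supplementary character as 2 separate chars. In other words, you get back a lone surrogate rather than the character itself, which can cause unexpected behavior.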

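var s = "😀"; // U+1F600, stored as the surrogate pair \uD83D \uDE00
for (int i = 0; i < s.length(); i++) {
  // Each iteration prints one lone surrogate, not the emoji
  System.out.println(s.charAt(i));
}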

To get a single Unicode character, index into the array returned by String.codePoints().toArray().

C#

https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/char

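C# behaves the same way as Java: char is always 2 bytes, string is stored as UTF-16, and String.Length and the string indexer (s[i]) behave like Java's String.length() and String.charAt().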

Golang

Golang has no char type. byte is an alias for uint8, rune is an alias for int32, and a string is a read-only sequence of bytes, much like an immutable []byte.

Range loops

Iterating over a string with range decodes it as UTF-8 and yields one rune per iteration, but the index is the byte index at which each rune starts:

const s = "😀一2三"
// i is the byte offset of each rune, so it jumps by the rune's UTF-8 length
for i, v := range s {
    fmt.Printf("%d: %#U\n", i, v)
}

0: U+1F600 '😀'
4: U+4E00 '一'
7: U+0032 '2'
8: U+4E09 '三'

Python 3

Python has no char type, but ord() and chr() convert between a one-character string and an int.

Given a string representing one Unicode character, return an integer representing the Unicode code point of that character

https://docs.python.org/3/library/functions.html#ord

The resulting int is a Unicode code point, so the values ord() can return range from 0 to 0x10FFFF (1,114,111).
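As a rough illustration, the sketch below round-trips a supplementary character through ord() and chr(), then probes the upper bound of what chr() accepts. The helper name is_valid_code_point is made up for this example and is not part of the standard library.

```python
import sys

# Round-trip a supplementary character through ord() and chr().
ch = "😀"
code_point = ord(ch)
print(hex(code_point))        # 0x1f600
print(chr(code_point) == ch)  # True

# The largest code point Python accepts is U+10FFFF.
max_code_point = sys.maxunicode
print(hex(max_code_point))    # 0x10ffff


def is_valid_code_point(n: int) -> bool:
    """Hypothetical helper: True if chr(n) accepts n."""
    try:
        chr(n)
    except (ValueError, OverflowError):
        return False
    return True


print(is_valid_code_point(0x10FFFF))  # True
print(is_valid_code_point(0x110000))  # False
```

Note that a single str element in Python is a whole code point, so no surrogate pair is involved here.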

Summary

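Each language defines a character (char) differently. Keep this in mind whenever you handle characters and strings, so that you avoid injection vulnerabilities such as Unicode smuggling.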
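To make the difference concrete, the following sketch measures one string in three units: code points (what Python's len() counts), UTF-16 code units (what Java's String.length() and C#'s String.Length count), and UTF-8 bytes (what Golang's len() counts on a string). It runs in Python only and uses str.encode() to emulate the other languages, so treat it as an approximation rather than output from those runtimes. The function name count_units is invented for this example.

```python
def count_units(s: str) -> dict:
    """Measure s in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        # Python str is a sequence of code points.
        "code_points": len(s),
        # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no BOM.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # UTF-8 uses 1 to 4 bytes per code point.
        "utf8_bytes": len(s.encode("utf-8")),
    }


print(count_units("😀一2三"))
# {'code_points': 4, 'utf16_code_units': 5, 'utf8_bytes': 11}
```

The same four visible characters yield three different lengths, which is why length checks and index arithmetic need to state which unit they mean.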
