- The Go blog explains how slices work in Go
- This post is more or less a summary of how strings work in Go
- Using strings efficiently requires understanding not only how they work, but also:
- the difference between a byte, a character and a rune
- the difference between Unicode and UTF-8
- the difference between a string and a string literal
- and other even more subtle distinctions
"When I index a Go string at position n, why don't I get the nth character?" As you'll see, this question leads us to many details about how text works in the modern world.
An excellent introduction to some of these issues, independent of Go, is Joel Spolsky's famous blog post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
What is a string?
- In Go, a string is in effect a read-only slice of bytes.
- It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.
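To make the "arbitrary bytes" point concrete, here is a minimal sketch. The sample value and the helper name `byteValues` are chosen purely for illustration; the string deliberately mixes bytes that are not valid UTF-8 with the three-byte UTF-8 encoding of ⌘.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteValues is a hypothetical helper: it returns the hex value of every
// byte in s, obtained by plain indexing.
func byteValues(s string) []string {
	out := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, fmt.Sprintf("%x", s[i]))
	}
	return out
}

func main() {
	// Illustrative sample: \xbd, \xb2 and \xbc are not valid UTF-8 on
	// their own, while \xe2\x8c\x98 is the UTF-8 encoding of U+2318.
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	fmt.Println(len(sample))              // length in bytes, not characters
	fmt.Println(byteValues(sample))       // indexing yields raw bytes
	fmt.Println(utf8.ValidString(sample)) // false: the string is not valid UTF-8
	fmt.Printf("%q\n", sample)            // escapes the bytes that are not valid UTF-8
}
```

Note that `len` and indexing both work on bytes, and the string itself does not care whether those bytes form valid text.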
Code Points, Characters and Runes
- As mentioned above, a string holds bytes.
- The idea of a $Character$ is a little hard to define; in computing the term is ambiguous, or at least confusing. The Unicode standard instead uses the term $Code Point$ to refer to the item represented by a single value. For example, the code point U+2318, with the hexadecimal value 2318, represents the symbol ⌘.
- $Code Point$ is a bit of a mouthful, so Go introduces a shorter term for the concept: $rune$. It means exactly the same as $Code Point$, with one interesting addition. The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The expression '⌘' has type rune and integer value 0x2318.
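The relationship between rune, int32 and a code point can be checked with a small sketch. The helper name `describe` is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// describe is a hypothetical helper: it reports the Go type, the Unicode
// code point notation, and the decimal value of a rune.
func describe(r rune) string {
	return fmt.Sprintf("%T %U %d", r, r, r)
}

func main() {
	var r rune = '⌘' // a rune constant

	// rune is an alias for int32, so no conversion is needed here.
	var i int32 = r

	fmt.Println(describe(r))    // int32 U+2318 8984
	fmt.Println(i == 0x2318)    // true
	fmt.Println(len(string(r))) // 3: one code point, three bytes in UTF-8
}
```

The last line also hints at the difference between Unicode and UTF-8: the code point is a single number, while its UTF-8 encoding occupies three bytes.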
Range Loops
- With a regular $for$ loop that indexes the string, we get the individual bytes of the string.
- A $for\ range$ loop, by contrast, decodes one UTF-8-encoded $rune$ on each iteration. Each time around the loop, the index is the starting position of the current rune, measured in bytes, and the $Code\ Point$ is its value.
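The two loop styles can be compared side by side in a short sketch. The sample string and the helper name `runeStarts` are illustrative assumptions, not part of the original post.

```go
package main

import "fmt"

// runeStarts is a hypothetical helper: it returns the byte offset at which
// each rune of s begins, as reported by a for range loop.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	const s = "a⌘b" // 'a' and 'b' take one byte each, '⌘' takes three

	// Regular for loop: one byte per iteration.
	for i := 0; i < len(s); i++ {
		fmt.Printf("%d:%x ", i, s[i])
	}
	fmt.Println() // 0:61 1:e2 2:8c 3:98 4:62

	// for range loop: one decoded rune per iteration.
	for i, r := range s {
		fmt.Printf("%d:%U ", i, r)
	}
	fmt.Println() // 0:U+0061 1:U+2318 4:U+0062

	fmt.Println(runeStarts(s)) // [0 1 4]
}
```

The index jumps from 1 to 4 in the range loop because ⌘ occupies bytes 1 through 3, which is also why indexing the string at position n does not generally give the nth character.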