Tuesday, April 16, 2019

String in GoLang

REFERENCE @ Blog.GoLang.Org

  • An earlier post on the Go blog explains how slices work in Go; the strings post builds on that background
  • This post is more or less a summary of how strings work in Go
  • Using strings well requires understanding not only how they work, but also
    1. the difference between a byte, a character and a rune
    2. the difference between Unicode and UTF-8
    3. the difference between a string and a string literal
    4. and other even more subtle distinctions
"When I index a Go string at position n, why don't I get the nth character?" As you'll see, this question leads us to many details about how text works in the modern world.
An excellent introduction to some of these issues, independent of Go, is Joel Spolsky's famous blog post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

What is a string?

  1. In Go, a string is in effect a read-only slice of bytes.
  2. It's important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.
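
The two points above can be checked with a short program. This is only a minimal sketch: the helper name byteValues and the sample bytes are chosen for illustration, and the sample deliberately contains bytes that are not valid UTF-8.

```go
package main

import "fmt"

// byteValues (a hypothetical helper) returns the value found at each
// index of s. Indexing a string yields a byte, not a character.
func byteValues(s string) []byte {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, s[i])
	}
	return out
}

func main() {
	// \x escapes let a string literal hold arbitrary bytes,
	// including sequences that are not valid UTF-8.
	const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

	fmt.Println(len(sample))    // len counts bytes: 8
	fmt.Printf("% x\n", sample) // bd b2 3d bc 20 e2 8c 98

	// The single symbol ⌘ occupies three bytes when encoded as UTF-8.
	fmt.Println(byteValues("⌘")) // [226 140 152]

	// Strings are read-only: an assignment such as sample[0] = 'x'
	// is rejected by the compiler.
}
```

Indexing "⌘" at position 0 gives the byte 0xe2 rather than the symbol itself, which is the short answer to the question quoted at the top of the post.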

Code Points, Characters and Runes

  1. As mentioned above, a string holds bytes.
  2. The idea of a character is a little hard to define; in computing the term is ambiguous, or at least confusing. The Unicode standard uses the term code point to refer to the item represented by a single value. The code point U+2318, with the hexadecimal value 2318, represents the symbol ⌘.
  3. Code point is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. It means exactly the same as code point, with one interesting addition. The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type of the expression '⌘' is rune, and its integer value is 0x2318.
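
A small sketch may help make the rune type concrete. The helper describeRune is a name invented here for illustration; the program only relies on fmt's %T, %x and %U verbs and on utf8.RuneLen from the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describeRune (a hypothetical helper) formats a rune in the usual
// Unicode notation, for example U+2318.
func describeRune(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	r := '⌘' // a rune constant; r has type rune

	// No conversion is needed here because rune is an alias for int32.
	var i int32 = r

	fmt.Printf("%T %#x %s\n", r, i, describeRune(r)) // int32 0x2318 U+2318

	// The one code point needs three bytes in UTF-8.
	fmt.Println(utf8.RuneLen(r)) // 3
}
```

Note that %T prints int32 rather than rune, since the two names denote the same type.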

Range Loops

  1. With a regular for loop that indexes the string, we get the bytes of the string, one per iteration.
  2. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the loop index is the starting position of the current rune, measured in bytes, and the code point is its value.
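
The difference between the two loops can be sketched as follows. The string "a⌘b" and the helper runeStarts are assumptions made for this example, not taken from the referenced post.

```go
package main

import "fmt"

// runeStarts (a hypothetical helper) collects the index that a
// for range loop reports for each rune of s: the byte offset at
// which that rune begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	const s = "a⌘b" // 1 byte + 3 bytes + 1 byte

	// Regular for loop: one iteration per byte, five in total.
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d: %x\n", i, s[i])
	}

	// for range loop: one iteration per decoded rune, three in total.
	for i, r := range s {
		fmt.Printf("%U starts at byte %d\n", r, i)
	}

	fmt.Println(runeStarts(s)) // [0 1 4]
}
```

The range loop skips indexes 2 and 3 because they fall in the middle of the three-byte encoding of ⌘.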






