ASCII stands for American Standard Code for Information Interchange, a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices.
In the early days of computing, we dealt with one character set, ASCII. It used 7 bits to represent 128 characters, including uppercase and lowercase English letters, digits, punctuation marks, and device-control characters. Unfortunately, this left a large part of the world's population unable to use their own writing systems on a computer.
Unicode was invented to solve this problem. It is a superset of ASCII and aims to cover the characters of all the world's writing systems.
Unicode assigns each character a standard number called a Unicode code point, or in the Go language, a rune. The rune type is an alias for int32.
Strings and UTF-8 encoding
Go strings often contain Unicode text encoded in UTF-8, which encodes every Unicode code point using one to four bytes. (ASCII characters are encoded with one byte, while other code points use more.)
Since Go source code is encoded as UTF-8, string literals automatically get this encoding.
For example, in the string "Garçon", the character ç is encoded using two bytes, while the ASCII characters (G, a, r, o, and n) use only one byte each.
See the following code.
// hello.go
package main

import (
	"fmt"
)

func main() {
	fmt.Println([]byte("Garçon"))
	fmt.Println([]rune("Garçon"))
}
Output
go run hello.go
[71 97 114 195 167 111 110]
[71 97 114 231 111 110]
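To see where each character starts inside the byte sequence, one option is to range over the string, which decodes one rune per iteration and reports the byte offset at which it begins. The following is a minimal sketch; runeOffsets is a helper name made up for this illustration, not part of the standard library.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper: it returns the byte offset at which
// each rune of s begins.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s { // ranging over a string decodes UTF-8 rune by rune
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "Garçon"
	fmt.Println(len(s))         // byte count: 7
	fmt.Println(runeOffsets(s)) // [0 1 2 3 5 6]
}
```

The offset jumps from 3 to 5 because ç occupies bytes 3 and 4.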
Golang rune
In Go, rune literals are integer values that represent Unicode code points. (They are untyped constants whose default type is rune, so they can take on another numeric type where the context requires it.) "Rune" is Go's term for a Unicode code point; you can usually think of it as a character.
A rune literal represents a rune constant, an integer value that identifies a Unicode code point. For example, the rune literal 'a' is the number 97.
In Go, a rune literal is written as one or more characters enclosed in single quotes, such as 'g' or '\t'. Any character may appear between the quotes except a newline or an unescaped single quote. A Go string, by contrast, is a sequence of bytes.
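As a rough illustration of the literal forms described here, the sketch below collects a plain character, two escaped characters, and two numeric escapes, then prints their integer values. The helper name literalValues is hypothetical and exists only for this example.

```go
package main

import "fmt"

// literalValues is a hypothetical helper that collects a few rune literals
// written in different forms.
func literalValues() []rune {
	return []rune{
		'g',      // a plain character
		'\t',     // an escaped control character (tab)
		'\'',     // an escaped single quote
		'\x41',   // a hex escape for 'A'
		'\u00e7', // a Unicode escape for 'ç'
	}
}

func main() {
	fmt.Println(literalValues()) // [103 9 39 65 231]
}
```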
UTF-8 encodes every Unicode code point in 1 to 4 bytes: ASCII characters take 1 byte, and every other code point takes 2 to 4. ASCII itself contains 128 characters in total, numbered 0 to 127. Each of those numbers is a code point: a single value that identifies one character.
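The byte counts can be checked with utf8.RuneLen from the standard unicode/utf8 package, which reports how many bytes UTF-8 needs for a given rune. This is only a sketch; encodedLen is a hypothetical wrapper added for illustration, and the sample characters are arbitrary choices.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical wrapper around utf8.RuneLen, which reports how
// many bytes UTF-8 needs to encode a rune.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample rune for each possible encoded length.
	for _, r := range []rune{'A', '\u00e7', '\u2665', '\U0001F923'} {
		fmt.Printf("%U needs %d byte(s)\n", r, encodedLen(r))
	}
}
```

The four sample runes need 1, 2, 3, and 4 bytes respectively.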
The rune type is an alias for int32, so a code point is stored as an integer. For example, ASCII defines 128 characters, identified by the code points 0–127.
It covers English letters, digits, and a few other characters. Unicode, the superset of ASCII, defines a codespace of 1,114,112 code points.
Unicode version 10.0 covers 139 modern and historic scripts (including the runic alphabet, but not Klingon) and multiple symbol sets.
When you hear the word “rune,” you can think of it as any or all of the following:
- An integer. (Because rune is int32, possible values range from -2^31 to 2^31-1, but not all of them are valid Unicode code points.)
- A Go type, with the predeclared identifier rune. It is an alias for the type int32.
- A Unicode codepoint.
- A character.
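The first two views in the list can be sketched in code: a rune can be assigned to an int32 variable without conversion, and utf8.ValidRune (from the standard unicode/utf8 package) reports whether a value can legally be encoded as UTF-8. The wrapper name encodable is hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// encodable is a hypothetical wrapper around utf8.ValidRune, which reports
// whether r can be legally encoded as UTF-8.
func encodable(r rune) bool {
	return utf8.ValidRune(r)
}

func main() {
	var r rune = 'K'
	var n int32 = r // no conversion needed: rune and int32 are the same type
	fmt.Println(n)  // 75

	fmt.Println(encodable('K'))                 // true
	fmt.Println(encodable(0xD800))              // false: surrogate half
	fmt.Println(encodable(unicode.MaxRune + 1)) // false: beyond U+10FFFF
	fmt.Println(encodable(-1))                  // false: negative value
}
```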
Example of Rune
See the following code.
// hello.go
package main

import (
	"fmt"
	"reflect"
)

func main() {
	// Creating a rune
	runeK := 'K'
	runeb := 'b'
	runef := '\\'

	// Displaying rune and its type
	fmt.Printf("Rune 1: %c; Unicode: %U; Type: %s", runeK, runeK, reflect.TypeOf(runeK))
	fmt.Printf("\nRune 2: %c; Unicode: %U; Type: %s", runeb, runeb, reflect.TypeOf(runeb))
	fmt.Printf("\nRune 3: Unicode: %U; Type: %s", runef, reflect.TypeOf(runef))
}
Output
go run hello.go
Rune 1: K; Unicode: U+004B; Type: int32
Rune 2: b; Unicode: U+0062; Type: int32
Rune 3: Unicode: U+005C; Type: int32
Unicode Standard Notation for Codepoint
Unicode has a standard notation for code points: U+ followed by the code point in hexadecimal, padded to at least four digits. For example,
- space → U+0020
- K → U+004B
- ♥ → U+2665
- 🤣 → U+1F923
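In Go, the %U verb of the fmt package produces this notation, padding the hexadecimal value to at least four digits. A small sketch, where notation is a hypothetical helper name:

```go
package main

import "fmt"

// notation is a hypothetical helper that formats a rune in U+ notation.
func notation(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	for _, r := range []rune{' ', 'K', '\u2665'} {
		fmt.Println(notation(r))
	}
	// Prints:
	// U+0020
	// U+004B
	// U+2665
}
```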
Print Rune
A rune is a code point, which is why it is an integer.
The following Printf formats work with an integer:
- %c → The character as is.
- %q → The rune syntax, e.g., 'a'.
- %U → Unicode notation, e.g., U+03B1.
- %b → base 2
- %o → base 8
- %d → base 10
- %x → base 16, with lowercase letters for a-f
The earlier rune example used %c and %U to print rune values.
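To compare the verbs side by side, the sketch below formats a single rune, the Greek letter α (U+03B1), with each of them in the order listed. The helper name formatAll is hypothetical.

```go
package main

import "fmt"

// formatAll is a hypothetical helper that renders one rune with the verbs
// %c, %q, %U, %b, %o, %d, and %x, separated by spaces.
func formatAll(r rune) string {
	return fmt.Sprintf("%c %q %U %b %o %d %x", r, r, r, r, r, r, r)
}

func main() {
	fmt.Println(formatAll('\u03b1')) // α 'α' U+03B1 1110110001 1661 945 3b1
}
```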
Conclusion
rune is a type. It occupies 32 bits and is meant to represent a Unicode code point.
As an analogy, the English characters encoded in ASCII have 128 code points, so each one fits inside a byte (8 bits).
From this (erroneous) assumption, C treated characters as bytes (char) and strings as a sequence of characters (char*).
But guess what? Humans have invented many other symbols besides 'abcde…', far more than a single byte can hold, which is why Go uses 32 bits for a rune.
In Go, a string is a sequence of bytes. However, since a single code point may be encoded as multiple bytes, a string value can also be viewed as a sequence of runes. So, a string can be converted to a []rune and back.
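One place where the conversion matters is reversing a string: swapping bytes would split multi-byte characters, while swapping runes keeps them intact. The following is a minimal sketch of that round trip; reverseRunes is a hypothetical helper, not a standard library function.

```go
package main

import "fmt"

// reverseRunes is a hypothetical helper: it reverses s code point by code
// point rather than byte by byte.
func reverseRunes(s string) string {
	runes := []rune(s) // decode the UTF-8 bytes into code points
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes) // encode the code points back into UTF-8
}

func main() {
	fmt.Println(reverseRunes("Garçon")) // noçraG
}
```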