Unicode universe

Not long ago, emails with subjects like “?????? ????????” were commonplace; today the problem has largely disappeared. What changed? To begin with, consider the immense number of languages and scripts that exist (or once existed) in the world. People use them to communicate with each other, but a computer is not so smart: it understands only numbers, so every character has to be mapped to one.

Two main goals of Unicode are:

- to provide a single, universal character set covering every writing system in use;
- to assign each character exactly one unambiguous code point.

The strict Unicode structure does not aim for simplification; it deliberately covers many complex scenarios. This applies, for example, to texts in which several languages are used at once.

Speaking of Unicode, it is important to mention the codespace. The codespace is the set of all possible code points. In total there are 1,114,112 code points in the codespace, of which just under 130,000 (about 12%) were assigned at the time of writing. It is also worth noting that Unicode reserves a number of code points for private use; these are given no standard meaning and can be used by applications for their own purposes.

To picture the structure of the codespace, imagine a plane as a 256 × 256 grid of boxes, one box per code point. Each plane therefore holds 65,536 code points, and there are 17 such planes, giving 17 × 65,536 = 1,114,112 code points in total.
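The plane arithmetic above can be sketched in a few lines of Python; the sample characters are arbitrary choices for illustration:

```python
# A code point's plane is its value divided by 0x10000 (65,536).
CODESPACE_SIZE = 17 * 65_536  # 17 planes of 256 x 256 code points

for ch in ("A", "é", "€", "😀"):
    cp = ord(ch)
    plane = cp // 0x10000
    print(f"U+{cp:04X} ({ch}) lives in plane {plane}")

print(CODESPACE_SIZE)  # 1114112
```

Everyday characters land in plane 0 (the Basic Multilingual Plane); emoji such as 😀 (U+1F600) live in plane 1.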

Many code points in the codespace were adapted or borrowed from earlier encodings, which helps with compatibility: for example, text converted from an older encoding into Unicode can be converted back without loss.

To summarize the information given so far, the key advantages of Unicode are:

- a single codespace covering every writing system;
- compatibility with earlier encodings, so conversions are lossless;
- reserved private-use code points alongside the standardized ones.

When working with text, it is important to think about its internal representation. In Unicode, the internal representation is an array of unsigned 8-, 16-, or 32-bit integers, depending on the encoding form. These encoding forms (we will talk about them later) are UTF-8, UTF-16, and UTF-32, where UTF stands for Unicode Transformation Format. A developer should decide on the encoding both when reading text data and when writing it out.

Unicode supports many diacritical marks, and a mark can be applied to one letter or to several. Subtle situations arise with characters such as “é”, “ö”, and “ż”. Do they consist of one character or two, the letter itself plus the diacritical mark? If the string is reversed, will the mark stay attached to the right letter? And how can a user search for such a character when the keyboard has no key for it?

Unicode answers this with dynamically composed characters: a base code point combined with one or more combining code points produces the desired character. When a text editor encounters such a sequence in a line, the composed character is rendered automatically.
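A minimal sketch of this in Python, using the standard `unicodedata` module: “é” exists both as a single precomposed code point and as “e” plus a combining accent, and NFC normalization merges the pair:

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal code-point-wise.
print(precomposed == combining)  # False

# NFC normalization composes the base letter and the accent.
composed = unicodedata.normalize("NFC", combining)
print(precomposed == composed)   # True
print(len(combining), len(composed))  # 2 1
```

This is exactly why text search and string comparison should normalize their inputs first.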

ASCII and Unicode are two different encoding approaches. Their main task is to represent characters (store, transmit, and so on) in binary form. The key differences between ASCII and Unicode are the encoding approach and the number of bits allocated per character.

ASCII was originally a 7-bit code, though its extended variants use 8 bits per character; Unicode allows choosing between 8-, 16-, and 32-bit code units thanks to its variable-width encoding forms. Why would we need more bits? More bits make it possible to represent more characters, at the cost of larger files; using fewer bits saves space.

In Unicode, all code points are standardized, whereas ASCII spawned many non-standard extensions. For example, a developer may face problems displaying characters unless he or she uses the common code pages that Microsoft uses.

The advantage of Unicode is that it can store a very large number of characters. Even though Unicode already covers most written languages, there is still enough free space to add new ones. That is why Unicode is expected to remain relevant for the foreseeable future.

Given the widespread popularity of ASCII, Unicode was designed to be compatible with it. More precisely, the first 128 code points of Unicode coincide with the characters of 7-bit ASCII. A code page is a coding scheme that maps certain combinations of bits to character representations. Thus, when opening an ASCII-encoded file as Unicode, a developer still gets the correct characters.

From the above, it can be concluded that the advantages of Unicode over ASCII are:

- a far larger character repertoire, with room left to grow;
- standardized code points instead of competing non-standard code pages;
- backward compatibility with ASCII itself.

Why do we need encoding at all? For different types of data to be usable across different systems, the data must be encoded consistently. Encoding is not about hiding data; it is about preparing it for later use. No secret keys are needed, only a commonly agreed scheme by which everything works, and decoding follows the corresponding algorithm.

Today, Unicode uses several encoding forms; the most common are UTF-8, UTF-16, and UTF-32.

All valid Unicode code points can be encoded with UTF-16. This encoding form is used internally by Microsoft Windows, among others.

UTF-32 allocates 4 bytes per code point. When working with UTF-32, additional memory must be provided, especially for large texts. Today, UTF-32 is used very rarely.

UTF-8 and UTF-16 are used far more often. UTF-8 uses a single byte (8 bits) for ASCII characters such as English letters; other characters are encoded as sequences of two to four bytes. UTF-16 encodes most characters in 16 bits (2 bytes); characters outside the Basic Multilingual Plane are represented by a pair of 16-bit units called a surrogate pair.

The main advantage of UTF-8 is its compatibility with ASCII, whose character set uses 1 byte. A file containing only ASCII characters encoded with UTF-8 is the same size as the file encoded with ASCII. This is impossible with UTF-16, where every character occupies at least 2 bytes.
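The size difference is easy to measure; the little-endian variants are used here simply to avoid counting a byte-order mark, and the sample strings are arbitrary:

```python
# Compare how many bytes each encoding form needs for the same text.
for label, text in (("ascii-only", "Hello"), ("accented", "Héllo")):
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(label, enc, len(text.encode(enc)))
```

For "Hello", UTF-8 needs 5 bytes, UTF-16 needs 10, and UTF-32 needs 20; adding a single accented letter costs UTF-8 only one extra byte.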

When errors occur, UTF-16 can recover corrupted fragments, although some bytes may be lost permanently. UTF-8 recovers better: because every byte is marked as either a lead byte or a continuation byte, a decoder can resynchronize and still decode all the uncorrupted bytes.
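A small sketch of this resilience: if one byte of a UTF-8 stream is damaged, a replacing decoder loses only the affected character while the surrounding ones survive intact.

```python
data = "naïve".encode("utf-8")            # b'na\xc3\xafve'
corrupted = data[:2] + b"\xff" + data[3:]  # damage one byte of the 'ï' sequence

# The decoder substitutes U+FFFD for the unreadable bytes and carries on;
# 'n', 'a', 'v', and 'e' all come through undamaged.
print(corrupted.decode("utf-8", errors="replace"))
```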

Why choose UTF-8 instead of UTF-16? The reasons are the following:

- full compatibility with ASCII;
- smaller files for text dominated by ASCII characters;
- no byte-order (endianness) ambiguity;
- better recovery from corrupted data.

As noted above, UTF-8 uses 8-bit code units, is ASCII-compatible, and has several other advantages. Still, there are a couple of features worth highlighting separately.

UTF-8 adds marker bits to each byte of a multibyte sequence. In the first byte, bits 7 and 6 are set (the byte begins with 11), and in each subsequent byte only bit 7 is set (the byte begins with 10). For example:
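These markers can be inspected directly by printing the encoded bytes in binary; the characters chosen here are arbitrary examples of one-, two-, and three-byte sequences:

```python
# Show the bit patterns UTF-8 produces for characters of different widths.
for ch in ("A", "é", "€"):
    bits = [f"{b:08b}" for b in ch.encode("utf-8")]
    print(ch, bits)

# A ['01000001']                          single byte, high bit clear
# é ['11000011', '10101001']              lead byte 11..., continuation 10...
# € ['11100010', '10000010', '10101100']  lead byte 111..., continuations 10...
```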

Why not let every developer pick whatever encoding they like? From a technical standpoint, mixing encodings leads to errors and to plenty of time spent fixing them. Almost 95% of all webpages use UTF-8, making it the dominant encoding.
