Computers, Codes, and Characters.

Overview

These are my notes as I attempt to understand the origins of computing. This page covers the path I took to address a real-world issue at work caused by improper encoding, as referenced in RFC 822. Needless to say, before these notes I was not familiar with '822', the RFC, or even who wrote RFCs. That led me to discover the IETF, which in turn led me to character codes in their pre-encoded state. So these notes combine a linear problem to solve with the various tangents I picked up along the way.

Introduction

I'm working on a bug ticket about encoding errors that are loosely related to RFC 822 and RFC 2047. Those two terms introduced me to the IETF, which in turn got me interested in TCP/IP, which then brought me back to the subject of those RFCs: encoding. All of this is new territory for me. I think it's safe to say that any passionate web developer will find these subjects interesting, since they are the foundation of our work. As such, I'm keeping some notes on what I'm learning.

What is a character encoding, and why should I care?

source

If you use anything other than the most basic English text, people may not be able to read the content you create unless you say what character encoding you used.

Not only does lack of character encoding information spoil the readability of displayed text, but it may mean that your data cannot be found by a search engine, or reliably processed by machines in a number of other ways.

Characters that are needed for a specific purpose are grouped into a character set (also called a repertoire). (To refer to characters in an unambiguous way, each character is associated with a number, called a code point.)

The characters are stored in the computer as one or more bytes.

Basically, you can visualise this by assuming that all characters are stored in computers using a special code, like the ciphers used in espionage. A character encoding provides a key to unlock (i.e. crack) the code. It is a set of mappings between the bytes in the computer and the characters in the character set. Without the key, the data looks like garbage.

So, when you input text using a keyboard or in some other way, the character encoding maps characters you choose to specific bytes in computer memory, and then to display the text it reads the bytes back into characters.
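
To make the "key" idea concrete for myself, here is a minimal Python sketch of the round trip from characters to bytes and back. The sample word and the choice of Latin-1 as the "wrong key" are my own assumptions for illustration, not something taken from the source article.

```python
# Sample text with one non-ASCII character (chosen for illustration).
text = "caf\u00e9"  # "café"

# Encoding: characters -> bytes, with UTF-8 as the key.
data = text.encode("utf-8")
print(data)  # b'caf\xc3\xa9'

# Decoding with the same key recovers the original characters.
round_trip = data.decode("utf-8")
print(round_trip == text)  # True

# Decoding with a different key (Latin-1) raises no error,
# but the two bytes of "é" are read as two unrelated characters.
garbled = data.decode("latin-1")
print(garbled)  # cafÃ©
```

The last line is the kind of garbage the paragraph above describes: the bytes are intact, but they were read with the wrong mapping.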

As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need. This greatly simplifies things. Using Unicode throughout your system also removes the need to track and convert between various character encodings.

UTF-8 is the most widely used way to represent Unicode text in web pages, and you should always use UTF-8 when creating your web pages and databases. But, in principle, UTF-8 is only one of the possible ways of encoding Unicode characters.
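
As a quick check of the point that UTF-8 is only one way of encoding Unicode, the sketch below encodes the same character with three Unicode encodings. The euro sign is my own choice of example, and I use the big-endian variants so that no byte order mark is added to the output.

```python
ch = "\u20ac"  # EURO SIGN, code point U+20AC

# One code point, three different byte sequences.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(utf8.hex())   # e282ac
print(utf16.hex())  # 20ac
print(utf32.hex())  # 000020ac
```

The code point stays the same in all three cases; only the bytes that represent it change.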

Unicode

source

Unicode is a universal character set, i.e. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

Text in a computer or on the Web is composed of characters. Characters represent letters of the alphabet, punctuation, or other symbols.

The first 65,536 code point positions in the Unicode character set are said to constitute the Basic Multilingual Plane (BMP). The BMP includes most of the more commonly used characters.
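
The BMP boundary is easy to test in code. This is a small sketch of my own: the helper name `in_bmp` is hypothetical, and the sample characters are just ones I picked.

```python
BMP_LAST = 0xFFFF  # the BMP spans code points 0 through 65,535

def in_bmp(ch: str) -> bool:
    """Return True if the single character ch lies in the BMP."""
    return ord(ch) <= BMP_LAST

print(in_bmp("A"))           # True  (U+0041)
print(in_bmp("\u2d63"))      # True  (U+2D63, a Tifinagh letter)
print(in_bmp("\U0001F600"))  # False (U+1F600, an emoji outside the BMP)
```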

Character set versus character encoding

It is important to clearly distinguish between the concepts of a character set versus a character encoding.

A character set or repertoire comprises the set of characters one might use for a particular purpose – be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).

A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. A code point value represents the position of a character in the coded character set. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for referring to code points, and will be used here.)
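
The á example can be verified with Python's built-in `ord` and `chr`, which convert between a character and its Unicode code point. This is only a quick sketch to confirm the numbers quoted above.

```python
ch = "\u00e1"  # the letter á, written as an escape to avoid ambiguity

code_point = ord(ch)          # character -> code point
print(code_point)             # 225
print(f"{code_point:X}")      # E1

same_char = chr(0xE1)         # code point -> character
print(same_char == ch)        # True
```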

Coded character sets are sometimes called code pages.

The character encoding reflects the way the coded character set is mapped to bytes for manipulation in a computer. The source article illustrates this with a diagram showing how characters and code points in the Tifinagh (Berber) script are mapped to sequences of bytes in memory using the UTF-8 encoding. In that diagram, the code point value for each character is listed immediately below its glyph (i.e. the visual representation), and arrows show how each code point maps to a sequence of bytes, where each byte is represented by a two-digit hexadecimal number. Note how the Tifinagh code points map to three bytes, but the exclamation mark maps to a single byte.
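
The same mapping can be reproduced in a few lines of Python. The sample string below (three Tifinagh letters followed by an exclamation mark) is my own assumption and may not match the exact text in the original diagram.

```python
# Three Tifinagh letters plus "!" (sample chosen for illustration).
text = "\u2d63\u2d53\u2d4d!"

rows = []
for ch in text:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    rows.append((f"U+{ord(ch):04X}", hex_bytes))
    print(f"U+{ord(ch):04X} -> {hex_bytes}")

# Prints:
# U+2D63 -> E2 B5 A3
# U+2D53 -> E2 B5 93
# U+2D4D -> E2 B5 8D
# U+0021 -> 21
```

Each Tifinagh code point becomes three bytes, while the exclamation mark stays a single byte.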

IETF

What Is the IETF?

The IETF is a loosely self-organized group of people who contribute to the engineering and evolution of Internet technologies. It is the principal body engaged in the development of new Internet standard specifications. The IETF is unusual in that it exists as a collection of happenings, but is not a corporation and has no board of directors, no members, and no dues; see [BCP95], "A Mission Statement for the IETF", for more detail.

IETF Resources

RFC 822

Notes from [RFC 822](https://tools.ietf.org/html/rfc822)

Overview

This standard specifies a syntax for text messages that are sent among computer users, within the framework of "electronic mail".

In this context, messages are viewed as having an envelope and contents. The envelope contains whatever information is needed to accomplish transmission and delivery. The contents compose the object to be delivered to the recipient.

Focus is the message format, not the 'envelope'

This standard applies only to the format and some of the semantics of message contents. It contains no specification of the information in the envelope.
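
To see what "message contents" look like in practice, here is a minimal sketch using Python's standard `email` package. The addresses and text are made up, and the package follows later revisions of the mail format rather than RFC 822 itself, so treat it as an approximation. The basic layout is the same: header fields, a blank line, then the body. There is no envelope anywhere in the message text.

```python
from email import message_from_string

# A made-up message: header fields, one blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "This is the body.\n"
)

msg = message_from_string(raw)

# Header fields are looked up by name.
print(msg["From"])     # alice@example.com
print(msg["Subject"])  # Hello

# Everything after the blank line is the body.
body = msg.get_payload()
print(body)            # This is the body.
```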


catch all

TO READ

UTF-8

Unicode is a character set. UTF-8 is an encoding.

source
  • ASCII
  • UTF-8
  • UTF-16
  • MIME
  • RFC822
  • RFC2047
  • Base64
  • IETF
  • TCP/IP
  • Encoding
  • Decoding