Introduction to Character/Text Encoding in Web

Character Encoding in HTML, CSS & JavaScript

In this article, we will learn about different techniques to display ASCII and UTF encoded characters in JavaScript, CSS, and HTML.

Encoding is a way to convert one format of data into another. Character encoding is a way to convert a character that can be displayed on the screen into a binary representation so that it can be stored in memory or transferred over a network.
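To make the idea of "a character becomes binary" concrete, here is a minimal sketch. It assumes the TextEncoder global is available (modern browsers and recent Node.js versions) and note that TextEncoder always produces UTF-8 bytes. The helper name toBinary is made up for this illustration.

```javascript
// Assumption: TextEncoder exists as a global. It always encodes to UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: return the encoded bytes of a string as 8-bit binary strings.
function toBinary(text) {
  return Array.from(encoder.encode(text), (byte) =>
    byte.toString(2).padStart(8, '0')
  );
}

console.log(toBinary('A')); // [ '01000001' ]
console.log(toBinary('©')); // [ '11000010', '10101001' ]
```

The character A fits in one byte, while © needs two bytes in UTF-8. A different encoding would produce a different byte sequence for the same character.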

I have discussed character encoding in depth in a separate article, Introduction to Character Encoding. I recommend reading the Character Encoding topic in that article before going any further. But if you already understand how ASCII and UTF character encodings work, feel free to skip it.

When a browser receives a file or data from the server, the response generally carries a Content-Type header. This header defines the MIME type of the document and its encoding.

For example, an HTML file sent by the server will have the Content-Type header with the value text/html;charset=UTF-8. The browser uses this value to render the document appropriately, as long as the document received really is HTML and is encoded in UTF-8.
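As a rough illustration of what the browser reads from this header, the sketch below splits a Content-Type value into its MIME type and charset. The function parseContentType is a hypothetical helper written for this article, not a browser API, and it ignores many edge cases that real header parsers handle.

```javascript
// Hypothetical helper: split a Content-Type header value into MIME type and charset.
function parseContentType(headerValue) {
  const [mimeType, ...params] = headerValue
    .split(';')
    .map((part) => part.trim());

  let charset = null;
  for (const param of params) {
    const [name, value] = param.split('=');
    if (name.trim().toLowerCase() === 'charset' && value !== undefined) {
      // Strip optional surrounding double quotes, e.g. charset="UTF-8".
      charset = value.trim().replace(/^"|"$/g, '');
    }
  }
  return { mimeType: mimeType.toLowerCase(), charset };
}

console.log(parseContentType('text/html;charset=UTF-8'));
// { mimeType: 'text/html', charset: 'UTF-8' }

console.log(parseContentType('text/css'));
// { mimeType: 'text/css', charset: null }
```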

However, we can also specify the encoding of the HTML document in the document itself using a meta tag with http-equiv="content-type".

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

However, this meta tag was used up until HTML 4. In HTML5, we use the charset meta tag, as shown below.

<meta charset="UTF-8">

These meta tags tell the browser about the document's encoding. The browser first looks at the Content-Type header and, if it declares a charset, decodes the document with that encoding. If the header is missing or declares no charset, the browser starts with a default encoding.

If the browser then encounters one of the above meta tags declaring a different encoding from the one it assumed, it re-parses the document using the declared encoding.

This also applies to external JavaScript and CSS files, which can be encoded in any encoding scheme. We can use the Content-Type header to specify the encoding of these files.

If the response for an external CSS file lacks a Content-Type header, we can use the @charset rule to specify the encoding of the file in the CSS code itself.

@charset "utf-8";

The Content-Type header and the other metadata described above define the encoding of the document file as a whole, so that the browser can parse its bytes into a valid sequence of characters.

In this article, however, we are going to talk about encoding individual characters using the tools provided by each language.

Character Encoding in JavaScript

ASCII is one of the most popular encodings still in use today. It is a fixed-length encoding scheme that uses 7 bits to encode each character. It can only encode the letters of the English alphabet, digits, and some commonly used symbols like the dash (-) and period (.).

JavaScript engines, however, store strings in the UTF-16 encoding format. UTF-16 is a variable-length encoding scheme with 16-bit code units. It can encode every character in the Unicode character set with 1 or 2 code units.

If you are coming from the article I mentioned before, then you know what a code unit is. The code unit is the building block of the character’s encoding representation. For UTF-16, the code unit is 16 bits. If a character needs more memory, it can add another code unit making a total of 32 bits.

Some old JavaScript engines might use the UCS-2 encoding. UCS-2 is a fixed-length encoding that always uses exactly 16 bits to encode a character. Since UCS-2 and UTF-16 both use the Unicode character set, their encodings are identical for every character that fits in a single 16-bit code unit.

Since UTF-16 can use 16 or 32 bits to encode a character, it can encode more characters than UCS-2, which only uses 16 bits per character. The UCS-2 encoding scheme is now obsolete.

Newer JavaScript engines use the UTF-16 encoding scheme to encode the characters in a string. As of ECMAScript 2015 (ES6), string literals are stored in UTF-16 encoding and have full UTF-16 support.

If you are coming from the article I mentioned before, then you know what a code point is. A code point is the integer value assigned to a character, which a program uses to identify the character. The code point is important because no matter how a character is encoded, this unique integer value always points to the same character.

For ASCII encoding, since we have 7 bits to encode a character, a total of 128 (2⁷) characters can be encoded with ASCII encoding with the value from 0 to 127. Hence the code point of the character ranges from 0 to 127. The code point of character A is 65₁₀ or 41₁₆.

For UTF-16 encoding, we can encode a character in 16 bits or 32 bits. If you look at the UTF-16 encoding table, characters with a code point from 0₁₆ to FFFF₁₆ are represented with 1 code unit (16 bits) and characters with a code point from 10000₁₆ to 10FFFF₁₆ are represented with 2 code units (32 bits).

In the Unicode character set, the code point of character A is 65₁₀ or 41₁₆, and the code point of character आ is 906₁₆. Since the code points of these characters fall in the 0₁₆ - FFFF₁₆ range, each can be encoded with just one UTF-16 code unit. Hence each of these characters takes only 16 bits of memory.

However, characters with a code point greater than FFFF₁₆ need two code units. For example, emoticon character 😊 (happy face) has code point 1F60A₁₆. Hence it needs two code units or 32 bits of memory to encode.

When a character needs two code units, each code unit is called a surrogate code unit and together they are called a surrogate pair. For character 😊, the surrogate pair is D83D₁₆ DE0A₁₆.
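If you are curious where D83D₁₆ and DE0A₁₆ come from, the sketch below follows the standard UTF-16 rule: subtract 10000₁₆ from the code point, then split the remaining 20 bits into two 10-bit halves. The function name toSurrogatePair is my own and is only meant as an illustration.

```javascript
// Hypothetical helper: compute the UTF-16 surrogate pair for a code point
// in the range 0x10000 to 0x10FFFF.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Code point must be between 0x10000 and 0x10FFFF');
  }
  const offset = codePoint - 0x10000;       // a 20-bit value
  const high = 0xD800 + (offset >> 10);     // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F60A);
console.log(high.toString(16), low.toString(16)); // d83d de0a
```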

Unicode Escape

We can store any character using a string literal in JavaScript, that is, by putting a character or sequence of characters in a single or double-quoted string. JavaScript will store this string in UTF-16 encoding for us.

However, we can also represent a character by its code unit value. For this, we use the \u prefix followed by the hexadecimal value of the character's UTF-16 code unit. The \u prefix is called the Unicode Escape character.

A code unit written with the Unicode Escape character forms a Unicode Escape. Hence for characters encoded in a single UTF-16 code unit, a Unicode Escape looks like below.

'\uXXXX'

Here, XXXX is the hexadecimal representation of the character's code point, written with exactly 4 digits. In this case, the code unit is identical to the code point.

When a character is encoded with two code units, we need two Unicode Escapes, one for each surrogate code unit.

'\uYYYY\uXXXX'

💡 When one or more Unicode Escapes are put together, they are called a Unicode Character Escape Sequence.

⦿ — — — — ⦿

ASCII Charset

All the characters with a code point between 0₁₆ (0₁₀) and 7F₁₆ (127₁₀) belong to the ASCII character set. Since these characters are encoded with just one UTF-16 code unit, we need only one Unicode Escape to represent each of them.

For example, for character A with code point 41₁₆ (65₁₀), the code unit looks exactly like the code point. Hence, it can be written as 0041₁₆, and its Unicode Escape is rather simple.

var characterA = '\u0041';
console.log( characterA ); // logs: A

This applies to all the characters from the ASCII character set.

If we only want to represent characters from the ASCII and extended-ASCII (like ISO 8859–1) character sets, we can use the \x prefix, which is called the hexadecimal escape character.

\x is followed by exactly 2 hexadecimal digits (a single byte), which give the code point of the character in the Unicode character set. Hence, the hexadecimal escape sequence for the characters A© is as below.

console.log( '\x41\xA9' ); // logs: A©

⦿ — — — — ⦿

UTF Charset

Unicode characters can take either one or two UTF-16 code units. Hence, how we write a Unicode Escape depends on the code point of the character.

For character आ with code point 906₁₆, since it can be encoded in just one code unit, its Unicode Escape looks like below.

console.log( '\u0906' ); // logs: आ

The character 😊 with code point 1F60A₁₆ needs two code units of UTF-16. D83D₁₆ and DE0A₁₆ are the values of each code unit (16-bit number). Hence the Unicode Escape Sequence of this character looks like below.

console.log( '\uD83D\uDE0A' ); // logs: 😊

You can find UTF-16 code units of a character using this online tool. However, dealing with UTF-16 escapes is not always easy, as we need to know in advance how many code units a character takes.
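If you would rather stay in the console, a small sketch like the one below lists the UTF-16 code units of any string using the built-in charCodeAt method. The function name utf16CodeUnits is made up for this example.

```javascript
// Hypothetical helper: list the UTF-16 code units of a string
// as 4-digit uppercase hexadecimal values.
function utf16CodeUnits(text) {
  const units = [];
  // .length counts code units, so this loop visits every code unit once.
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0'));
  }
  return units;
}

console.log(utf16CodeUnits('A'));  // [ '0041' ]
console.log(utf16CodeUnits('आ'));  // [ '0906' ]
console.log(utf16CodeUnits('😊')); // [ 'D83D', 'DE0A' ]
```

Each value in the returned array can be placed directly after a \u prefix to form a Unicode Escape.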

Also, each code unit must be written as exactly 4 hexadecimal digits to form a Unicode Escape. Otherwise, JavaScript won't be able to decode the escape sequence, as shown in the example below.

console.log('\u41');
Error: Uncaught SyntaxError: Invalid Unicode escape sequence

console.log('\u0041');
A

ES6 provides a new way to represent a character with Unicode Escapes using the code point of the character. Using \u{} notation, a character can be represented using its hexadecimal code point.

console.log( '\u{41}' ); // logs: A
console.log( '\u{906}' ); // logs: आ
console.log( '\u{1F60A}' ); // logs: 😊
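ES6 also added String.fromCodePoint and String.prototype.codePointAt, which work with whole code points at runtime instead of in string literals. The short sketch below shows that the three ways of writing 😊 produce the same string; the variable name smiley is only for illustration.

```javascript
// Build the character from its code point at runtime.
const smiley = String.fromCodePoint(0x1F60A);

// All three notations describe the same string.
console.log(smiley === '\u{1F60A}');    // true
console.log(smiley === '\uD83D\uDE0A'); // true

// codePointAt reads the full code point; charCodeAt reads a single code unit.
console.log(smiley.codePointAt(0).toString(16)); // 1f60a
console.log(smiley.charCodeAt(0).toString(16));  // d83d
```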

⦿ — — — — ⦿

Mixed Characters

A Unicode Escape is nothing but a fancy way to represent a character using its encoding information. Hence, it is valid to mix regular characters in their plain form with Unicode Escape sequences.

console.log( '\x41\u0020\uD83D\uDE0A\u0020\u006d\u0061\u006e' );
// A 😊 man

console.log( 'A \uD83D\uDE0A man' );
// A 😊 man

console.log( 'A \u{1F60A} man' );
// A 😊 man

💡 You can check this ASCII table to find out the code points of the characters from the ASCII character set. Or you can use this UTF character set.

⦿ — — — — ⦿

Length of a string

In JavaScript, we use the .length property of a string, which is usually described as returning the number of characters in the string.

console.log( 'Ab'.length ); // logs: 2

However, we have been deceived by this definition. In practice, the length property returns the total number of UTF-16 code units used to encode the string. This is demonstrated in the example below.

console.log( '😊'.length ); // logs: 2

Since the character 😊 needs two UTF-16 code units, we get 2 as the length of the string. There is no built-in property in JavaScript that returns the Unicode-aware length of a string.

But an easy fix, using the ES6 spread syntax, is to split the string into an array of characters and get the array's length, as shown in the example below.

console.log( [ ...'😊'].length ); // logs: 1
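The spread trick works because string iteration in ES6 walks over code points rather than code units. The sketch below does the same thing with a for...of loop, which avoids building a temporary array. The function name countCodePoints is hypothetical.

```javascript
// Hypothetical helper: count the code points in a string.
// for...of iterates a string by code point, so a surrogate pair counts once.
function countCodePoints(text) {
  let count = 0;
  for (const _codePoint of text) {
    count++;
  }
  return count;
}

console.log('A 😊 man'.length);            // 8 (code units)
console.log(countCodePoints('A 😊 man'));  // 7 (code points)

// Caveat: a letter followed by a combining mark is still two code points,
// even though it is displayed as one character.
console.log(countCodePoints('e\u0301'));   // 2
```

Keep in mind that this counts code points, not what a reader perceives as characters, so the result can still differ from what is shown on the screen.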

Character Encoding in HTML

How characters are represented in HTML has nothing to do with JavaScript's default encoding. However, HTML also has a way to represent a character using its code point.

& is a special character in HTML, although developers often overlook this. The &# prefix is reserved for representing a Unicode character, and it should be followed by the decimal code point of the character and a terminating semicolon (;).

We can also use the &#x prefix to represent the same character with its hexadecimal code point. Since we are dealing with code points only, there is no concept of surrogate pairs in HTML.

💡 If we need to print the & character on the screen, we should use the &amp; notation, which stands for the ampersand character.

&#65; <!-- character A -->
&#x41; <!-- character A -->
&#x1F60A; <!-- character 😊 -->
&#128522; <!-- character 😊 -->
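If we need to generate such references from JavaScript, a sketch like the one below converts every character of a string into its hexadecimal form. The function toNumericReferences is a made-up helper for this article, and a real application would usually only escape the characters that actually need it.

```javascript
// Hypothetical helper: convert each code point of a string into an
// HTML hexadecimal character reference such as &#x41;
function toNumericReferences(text) {
  let result = '';
  for (const char of text) {
    // for...of yields whole code points, so no surrogate pairs leak into the output.
    result += '&#x' + char.codePointAt(0).toString(16).toUpperCase() + ';';
  }
  return result;
}

console.log(toNumericReferences('A😊')); // &#x41;&#x1F60A;
```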

To tell the browser about the encoding of the document, we should also add the following meta tag. Modern browsers can often render UTF-8 content correctly even without this declaration, but declaring the charset explicitly avoids guesswork.

<html>
  <head>
    <meta charset="UTF-8">

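💡 The HTML5 specification requires authors to encode web pages in UTF-8, but a browser may still fall back to a different encoding when a page does not declare one.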

HTML also provides named aliases for common characters, such as &amp; for the & (ampersand) character. These are called escaped characters, as we are escaping characters that might have a different meaning in some contexts.

&quot; <!-- character " -->
&amp; <!-- character & -->
&lt; <!-- character < -->
&gt; <!-- character > -->
&copy; <!-- character © -->

You can follow this table for more HTML entities.

Character Encoding in CSS

We can also use Unicode escape sequences in CSS, but here we don't use the \u prefix. Instead, we use the \ character followed by the hexadecimal representation of the character's code point (up to six hexadecimal digits).

Typically, we use the ::before and ::after pseudo-elements to add text content inside a DOM element, although escapes are not limited to that use. The content property of the pseudo-element holds the string value to inject.

.some-elem::before {
  content: "\0041"; /* A */
  content: "\00A9"; /* © */
  content: "\0906"; /* आ */
  content: "\1F60A"; /* 😊 */
}
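To produce these CSS escapes programmatically, we could use a sketch like the one below. The function toCssEscape is a hypothetical helper. It appends a single space after every escape because CSS treats one whitespace character after an escape as its terminator, which stops a following character such as "b" from being read as part of the hexadecimal number.

```javascript
// Hypothetical helper: convert each code point of a string into a CSS escape.
// The trailing space terminates the escape and is not rendered by CSS.
function toCssEscape(text) {
  let result = '';
  for (const char of text) {
    result += '\\' + char.codePointAt(0).toString(16).toUpperCase() + ' ';
  }
  return result;
}

console.log(toCssEscape('A😊')); // "\41 \1F60A " (note the trailing space)
```

The returned string can be placed inside the quotes of a content value in a stylesheet.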

You can read about character encoding in depth from this article.

Introduction to Character Encoding

Introduction to Character/Text Encoding in Web was originally published in ITNEXT on Medium, where people are continuing the conversation by highlighting and responding to this story.
