UTF-8 encoding form

The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length.
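As an illustration of how a scalar value maps to one to four bytes, here is a minimal sketch in Python. The function name `utf8_encode_scalar` is a hypothetical helper introduced for this example, not part of any standard API; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode_scalar(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    # Surrogates (D800..DFFF) and values above 10FFFF are not scalar values.
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Encode a short sequence of scalar values and show the bytes in hex.
encoded = b"".join(utf8_encode_scalar(cp) for cp in (0x004D, 0x0430, 0x4E8C, 0x10302))
print(encoded.hex(" "))
```

Each branch always picks the shortest sequence that can hold the value, so this sketch never produces a non-shortest form.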

  • In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is represented as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where <4D> corresponds to U+004D, <D0 B0> corresponds to U+0430, <E4 BA 8C> corresponds to U+4E8C, and <F0 90 8C 82> corresponds to U+10302.
  • Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 of the Unicode Standard (Well-Formed UTF-8 Byte Sequences) is ill-formed.
  • Before the Unicode Standard, Version 3.1, the problematic "non-shortest form" byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7.
  • Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.
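
The ill-formed cases in the notes above can be checked with a short Python sketch. It relies on Python 3's built-in strict UTF-8 decoder, which rejects non-shortest forms and encoded surrogates. The helper name `is_well_formed_utf8` and the sample byte strings are assumptions chosen for illustration; they are not taken from the Unicode Standard.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
        return True
    except UnicodeDecodeError:
        return False


# Illustrative samples (chosen for this sketch).
samples = {
    "well-formed example": b"\x4d\xd0\xb0\xe4\xba\x8c\xf0\x90\x8c\x82",
    "non-shortest form of U+002F (2 bytes)": b"\xc0\xaf",
    "non-shortest form of U+002F (3 bytes)": b"\xe0\x80\xaf",
    "would map to surrogate U+D800": b"\xed\xa0\x80",
    "truncated 3-byte sequence": b"\xe4\xba",
}

for label, raw in samples.items():
    print(f"{label}: {is_well_formed_utf8(raw)}")
```

Only the first sample is accepted; the others fail because no well-formed pattern matches them.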