Work with Unicode, CCSID & DBCS
Published: 2019-06-29


Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application executable to work for a global audience.

Unicode provides a unique number for every character, regardless of platform, language, or program.
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.
UTF-8 encodes each Unicode code point as a sequence of one to four 8-bit units (bytes).
OS/400(R) supports UTF-8 encoding with CCSID 1208.
UTF-16 is an encoding of Unicode in which each character is composed of either one or two 16-bit elements.
OS/400(R) supports UTF-16 encoding with CCSID 1200.
UTF-32 is an encoding of Unicode in which each character is composed of 4 bytes.
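The difference between the three transformation formats can be seen by encoding a few characters and counting bytes. This is a minimal Python sketch; the sample characters and the helper name `encoded_lengths` are illustrative assumptions, not part of the original article. The big-endian codec variants are used only so that no byte-order mark is added to the count.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the code point of one character and its byte length in each UTF."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),      # 1 to 4 bytes
        "utf-16": len(ch.encode("utf-16-be")), # one or two 16-bit units
        "utf-32": len(ch.encode("utf-32-be")), # always 4 bytes
    }

# Illustrative samples: Latin letter, accented letter, CJK ideograph, emoji.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

for ch in samples:
    print(ch, encoded_lengths(ch))
```

The emoji lies outside the Basic Multilingual Plane, so it is the only sample that needs two 16-bit units in UTF-16, while every sample takes exactly 4 bytes in UTF-32.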
The ASCII (American Standard Code for Information Interchange) character set uses 7-bit units, with a trivial encoding designed for 7-bit bytes. It remains the most important character set in use today, despite covering very few characters, because its design is the foundation for most modern character sets. ASCII provides only 128 numeric values, and 33 of those are reserved for control characters with special functions.
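The figures for ASCII can be checked with a short Python sketch. The variable names and the sample string are illustrative assumptions; the comparison with UTF-8 is added here as one concrete instance of a modern encoding built on ASCII.

```python
# ASCII code values run from 0 to 127, so each fits in 7 bits.
ASCII_CODES = range(128)

# Control characters: 0x00-0x1F plus DEL (0x7F).
control_codes = [c for c in ASCII_CODES if c < 0x20 or c == 0x7F]
printable_codes = [c for c in ASCII_CODES if c not in control_codes]

print(len(control_codes), "control codes,", len(printable_codes), "printable codes")

# Pure ASCII text is byte-for-byte identical when encoded as UTF-8.
sample = "Hello, CCSID!"
print(sample.encode("ascii") == sample.encode("utf-8"))
```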
The EBCDIC (Extended Binary-Coded Decimal Interchange Code) character set and a number of associated character sets, designed by IBM(R) for its mainframes, use 8-bit bytes. EBCDIC was developed around the same time as ASCII, shares the same set of base characters, and has other similar properties. Unlike in ASCII, the Latin letters do not form two contiguous blocks for upper and lower case. Instead, the letters are arranged so that the second hexadecimal digit of each value is 1 through 9 (another punch card-friendly design).
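The letter layout can be inspected with Python's built-in `cp037` codec. Choosing `cp037` is an assumption: it is one of several EBCDIC code pages, and the article does not name a specific one. The helper name `ebcdic_byte` is hypothetical.

```python
import string

def ebcdic_byte(ch: str) -> int:
    """Return the single EBCDIC (code page 037) byte value for one character."""
    return ch.encode("cp037")[0]

# Collect the second hex digit (low nibble) of every Latin letter.
low_nibbles = {ebcdic_byte(ch) & 0x0F for ch in string.ascii_letters}
print(sorted(low_nibbles))

# Adjacent letters in the alphabet are not always adjacent in EBCDIC.
gap_ebcdic = ebcdic_byte("J") - ebcdic_byte("I")
gap_ascii = ord("J") - ord("I")
print(f"I->J gap: EBCDIC {gap_ebcdic}, ASCII {gap_ascii}")

for ch in "AIJRSZ":
    print(ch, f"0x{ebcdic_byte(ch):02X}")
```

In this code page the upper-case letters fall into three runs (A to I, J to R, S to Z), which is why the step from I to J skips several values while the same step in ASCII is 1.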
The most common encodings (character encoding schemes) use a single byte per character, and they are often called single-byte character sets (SBCS). They are all limited to 256 characters. Because of this, none of them can even cover all of the accented letters for the Western European languages.
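The 256-character ceiling can be illustrated with ISO 8859-1 (Latin-1), used here as an assumed example of a single-byte character set; the article does not single out any particular SBCS. The helper name `fits_in_one_byte` is hypothetical.

```python
def fits_in_one_byte(ch: str, codec: str) -> bool:
    """Return True if the codec can represent the character in a single byte."""
    try:
        return len(ch.encode(codec)) == 1
    except UnicodeEncodeError:
        # The character has no slot in this character set.
        return False

# One byte gives at most 2**8 distinct values.
print(2 ** 8)

# e with acute accent (U+00E9) is in Latin-1; the French ligature oe (U+0153) is not.
print(fits_in_one_byte("\u00e9", "latin-1"))
print(fits_in_one_byte("\u0153", "latin-1"))
```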
However, East Asian writing systems needed a way to store over 10,000 characters, and so double-byte character sets (DBCS) were developed to provide enough space for the thousands of ideographic characters in East Asian writing systems. Here, the encoding is still byte-based, but each two bytes together represent a single character.
Even in East Asia, text contains letters from small alphabets like Latin or Katakana. These are represented more efficiently with single bytes. Multi-byte character sets (MBCS) provide for this by using a variable number of bytes per character, which distinguishes them from the DBCS encodings.
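The variable width of a multi-byte character set can be observed with Shift_JIS, which Python ships as the `shift_jis` codec. Picking Shift_JIS as the MBCS example is an assumption; the article does not name an encoding. The helper name `byte_widths` and the sample text are also illustrative.

```python
def byte_widths(text: str, codec: str = "shift_jis") -> list:
    """Return (character, byte count) pairs for each character in the text."""
    return [(ch, len(ch.encode(codec))) for ch in text]

# Latin letter, half-width Katakana (U+FF71), and a Kanji ideograph (U+6F22).
sample = "A\uff71\u6f22"

for ch, width in byte_widths(sample):
    print(f"U+{ord(ch):04X} takes {width} byte(s)")

# Total size is smaller than a fixed two bytes per character would give.
print(len(sample.encode("shift_jis")), "bytes for", len(sample), "characters")
```

A pure DBCS encoding would spend two bytes on every one of these characters; the mixed-width encoding spends one byte on the Latin letter and on the half-width Katakana, and two only on the ideograph.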
The CCSID for bit data is 65535. Data tagged with this CCSID is treated as binary and is not converted between character sets.

Reposted from: https://www.cnblogs.com/pegasus923/archive/2011/10/27/2227150.html
