Programming Documents: Recipe 1.14. Handling International Encodings

Wednesday, October 28, 2009

Recipe 1.14. Handling International Encodings

Problem

You need to handle strings that contain non-ASCII characters: probably
Unicode characters encoded in UTF-8.

Solution

To use Unicode in Ruby 1.8, add the following to the beginning of your code:


	$KCODE='u'
	require 'jcode'

You can also invoke the Ruby interpreter with arguments that do the same thing:


	$ ruby -Ku -rjcode

If you use a Unix environment, you can add the arguments to the shebang line of your Ruby application:


	#!/usr/bin/ruby -Ku -rjcode

The jcode library overrides a number of String methods, such as String#chop, String#delete, String#squeeze, String#succ, and String#tr, and makes them capable of handling multibyte text. The exceptions are String#length, String#count, and String#size, which are not overridden. Instead, jcode defines three new methods: String#jlength, String#jcount, and String#jsize.
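As a rough illustration of what "capable of handling multibyte text" means for a method like String#chop, here is a minimal sketch. It assumes Ruby 1.9 or later, where strings carry their own encoding and jcode is no longer shipped, so it is an analogy to the jcode behavior rather than the jcode code path itself. The helper names chop_last_char and chop_last_byte are hypothetical.

```ruby
# Sketch for Ruby 1.9 or later (an assumption; the recipe itself targets jcode).
# Helper names are hypothetical.

# Remove the last character, treating the data as UTF-8.
def chop_last_char(str)
  str.dup.force_encoding(Encoding::UTF_8).chop
end

# Remove the last byte, treating the data as raw binary.
def chop_last_byte(str)
  str.b.chop
end

two_chars = "\xef\xbc\xa1\xef\xbc\xa2"    # six bytes, two characters
puts chop_last_char(two_chars).bytesize   # => 3 (one whole character left)
puts chop_last_byte(two_chars).bytesize   # => 5 (a broken character left)
```

A byte-oriented chop leaves a truncated, invalid sequence behind, which is the kind of damage a multibyte-aware method is meant to avoid.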

Discussion

Consider a UTF-8 string that encodes six Unicode characters, the fullwidth letters A through F: efbca1 (Ａ, U+FF21), efbca2 (Ｂ, U+FF22), and so on up to efbca6 (Ｆ, U+FF26):


	string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" +
	         "\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"

The string contains 18 bytes that encode 6 characters:


	string.size                                          # => 18
	string.jsize                                         # => 6
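The same byte-versus-character split can be checked without jcode. The following is a minimal sketch that assumes Ruby 1.9 or later, where String#bytesize counts bytes and String#length counts characters in the string's encoding. The helper name utf8_summary is hypothetical; it also lists the code points, so you can see which characters the bytes encode.

```ruby
# Sketch for Ruby 1.9 or later (an assumption; jcode is not used here).
# utf8_summary is a hypothetical helper name.
def utf8_summary(str)
  # Tag a copy of the data as UTF-8 so that character-level methods apply.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  {
    bytes: utf8.bytesize,
    chars: utf8.length,
    code_points: utf8.codepoints.map { |cp| format("U+%04X", cp) }
  }
end

string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3" \
         "\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"
summary = utf8_summary(string)
puts summary[:bytes]               # => 18
puts summary[:chars]               # => 6
puts summary[:code_points].first   # => U+FF21
```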

String#count is a method that takes a string of bytes and counts how many times those bytes occur in the string. String#jcount takes a string of characters and counts how many times those characters occur in the string:


	string.count "\xef\xbc\xa2"                          # => 13
	string.jcount "\xef\xbc\xa2"                         # => 1

String#count treats "\xef\xbc\xa2" as three separate bytes, and counts the number of times each of those bytes shows up in the string. String#jcount treats the same string as a single character, and looks for that character in the string, finding it only once.


	"\xef\xbc\xa2".length                                # => 3
	"\xef\xbc\xa2".jlength                               # => 1

Apart from these differences, Ruby handles most Unicode behind the scenes. Once you have your data in UTF-8 format, you really don't have to worry. Given that Ruby's creator Yukihiro Matsumoto is Japanese, it is no wonder that Ruby handles Unicode so elegantly.
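The difference between String#count and String#jcount described above can be reproduced without jcode. This is a minimal sketch that assumes Ruby 2.0 or later (for String#b); the helper names byte_set_count and char_set_count are hypothetical. Counting over binary copies gives the byte-oriented result, and counting over UTF-8 copies gives the character-oriented one.

```ruby
# Sketch for Ruby 2.0 or later (an assumption; jcode is not used here).
# Helper names are hypothetical.

# Count how many bytes of haystack belong to the set of bytes in needle.
def byte_set_count(haystack, needle)
  # String#b returns a copy tagged as binary, so count works byte by byte.
  haystack.b.count(needle.b)
end

# Count how many characters of haystack belong to the set of characters in needle.
def char_set_count(haystack, needle)
  utf8_haystack = haystack.dup.force_encoding(Encoding::UTF_8)
  utf8_needle   = needle.dup.force_encoding(Encoding::UTF_8)
  utf8_haystack.count(utf8_needle)
end

string = "\xef\xbc\xa1\xef\xbc\xa2\xef\xbc\xa3" \
         "\xef\xbc\xa4\xef\xbc\xa5\xef\xbc\xa6"
puts byte_set_count(string, "\xef\xbc\xa2")   # => 13
puts char_set_count(string, "\xef\xbc\xa2")   # => 1
```

The byte-oriented total is 13 because \xef and \xbc each occur six times and \xa2 occurs once.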
