Wednesday, October 28, 2009

Recipe 1.14. Handling International Encodings















Problem


You need to handle strings that contain non-ASCII characters: probably
Unicode characters encoded in UTF-8.




Solution


To use Unicode in Ruby 1.8, add the following to the beginning of your code:



$KCODE='u'
require 'jcode'



You can also invoke the Ruby interpreter with arguments that do the same thing:



$ ruby -Ku -rjcode



If you use a Unix environment, you can add the arguments to the shebang line of your Ruby application:



#!/usr/bin/ruby -Ku -rjcode



The jcode library overrides most of the methods of String and makes them capable of handling multibyte text. The exceptions are String#length, String#count, and String#size, which are not overridden. Instead, jcode defines three new methods: String#jlength, String#jcount, and String#jsize.




Discussion


Consider a UTF-8 string that encodes six Unicode characters, the fullwidth Latin capital letters: efbca1 (Ａ, U+FF21), efbca2 (Ｂ, U+FF22), and so on up to efbca6 (Ｆ, U+FF26):



string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" +
"\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"
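To see where those bytes come from, here is a minimal sketch that builds the same string from code points instead of from hand-written byte escapes. It assumes the six characters are the fullwidth letters U+FF21 through U+FF26; the variable names are only illustrative. It relies on Array#pack and String#unpack alone, so it does not need jcode.

```ruby
# Code points U+FF21..U+FF26: fullwidth Latin capital letters A through F
# (an assumption about which characters the byte sequences stand for).
code_points = (0xFF21..0xFF26).to_a

# The 'U' directive of Array#pack encodes each code point as UTF-8.
string = code_points.pack('U*')

# Dump the raw bytes as hex; each character contributes three bytes.
hex = string.unpack('C*').map { |byte| '%02x' % byte }.join
puts hex   # => efbca1efbca2efbca3efbca4efbca5efbca6
```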



The string contains 18 bytes that encode 6 characters:



string.size # => 18
string.jsize # => 6
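The same two numbers can be reproduced without jcode. This is a rough sketch, not what jcode does internally: it unpacks the string once as raw bytes and once as UTF-8 code points, then compares the lengths.

```ruby
string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" +
         "\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"

# 'C*' unpacks every byte as an unsigned 8-bit integer.
byte_count = string.unpack('C*').length   # => 18

# 'U*' decodes the bytes as UTF-8 and yields one integer per character.
char_count = string.unpack('U*').length   # => 6

puts "#{byte_count} bytes, #{char_count} characters"
```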



String#count is a method that takes a string of bytes and counts how many times those bytes occur in the string. String#jcount takes a string of characters and counts how many times those characters occur in the string:



string.count "\xef\xbc\xa2" # => 13
string.jcount "\xef\xbc\xa2" # => 1



String#count treats "\xef\xbc\xa2" as three separate bytes, and counts the number of times each of those bytes shows up in the string. String#jcount treats the same string as a single character, and looks for that character in the string, finding it only once.



"\xef\xbc\xa2".length # => 3
"\xef\xbc\xa2".jlength # => 1
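The byte-wise and character-wise counts can also be worked out by hand, which shows where 13 and 1 come from. The sketch below is an illustration using String#unpack, not the actual implementation of String#count or String#jcount; the variable names are hypothetical.

```ruby
string = "\xef\xbc\xa1" + "\xef\xbc\xa2" + "\xef\xbc\xa3" +
         "\xef\xbc\xa4" + "\xef\xbc\xa5" + "\xef\xbc\xa6"
target = "\xef\xbc\xa2"

# Byte-wise: the target is the set of bytes 0xEF, 0xBC, and 0xA2.
# 0xEF and 0xBC each occur six times, 0xA2 once: 6 + 6 + 1 = 13.
wanted_bytes = target.unpack('C*')
byte_hits = string.unpack('C*').select { |b| wanted_bytes.include?(b) }.length

# Character-wise: the target is the single code point U+FF22,
# which occurs exactly once among the six characters.
wanted_chars = target.unpack('U*')
char_hits = string.unpack('U*').select { |c| wanted_chars.include?(c) }.length

puts byte_hits   # => 13
puts char_hits   # => 1
```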



Apart from these differences, Ruby handles most Unicode behind the scenes. Once you have your data in UTF-8 format, you really don't have to worry. Given that Ruby's creator Yukihiro Matsumoto is Japanese, it is no wonder that Ruby handles Unicode so elegantly.




See Also


  • If you have text in some other encoding and need to convert it to UTF-8, use the iconv library, as described in Recipe 11.2, "Extracting Data from a Document's Tree Structure"

  • There are several online search engines for Unicode characters; two good ones are at http://isthisthingon.org/unicode/ and http://www.fileformat.info/info/unicode/char/search.htm












