Thursday, November 12, 2009

Section 24.5.  Multilingual Information Retrieval









24.5. Multilingual Information Retrieval



Application developers who work on catalogs, digital libraries, and knowledge repositories are no doubt familiar with information retrieval, or IR. An IR application takes in user-supplied criteria and searches for the items or documents that best match the intent of the user. This is one of the major ways that IR differs from standard SQL queries, which either do or do not find matches for the query criteria. Good IR systems can help determine what documents are about, and return those documents that are most relevant to the search, even if they don't match the search exactly.


Perhaps the most challenging task in IR is to support indexing and querying in multiple languages. English, for example, is a single-byte language that uses whitespace to separate words. Retrieval of information is substantially different when working with Japanese, which is a multibyte character set that does not use whitespace as delimiters.


Oracle Text, an option available in both the Oracle Enterprise Edition and the Oracle Standard Edition, provides full-text IR capabilities. Because Oracle Text uses SQL for index creation, search, and maintenance operations, it works very well in PL/SQL-based applications.


Called ConText and interMedia in prior releases, Oracle Text
really came of age as an information retrieval solution with Oracle9i Database Release 1. With Oracle Text:


  • All NLS character sets are supported.

  • Searching across documents in Western languages, as well as in Korean, Japanese, and Traditional and Simplified Chinese, is possible.

  • Unique characteristics of each language are accommodated.

  • Searches are case-insensitive by default.

  • Cross-language search is supported.


Before a PL/SQL application can be written that searches a data source, Oracle Text indexes must be created. As part of the g11n schema, we create an Oracle Text index on the publication.short_description column. To support multiple languages, we have individual language preferences, as well as a MULTI_LEXER preference, that makes it possible to search across multiple languages with a single query.



Oracle Text Indexes


There are four index types
available with Oracle Text. The first, and most commonly used, is the CONTEXT index. The CONTEXT index can index any character or LOB column, including BFILEs. It uses a filter to extract text from different document types. The INSO filter that is shipped with the data server can filter more than 150 different document types, including Word documents, PDFs, and XML documents.


Once the text is extracted from the document, it is broken into tokens
, or individual terms and phrases, by a LEXER. Language-specific LEXERs are available where required. A MULTI_LEXER actually uses language-specific LEXERs (defined as SUB_LEXERs) to extract the tokens from a multilingual data source. The tokens are stored in Oracle Text index tables and are used during search operations to point to relevant documents. To see the tokens created in this chapter's examples, run the following in your iSQL*Plus session:



SELECT token_text
FROM dr$g11n_index$i;



The result contains English, German, and Japanese tokens.


The other three Oracle Text index types are the CTXCAT, CTXRULE, and CTXXPATH indexes. For additional information regarding their structure, check out the Oracle Text Application Developer's Guide, and the Oracle Text Reference available at http://otn.oracle.com.




We can use the TEXT_SEARCH_FUNC function
that is part of the g11n schema to test some of the multilingual features:



CREATE OR REPLACE FUNCTION text_search_func (v_keyword IN VARCHAR2)
RETURN sys_refcursor
IS
v_title sys_refcursor;
BEGIN
OPEN v_title
FOR
SELECT title, LANGUAGE, score (1)
FROM publication
WHERE contains (short_description, v_keyword, 1) > 0
ORDER BY score (1) DESC;
 
RETURN v_title;
END text_search_func;
/



A call to this function, passing "pl" as the keyword, yields the following result:



variable x refcursor;
call text_search_func('pl') into :x;
print x;



This returns the following result:



TITLE LANGUAGE SCORE(1)
---------------------------------------- -------- --------

JA 18
Oracle PL/SQL Programming, 3rd Edition EN 13
Oracle PL/SQL Programmierung, 2. Auflage DE 9



You find this reference in all three languages because "pl" is common among them. Note that we searched on a lowercase "pl," but the "PL" in the record is uppercase. Our search is case-insensitive by default even though no UPPER function was used.


It may be that some languages should be case-sensitive while others should not be. Language-specific case-sensitivity can be set as part of your language preference creation. Simply add a mixed_case attribute with a value of yes. The tokens will be created mixed case, just as they are stored in your document or column, but only for the language identified in that preference.


Oracle Database 10g makes multilingual IR easier with the introduction of the WORLD_LEXER
. Although it has fewer language-specific features than the MULTI_LEXER, it provides a nearly effortless method of implementation. Instead of relying on a language column, the WORLD_LEXER identifies the text as one of three types of languages based on the code point. The categories are as follows:


  • Asian (Japanese, Korean, Chinese)

  • Arabic

  • Other (any non-Arabic Western European, or English language)


The text is broken into tokens based on the category in which it falls. Both Arabic and other categories are separated by whitespace because they're easily divided into tokens. Asian characters are more complex because they aren't whitespace delimited so they are broken into overlapping tokens of two characters at a time. For example, the three-character string of is broken into two tokens, and .


Oracle Text provides additional features as well, depending on the language. For details including language-specific features and restrictions, see the Oracle Text documentation provided on the OTN web site.



24.5.1. IR and PL/SQL




I have designed and implemented some extremely large and complex record management systems and digital libraries. Based on my experiences, I have found that nothing beats PL/SQL for search and maintenance operations with Oracle Text. PL/SQL's tight integration to the database server, and its improved performance over the last few releases, makes stored PL/SQL program units the language of choice for these types of applications.


This is even more evident when working with multiple languages. The shared SQL and PL/SQL parser means that there is consistent handling of characters and character semantics, regardless of the language being indexed and searched.


One of the first projects most Oracle Text programmers undertake is to find a way to format strings for search. The following example creates a function that formats search strings for Oracle Text:



CREATE OR REPLACE FUNCTION format_string (p_search IN VARCHAR2)
RETURN VARCHAR2
AS
-- Define an associative array
TYPE token_table IS TABLE OF VARCHAR2 (500 CHAR)
INDEX BY PLS_INTEGER;
 
-- Define an associative array variable
v_token_array token_table;
v_temp_search_string VARCHAR2 (500 CHAR);
v_final_search_string VARCHAR2 (500 CHAR);
v_count PLS_INTEGER := 0;
v_token_count PLS_INTEGER := 0;
BEGIN
v_temp_search_string := TRIM (UPPER (p_search));
-- Find the max number of tokens
v_token_count :=
lengthc (v_temp_search_string)
- lengthc (REPLACE (v_temp_search_string, ' ', ''))
+ 1;
 
-- Populate the associative array
FOR y IN 1 .. v_token_count
LOOP
v_count := v_count + 1;
v_token_array (y) :=
regexp_substr (v_temp_search_string, '[^[:space:]]+', 1, v_count);
-- Handle reserved words
v_token_array (y) := TRIM (v_token_array (y));
 
IF v_token_array (y) IN ('ABOUT', 'WITHIN')
THEN
v_token_array (y) := '{' || v_token_array (y) || '}';
END IF;
END LOOP;
 
v_count := 0;
 
FOR y IN v_token_array.FIRST .. v_token_array.LAST
LOOP
v_count := v_count + 1;
 
-- First token processed
IF ( (v_token_array.LAST = v_count OR v_count = 1)
AND v_token_array (y) IN ('AND', '&', 'OR', '|')
)
THEN
v_final_search_string := v_final_search_string;
ELSIF (v_count <> 1)
THEN
-- Separate by a comma unless separator already present
IF v_token_array (y) IN ('AND', '&', 'OR', '|')
OR v_token_array (y - 1) IN ('AND', '&', 'OR', '|')
THEN
v_final_search_string :=
v_final_search_string || ' ' || v_token_array (y);
ELSE
v_final_search_string :=
v_final_search_string || ', ' || v_token_array (y);
END IF;
ELSE
v_final_search_string := v_token_array (y);
END IF;
END LOOP;
 
-- Escape special characters in the final string
v_final_search_string :=
TRIM (REPLACE (REPLACE (v_final_search_string,
'&',
' & '
),
';',
' ; '
)
);
RETURN (v_final_search_string);
END format_string;
/



This is designed to break terms, or tokens, from the string using the space between the characters. It uses character semantics for variable declarations, including the declaration of the associative array.


To test this with an English string, run this SELECT:



SELECT format_string('oracle PL/SQL') AS "Formatted String"
FROM dual;



This returns the following result:



Formatted String
-----------------
ORACLE, PL/SQL



The FORMAT_STRING function
separates terms with a comma by default, so an exact match is not required. A string of characters that is not whitespace-delimited will look exactly the way it was entered. The following example illustrates this using a mix of English and Japanese characters:



SELECT format_string('
') AS "Formatted String"
FROM dual;



Passing this mixed character string to the FORMAT_STRING function returns the following result:



Formatted String
-----------------

 



Where spaces delimit terms in the text, a comma is added regardless of the language.


The following CONTAINS search uses the FORMAT_STRING function:



SELECT score (1) "Rank", title
FROM publication
WHERE contains (short_description, format_string('
'), 1) > 0;



This returns the following:



Rank TITLE
------------ ------------
12 Oracle SQL*Plus




Using PL/SQL and Oracle Text, it is possible to index and perform full-text searches on data regardless of character set or language.










    No comments:

    Post a Comment