This post discusses how to handle traditional information retrieval in a cross-language setting, i.e. how to measure the similarity between the information need (the query) and the information (the indexed documents) across languages. Cross-language similarity estimation techniques can be broadly categorised into two groups, depending on the similarity space: i) vector space, and ii) latent space. The former compute the similarity in the original vector space, i.e. terms are matched across languages using external resources such as bilingual dictionaries, transliteration systems, wordnets, etc. The latter measure the similarity in a latent concept space. This post reviews the former (vector space based) approach; a future post will discuss latent space models for CLIR.
This post assumes that the reader has basic knowledge of vector space models such as the TF-IDF weighting scheme and cosine similarity, as well as an awareness of basic information retrieval (IR) modules like indexing and retrieval. If you are not sure about these, just go through a couple of pages at
this link. The focus here is on the cross-language part of the IR problem.
As briefly discussed, in the vector space approach the documents are compared in the original high-dimensional space. To handle cross-language similarity, either the query or the documents are translated into the language of comparison. This post considers the query-translation approach, where the query is translated into the language of comparison. How the cross-language variant of an IR system differs from its monolingual counterpart can be seen in the following figure.
As can be seen in the figure above, the CL variant has an additional module called the "Query Translation Module". This module may be supported by cross-language resources such as a bilingual dictionary, a transliteration system, or a complete machine-translation system like the Google Translate API. To support these systems, you may also require a morphological analyser and/or a stemmer. The query usually passes through the same term pipeline that the documents went through when they were indexed. More details on the term pipeline follow below.
1. Indexing
First of all, you want to index the source collection using a standard IR toolkit like Lucene, Terrier, Lemur, Xapian, etc. There are different ways to index the collection, with different indexing configurations. The most common choices (a toy sketch in code follows this list) are:
- whether or not to remove stop-words,
- whether or not to stem the terms, and if so, with which stemmer,
- what to consider as the unit of a term: single-word-gram, n-word-gram or n-character-gram.
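To make these choices concrete, here is a minimal, toolkit-independent sketch in Python of a term pipeline and a toy inverted index. The stop-word list and the crude suffix-stripping "stemmer" are placeholders for illustration only; in a real system you would use the analysers shipped with Lucene, Terrier, Lemur, etc.

```python
from collections import defaultdict

# Illustrative stop-word subset; a real list would be much larger.
STOPWORDS = {"a", "an", "the", "of", "is", "this"}

def crude_stem(term):
    # Placeholder for a proper stemmer (e.g. Porter for English).
    for suffix in ("ing", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def term_pipeline(text, remove_stopwords=True, stem=True):
    # The same pipeline must be applied to documents (at indexing time)
    # and to queries (at retrieval time).
    terms = text.lower().split()
    if remove_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if stem:
        terms = [crude_stem(t) for t in terms]
    return terms

def build_index(docs, **pipeline_options):
    # docs: {doc_id: text}; returns {term: {doc_id: term_frequency}}.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in term_pipeline(text, **pipeline_options):
            index[term][doc_id] += 1
    return index
```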
Hint for n-gram indexing
If your favourite IR library does not have an n-gram (word/character) tokeniser, then you can easily achieve this with a quick pre-processing script. For example, if you want to index 2-word-grams, join each pair of consecutive words into a single 2-word-gram term, connecting them with a delimiter (in this case, #) just for the sake of readability.
Input = This is an example of n-grams.
Processed Input = This#is is#an an#example example#of of#ngrams
Don't forget to tokenise the query in the same way!
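For instance, a minimal Python pre-processing sketch that reproduces the example above (the # delimiter is just the one used in the example; any symbol your tokeniser leaves intact will do):

```python
import re

def word_ngrams(text, n=2, delimiter="#"):
    # Strip punctuation and hyphens so that "n-grams." becomes "ngrams",
    # then join each window of n consecutive words with the delimiter.
    words = [re.sub(r"[^\w]", "", w) for w in text.split()]
    words = [w for w in words if w]
    return [delimiter.join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(" ".join(word_ngrams("This is an example of n-grams.")))
# -> This#is is#an an#example example#of of#ngrams
```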
2. Query Translation
To achieve this, you first have to decide which resources you want to use and whether they are available for the particular language pair. If you have the access and the luxury, you can get the query completely translated into the language of comparison by a full MT system like Google Translate, Bing Translator, Apertium, etc. This is usually costly (also monetarily). Alternatively, you can use cross-lingual resources and systems such as dictionaries (hand-crafted or statistically trained) and a transliteration system to normalise the language. A morphological analyser usually helps with dictionary look-up and transliteration. The easiest way to handle a dictionary in code is to load it into a HashMap or something similar for quick look-up. Sometimes these independent modules are connected sequentially, for example: if a term is not found in the dictionary, then transliterate it, and so on. But this completely depends on the application and your expectations.
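As a minimal sketch of that sequential dictionary-then-transliterate idea, the snippet below loads a bilingual dictionary into a Python dict (the analogue of a HashMap) and translates query terms with a fallback. The tab-separated file format and the identity `transliterate` placeholder are assumptions for illustration, not a fixed standard.

```python
def load_bilingual_dict(path):
    # Assumed format: one "source_term<TAB>target_term" entry per line.
    translations = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                # Keep only the first translation for simplicity;
                # a real system might keep all of them (query expansion).
                translations.setdefault(parts[0], parts[1])
    return translations

def transliterate(term):
    # Placeholder: plug in your transliteration system here.
    return term

def translate_query(query_terms, translations):
    translated = []
    for term in query_terms:
        if term in translations:       # 1. dictionary look-up
            translated.append(translations[term])
        else:                          # 2. fall back to transliteration
            translated.append(transliterate(term))
    return translated
```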
3. Retrieval (Ranking)
Finally, you are ready to measure the similarity in the language of comparison. Ranking models like TF-IDF (vector space), BM25 (probabilistic) and language models are the most popular ones. Usually all the toolkits provide the infrastructure to retrieve documents with their scores based on the model you select.
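Purely for illustration, here is a minimal TF-IDF scoring sketch over an inverted index of the shape used in the indexing sketch above ({term: {doc_id: term_frequency}}); in practice you would rely on the ranking models already implemented in your toolkit.

```python
import math
from collections import defaultdict

def score_query(query_terms, index, num_docs):
    # Accumulate a simple TF-IDF score per document for the (translated,
    # pipelined) query terms; documents sharing no term with the query
    # get no score at all.
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Highest-scoring documents first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```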
All these steps of an IR system with
Terrier are explained in the slides, and
the code is publicly available here.
I hope this post helps you start building a basic CLIR system using your favourite IR toolkit.