Monday, July 29, 2013

Cross-Language Information Retrieval - Resource based Approach

This post discusses how to handle traditional information retrieval in a cross-language setting. The main task in this setting is to measure the similarity between the information need (the query) and the information (the indexed documents) across languages. Cross-language similarity estimation techniques can broadly be categorised into two sets, depending on the similarity space: i) vector space, and ii) latent space. The former models calculate the similarity in the vector space, i.e. the terms are matched across languages using external resources such as bilingual dictionaries, transliteration systems, wordnets etc. The latter models measure the similarity in a latent concept space. This post reviews the former (vector space based) approach. A future post will discuss latent space models for CLIR.

This post assumes that the reader has basic knowledge of vector space based models, such as TF-IDF weighting schemes and cosine similarity. Familiarity with basic information retrieval (IR) modules like indexing and retrieval is also assumed. If you are not sure about these, just go through a couple of pages at this link. The focus here is on the cross-language part of the IR problem.

As briefly discussed, in the vector space the documents are compared in the original high-dimensional space. To handle cross-language similarity, either the query or the documents are translated to the language of comparison. This post considers the query-translation based approach, where the query is translated to the language of comparison. How the cross-language variant of an IR system differs from its monolingual counterpart can be seen in the following figure.


As can be seen in the figure above, there is an additional module called "Query Translation Module" in the CL variant. This module may be supported by cross-language resources like a bilingual dictionary, a transliteration system or a complete machine-translation system like the Google Translate API. To support these systems, you may also require a morphological analyser and/or a stemmer. Usually you pass the query through the same term-pipeline that was applied to the documents when they were indexed. More details on the term-pipeline are given below.

1. Indexing

 First of all you want to index the source collection using a standard IR toolkit like Lucene, Terrier, Lemur, Xapian etc. A collection can be indexed in different ways with different indexing configurations. The most common choices are
 -  whether to remove stop-words or not,
 -  whether to stem the terms or not and, if yes, with which stemmer,
 -  what should be considered a unit of terms: single-word-gram, n-word-gram or n-character-gram.

Hint for n-gram indexing
If your favourite IR library does not have an n-gram (word/character) tokeniser, then you can safely achieve this with a quick pre-processing script. For example, if you want to index 2-word-grams, join each pair of consecutive words into a single term, putting a delimiter (in this case, #) between the words just for the sake of readability.


Input = This is an example of n-grams.
Processed Input = This#is is#an an#example example#of of#n-grams

Don't forget to tokenise the query the same way!
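
A minimal Python sketch of such a pre-processing script is given here. The function names word_ngrams and char_ngrams are made up for illustration, and tokens are assumed to be separated by whitespace; punctuation handling is left to you.

```python
def word_ngrams(text, n=2, delimiter="#"):
    """Join every run of n consecutive whitespace-separated tokens with a delimiter."""
    tokens = text.split()
    # No n-gram can be formed when there are fewer than n tokens: the range is empty.
    return [delimiter.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


line = "This is an example of n-grams"
print(" ".join(word_ngrams(line, n=2)))
# prints: This#is is#an an#example example#of of#n-grams
```

Run the same function over every document before indexing and over every query before retrieval, so that both sides produce identical terms.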


2. Query Translation

To achieve this, you first have to decide which resource you want to use and check whether it is available for your particular language pair. If you have the access and the luxury, you can get your query completely translated to the language of comparison by a full MT system like Google Translate, Bing Translator, Apertium etc. This is usually costly (also monetarily). Alternatively, you can use cross-lingual resources and systems like dictionaries (hand-made or statistically trained) and transliteration systems to normalise the language. A morphological analyser usually helps with the dictionary look-up and with transliteration. The best way to load a dictionary in code is as a HashMap or something similar for quick look-up. Sometimes these independent modules are connected sequentially: for example, if the term is not found in the dictionary, then transliterate it, and so on. But this depends completely on the application and your expectations.
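
The sequential chain described above (dictionary first, transliteration as a fallback) can be sketched in Python roughly as follows. Everything here is an assumption made for illustration: the tab-separated dictionary format, the function names load_dictionary and translate_query, the toy Spanish-English entries, and the identity function standing in for a real transliteration system.

```python
def load_dictionary(lines):
    """Parse 'source<TAB>target1,target2' lines into a dict for quick look-up.

    The file format is hypothetical; adapt the parsing to your own resource.
    """
    dictionary = {}
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        source, targets = line.split("\t", 1)
        dictionary[source.lower()] = [t.strip() for t in targets.split(",") if t.strip()]
    return dictionary


def translate_query(query, dictionary, transliterate=None):
    """Translate term by term; fall back to transliteration, else keep the term."""
    translated = []
    for term in query.lower().split():
        if term in dictionary:
            translated.extend(dictionary[term])       # keep all candidate translations
        elif transliterate is not None:
            translated.append(transliterate(term))    # out-of-vocabulary term
        else:
            translated.append(term)
    return translated


toy_lines = ["casa\thouse,home", "roja\tred"]         # toy entries, not a real resource
dictionary = load_dictionary(toy_lines)
# The identity lambda is only a placeholder for a real transliteration system.
print(translate_query("casa roja Madrid", dictionary, transliterate=lambda term: term))
```

Keeping every candidate translation is only one possible choice; depending on the application you may prefer to keep just the most probable one.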

3. Retrieval (Ranking)

Finally you are ready to measure the similarity in the language of comparison. Ranking models like TF-IDF (vector space), BM25 (probabilistic) and language models are the most popular ones. Usually all the toolkits have the infrastructure to retrieve the documents together with their scores based on the model you selected.
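
To make the ranking step concrete, here is a small self-contained Python sketch of TF-IDF weighting with cosine similarity. It is not what any of the toolkits above do internally; the IDF formula (log(N/df) + 1) is just one common variant, and the function names and toy documents are invented for illustration.

```python
import math
from collections import Counter


def weight(tokens, idf):
    """TF-IDF vector (as a dict) for a token list; unknown terms are ignored."""
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in tf if term in idf}


def build_index(docs):
    """docs maps doc_id -> list of tokens. Returns (idf, tf-idf vectors)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))                 # document frequency per term
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = {doc_id: weight(tokens, idf) for doc_id, tokens in docs.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[term] for term, w in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, idf, vectors):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    query_vector = weight(query_tokens, idf)
    scores = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "red house on the hill".split(),
    "d2": "blue car on the road".split(),
}
idf, vectors = build_index(docs)
# The query is assumed to be already translated into the document language.
print(rank(["red", "house"], idf, vectors))
```

In a cross-language setup, the tokens passed to rank would be the output of the query translation step, processed with the same term-pipeline as the documents.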

All these steps of an IR system built with Terrier are explained in the slides, and the code is publicly available here.

I hope that with this post you can start building your basic CLIR system using one of your favourite IR toolkits.

Sunday, April 7, 2013

Mortality Vs. Saturation

Recently I read an article. Between the lines, it carried a very strong message about how short life is. The article was about finding balance in life and giving time to family and loved ones, because "having a promotion on the day of your break up is not enjoyable" and so on. It discouraged the idea of working stressfully on weekdays and giving time to the family *only* on weekends. It said very objectively that if you live, say, 50 more years from now, then there are just 2600 weekends left in your life. So find a balance and live each day fully. Okay, the message is a very ordinary one which we hear often. But I must say, I had never seen such a direct measurement of my life, nor visualised the end of life as a couple of thousand weekends away. This is directly related to mortality.

A few days later, a thought about the Indian Vedic Ashram System came to my mind, more specifically its third stage, "Vanaprasthashram". Most above-average-ambitious people want to achieve a goal in their life, or rather their professional life (which is more relevant here), and they think that after that they would like to settle down and take it easy. Ambition is a very good virtue and success is mostly driven by ambition. But ambition is quite dangerous as well. How? You will know in a minute. Now let's talk about over-ambition. The thing about over-ambitious people is that, most of the time, whatever they tried to achieve they did achieve sooner or later. And that is how they became over-ambitious from being ambitious. Nobody is born over-ambitious; it is a gradual transformation from being ambitious. Now here is the catch: for them, finding a new height in life is not difficult, and most of the time they know that they can reach it and how. This becomes a loop, and they find joy in it even though at some point their family stops enjoying it. Naturally, time flies, and after all, how long does it take for a weekend to pass when you are chasing a long-term goal (maybe one after another)? These people start noticing the harm in this when their body starts to show its age.

Now let me relate this to Saturation, that is, saturation from life. This is a feeling of relief in life, which in Hindi is called "Santripti".

For example,

Situation 1: You have just drunk two glasses of water and somebody tells you that there will be no water in the world for 2 hours.

Situation 2: You are damn thirsty and somebody tells you that there will be no water in the world for 2 hours.

Yes, it is as simple as that. Over-ambitious people mostly find themselves in Situation 2, whereas many other people just settle down and start enjoying life. Why do some people desire immortality so much? Because they are thirsty. They did not have enough water before the water was gone. If you have already had enough water, you really don't care about water, and that is the ideal situation. I think the worst case is "dying thirsty". I think the whole point of Vanaprasthashram is to make you drink enough water and reduce the probability of dying thirsty to almost zero. So, don't forget to drink enough water!

Wednesday, January 30, 2013

Installing the Printer with IP address in Fedora KDE

Whenever you move to a new place, it takes a while to get started at your workplace, especially if your favourite OS is Linux. It happens to me quite often that whenever I go to a new office, I need to set up the OS, printers and software, check out SVN repositories etc.

My favourite flavour is Fedora with KDE. So here are some simple tips if you are looking to set up a printer with Fedora KDE. The information you usually need is the IP address of the printer and its exact model number. The IT help-desk or personnel of most companies and departments have user guidelines for accessing the resources from a variety of OSes, but sometimes they are selfish enough to provide information just for Windows and/or Mac. Nonetheless, they can give you the IP address and the model number.

Go to "System Settings" and choose "Printers". You will see some options under Network Printers. In Fedora 18 there is a feature which automatically tries to find the printers connected to the network. At least for me it does not work properly, but you may give it a try in case it does the job for you. Otherwise, if you know that your printer protocol is "IPP" or "LPD/LPR", then choose that option and fill in the necessary fields: host -> IP address, and "queue" is usually the name/id of the printer on that host. If only the IP address is known, the safe option is to go for "AppSocket/HP JetDirect".

In case of "AppSocket/HP JetDirect",

  • Enter the IP address. Don't change the automatically filled port number unless you know exactly what it should be. Usually it is 9100.
  • Click on "Next".
  • Here you have to supply the proper driver for your printer. Most manufacturers and their printer model drivers are already available in the list. If yours is there, select it. For example, mine is HP LaserJet P4015dn. If your printer is an HP, go for the driver with CUPS. If it is not listed, try "yum install hplip", which should bring all the HP drivers into the system, and then follow the steps from the beginning.
  • Once the printer is set up, it is annoying that the system automatically disables/pauses/blocks it. So you have to check in every settings/configuration option that the printer is enabled/resumed/unblocked. Once this is done, try to print the test page.
  • With this it should be fine. Things change a little bit between Fedora versions like 16, 17, 18..., but if you keep working through the above points, in the end you will be able to print something.
Do let others know in the comments whether this worked or some problems are still there.