Today the results of PAN, the 4th International Competition on Plagiarism Detection (http://pan.webis.de), have been declared. We (my colleagues and I) at DA-IICT are regular participants in the competition. Many people ask me: why are you working on this? It is a solved problem, so what scope for research is there?
Well, there are still good challenges that need a good amount of research. A plagiariser may not write the document as a verbatim copy of some other text. He or she may copy the text with slight or heavy modifications, ranging from simple addition or deletion of words to more complex manual paraphrasing. With good translation facilities available, a plagiariser may also translate content from source documents. Another challenge is deciding what to treat as the source documents; in some cases you may have to take the whole web as the source collection. There are plenty of commercial plagiarism detectors available, but they work well only when the plagiarised passages are exact copies from the source documents.
Plagiarism detection has always been a computationally complex and resource-hungry task, because it involves sentence-to-sentence matching (please don't read "matching" as exact matching), and sometimes the documents being compared are several MB of raw text.
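To make the cost concrete, here is a minimal sketch of non-exact sentence-to-sentence matching. It is only an illustration, not the approach of any PAN participant: I am assuming word bigrams and Jaccard overlap as the similarity measure, and the function names and the 0.3 threshold are made up for the example. Note that every suspicious sentence is compared with every source sentence, which is where the cost comes from on large documents.

```python
import re
from itertools import product


def word_ngrams(sentence, n=2):
    """Return the set of lowercase word n-grams in a sentence."""
    words = re.findall(r"\w+", sentence.lower())
    if len(words) < n:
        # Very short sentence: treat all of its words as a single gram.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_similar_sentences(suspicious, source, threshold=0.3, n=2):
    """Compare every suspicious sentence with every source sentence.

    Returns (suspicious_sentence, source_sentence, score) tuples whose
    score reaches the threshold. The number of comparisons is
    len(suspicious sentences) * len(source sentences).
    """
    susp = [(s, word_ngrams(s, n)) for s in split_sentences(suspicious)]
    src = [(s, word_ngrams(s, n)) for s in split_sentences(source)]
    matches = []
    for (s1, g1), (s2, g2) in product(susp, src):
        score = jaccard(g1, g2)
        if score >= threshold:
            matches.append((s1, s2, score))
    return matches


if __name__ == "__main__":
    source = "The quick brown fox jumps over the lazy dog. Plagiarism detection is hard."
    suspicious = "The quick brown fox leaps over the lazy dog. I like tea."
    for susp_sent, src_sent, score in find_similar_sentences(suspicious, source):
        print(round(score, 2), "|", susp_sent, "|", src_sent)
```

In the small example, changing "jumps" to "leaps" is enough to defeat an exact string comparison, but the two sentences still share 6 of their 10 distinct bigrams. Real systems avoid the all-pairs loop with indexing or hashing, and this kind of word overlap does not help at all against translation.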
From last year's PAN proceedings, it can be seen that many people have tried very innovative ideas to handle the above challenges. But translation still seems to be a big challenge.
Translation reminds me of another big challenge. Machine translation technology for Indian languages is still in its infancy. You may have read the news that Google has extended its translation facility to 5 more Indian languages, but the quality of machine translation is still very bad compared to European languages. You can try it yourself by translating a complex or long English sentence into Hindi, for example.
One such interesting challenge is going to take place at FIRE at IIT Bombay. The task is named CL!TR, and its webpage is http://users.dsic.upv.es/grupos/nle/fire-workshop-clitr.html.