Author: K.V. Kurmanath
Publication: The Hindu
Date: May 18, 2009
URL: http://www.thehindubusinessline.com/bline/ew/2009/05/18/stories/2009051850070300.htm
Word for word: More on the Sampark initiative
to enrich translation.
Look at the sentence - The Chair chairs the
meeting. How will a machine understand this?
Telugus, Kannadigas and Malayalis can read
Subrahmanya Bharati, the legendary Tamil poet, and relish the sweetness in
his poetry. Similarly, Premchand, Tagore, M T Vasudeva Nair, and U R Ananthamurthy
too could be read and understood by readers in other languages.
All this will soon be a reality, thanks to
a project initiated by IIIT (Hyderabad) and eight other universities and institutes.
To be precise, the beta translation solutions of a few languages will go live
next month (June 2009).
The project, whose public Internet interface
will be known as Sampark, will let users translate texts among various Indian
languages. All one needs to do is copy-paste the text in an appointed box
and press 'enter', and get the translated version in another box beside it.
Not just text, you can translate the whole of a Web page. Copy the URL (a
site's Web address) and paste it in the relevant box in Sampark's Web site.
"You will get the translated page, with photos and other images intact,"
says Prof Rajeev Sangal, Director of IIIT (Hyderabad), who is leading the
team.
The nine institutions have roped in over 120
experts in computer engineering and language, as well as translators, to take up the
'machine translation' programme, which is aimed at breaking the language barrier.
The project is broadly divided into two areas:
translation of the four Southern languages into Hindi (and vice versa), and
translation of Bangla, Punjabi, Marathi and Urdu into Hindi (and back). Simultaneously,
the consortium is working on direct translation between Telugu and Tamil, and between Malayalam and Tamil.
To begin with, the consortium has put beta versions of two 'systems', Punjabi-Hindi and Urdu-Hindi,
live. "By June 2009 end, we will be adding Tamil-Hindi,
Marathi-Hindi and Telugu-Hindi to the project," Prof Sangal says.
How it works
Broadly, the machine translation happens in
three phases - the source side, transfer aspects and the target side action.
The two important factors in translation are grammar and dictionary. "Languages
have many exceptions and idiosyncrasies. These will be addressed effectively,"
Prof Sangal says.
On the source side (the text you want to translate),
the machine analyses the text sentence by sentence and keeps a representation
of the text. The analysis includes morphological analysis, that is, how words are
formed. It checks whether the text carries any local phrases. It
identifies nouns and other parts of speech before going on to sentence analysis.
In the second phase (transfer phase), the
machine does lexical and grammar transfer. "The grammars of source and
target languages may not be similar. This phase would see change of grammatical
structure. The later phase would involve target language generation."
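The three phases described above can be pictured with a toy sketch. It is only an illustration under stated assumptions, not the consortium's code: the function names, the tiny lexicon, and the romanised Hindi entries are all hypothetical. English is used as the source here simply because its subject-verb-object order differs from Hindi's subject-object-verb order, which makes the change of grammatical structure easy to see.

```python
# Toy sketch of the analyse -> transfer -> generate flow.
# Every name and dictionary entry here is hypothetical, not Sampark's.
from dataclasses import dataclass


@dataclass
class Token:
    surface: str  # the word as it appears in the source text
    root: str     # root form found by (toy) morphological analysis
    pos: str      # part-of-speech tag


# Assumed toy resources: a source lexicon and a bilingual dictionary.
SOURCE_LEXICON = {
    "ram": ("ram", "NOUN"),
    "eats": ("eat", "VERB"),
    "fruit": ("fruit", "NOUN"),
}
BILINGUAL_DICT = {"ram": "raam", "eat": "khaataa hai", "fruit": "phal"}


def analyse(sentence):
    """Source side: find the root and part of speech of each word."""
    tokens = []
    for word in sentence.lower().split():
        root, pos = SOURCE_LEXICON.get(word, (word, "UNK"))
        tokens.append(Token(word, root, pos))
    return tokens


def transfer(tokens):
    """Transfer phase: swap words, then adjust the word order."""
    # Lexical transfer: replace each root with its target-language entry.
    out = [Token(t.surface, BILINGUAL_DICT.get(t.root, t.root), t.pos)
           for t in tokens]
    # Grammar transfer: subject-verb-object becomes subject-object-verb.
    if [t.pos for t in out] == ["NOUN", "VERB", "NOUN"]:
        subject, verb, obj = out
        out = [subject, obj, verb]
    return out


def generate(tokens):
    """Target side: produce the output sentence."""
    return " ".join(t.root for t in tokens)


def translate(sentence):
    return generate(transfer(analyse(sentence)))


print(translate("Ram eats fruit"))  # raam phal khaataa hai
```

A real system would handle far more than one sentence pattern; the sketch only shows how each phase hands its result to the next.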
Common architecture
The step-by-step process is done on a common
architecture. This allows for addition of a new language to the project quite
easily. "If you want to add Kashmiri, you need to develop an analyser,
generator and add a Kashmiri-Hindi dictionary. These, in fact, are parallel
dictionaries," Prof Sangal says.
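One way to picture why a common architecture makes a new language cheap to add is a small registry in which each language supplies only its own analyser, generator and dictionary. The sketch below is an assumption-laden illustration: the class `TranslationHub`, its methods, the language code "new" and the placeholder words `x_water` and `x_house` are all invented here, and the analyser and generator are trivial stand-ins.

```python
# Hypothetical sketch of a pluggable translation architecture.
# None of these names come from the Sampark project.
from typing import Callable, Dict, List, Tuple

Analyser = Callable[[str], List[str]]
Generator = Callable[[List[str]], str]


class TranslationHub:
    def __init__(self) -> None:
        self.analysers: Dict[str, Analyser] = {}
        self.generators: Dict[str, Generator] = {}
        self.dictionaries: Dict[Tuple[str, str], Dict[str, str]] = {}

    def add_language(self, lang: str, analyser: Analyser,
                     generator: Generator) -> None:
        """Register the two language-specific modules."""
        self.analysers[lang] = analyser
        self.generators[lang] = generator

    def add_dictionary(self, source: str, target: str,
                       entries: Dict[str, str]) -> None:
        """Store a parallel dictionary, usable in both directions."""
        self.dictionaries[(source, target)] = dict(entries)
        self.dictionaries[(target, source)] = {v: k for k, v in entries.items()}

    def translate(self, text: str, source: str, target: str) -> str:
        roots = self.analysers[source](text)
        table = self.dictionaries[(source, target)]
        mapped = [table.get(root, root) for root in roots]
        return self.generators[target](mapped)


def simple_analyser(text: str) -> List[str]:
    # Stand-in for a real analyser: just lower-case and split.
    return text.lower().split()


def simple_generator(words: List[str]) -> str:
    # Stand-in for a real generator: just join the words.
    return " ".join(words)


hub = TranslationHub()
hub.add_language("hin", simple_analyser, simple_generator)

# Adding a new language touches nothing that already exists.
hub.add_language("new", simple_analyser, simple_generator)
hub.add_dictionary("new", "hin", {"x_water": "paanii", "x_house": "ghar"})

print(hub.translate("x_water x_house", "new", "hin"))  # paanii ghar
print(hub.translate("paanii ghar", "hin", "new"))      # x_water x_house
```

The shared `translate` method never changes when a language is added, which is the point of the design.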
"The project, unlike earlier projects,
hinges on dictionaries that give meanings based on concepts rather than just
meanings," Prof Uma Maheshwar Rao, who is working on the Telugu-Hindi
aspect of the project, says.
Formed by the Union Ministry of Information
Technology in 2006, the consortium comprises IITs (Kharagpur and Bombay),
Anna University, C-DAC, University of Hyderabad, Tamil University, Jadavpur
University, IISc (Bangalore) and IIIT (Allahabad).
Prof Rao, who works at the Centre for Applied
Linguistics and Translation Studies at University of Hyderabad, says the Sampark
project is more advanced than earlier attempts that sought to offer translation
solutions.
The earlier efforts failed to take the meanings
of the words contextually. Citing the example of the word 'bank', he points
out that the earlier efforts would not make out whether it was a bank used
in the expression river bank, or a bank that deals with money.
"In the present project, we cross-link
words with all the synonyms in the other language. This will help resolve
the ambiguity problem, the knottiest one in the translation process,"
he explains.
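The 'bank' problem can be made concrete with a small sketch. Note that it swaps in a simpler technique than the one Prof Rao describes: instead of cross-linked synonym sets, it counts how many hand-picked cue words from each sense appear in the sentence. The sense inventory, cue words and function name are assumptions made for illustration; the Hindi words are romanised.

```python
# Hypothetical sketch: choose a sense of an ambiguous word by
# counting context cue words. This is a simplified stand-in, not
# the cross-linked dictionary described in the article.
SENSES = {
    "bank": [
        {
            "concept": "river_edge",
            "hindi": "kinaaraa",
            "cues": {"river", "water", "boat", "shore"},
        },
        {
            "concept": "financial_institution",
            "hindi": "baink",
            "cues": {"money", "loan", "account", "deposit"},
        },
    ]
}


def pick_sense(word, sentence):
    """Return the sense whose cue words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", " ").split())
    candidates = SENSES[word]
    # On a tie, max() keeps the first sense listed.
    return max(candidates, key=lambda sense: len(sense["cues"] & context))


print(pick_sense("bank", "He sat on the river bank")["hindi"])
# kinaaraa
print(pick_sense("bank", "She put her money in the bank")["hindi"])
# baink
```

With no cue words in the sentence the sketch simply falls back to the first sense, which is exactly the kind of blind guess that richer dictionaries are meant to avoid.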
The immediate task of the consortium is to
add more servers and more engineering to make the machine faster.
"We are going to add three languages
to the system every two months till November," Prof Sangal says.
He admits, however, that it is not a complete
translator. But the beta versions will definitely give a flavour of the meaning
in the source language, and users can expect to see constant improvements, he adds.
Machines learn!
Prof Sangal says the machine can learn based
on the data you give it. Look at the sentence - The Chair chairs the meeting.
How will a machine understand this sentence? The one developed by the consortium,
thanks to the conceptual dictionary, would look at the context and tell apart
the meaning of the two chairs in the sentence. "Earlier, we used to give
rules to the machine to follow. Now, we have algorithms to let the machines
learn from this. We have combined artificial intelligence approach with the
linguistic process," he explains.
More to come
Busy finalising modules, the team members
continue to set their eyes on long-term goals. "We will continue the
long-term research independently and collaboratively. The next stage is to
build more robust sentence analysers. They will be able to do translations
more correctly. The quality of the output will go up," he says. Prof
Sangal, who has been working on machine translation for the last 25 years,
says it is teamwork that helped the group give shape to the machine.
"We discussed several issues physically and through mailing groups. We
have set up sub groups to address specific issues."
English to Indian languages
Simultaneously, a different consortium, in
which IIIT-H is also a member, is working on translations from English to
several Indian languages and back. C-DAC (Pune) is leading the consortium.
The researchers take a different approach.
Contrary to popular belief, English is a difficult
language for the machine to understand. "Unlike Indian languages, there
is a high degree of ambiguity. When a machine analyses, it has to do disambiguation,
which is a difficult process," Prof Sangal says. The research team is
almost ready with the English-Hindi version, which is in test mode. At a later
stage, these two different projects could technically work in tandem and offer
users a better translation experience.
- kurmanath@thehindu.co.in