DARPA grant exploring auto-translation of Chinese

University receives $2 million grant to work on data prep

Nianwen Xue

As the United States and China move forward in both collaboration and competition, the ability to communicate becomes ever more critical. While emerging technologies such as Google Translate have shown promise, much work remains to improve the language translation applications that America will need as one of its most important 21st-century relationships develops.

To move the technology forward, the Defense Advanced Research Projects Agency (DARPA), through its Broad Operational Language Translation (BOLT) program, has awarded a $13.7 million grant titled “Linguistic Resources for Multilingual, Genre-Independent Language Technologies” to the Linguistic Data Consortium at the University of Pennsylvania to develop linguistic resources. Brandeis has been given $2 million of that amount as a collaborator. Nianwen Xue, assistant professor of linguistics in the Language and Linguistics Program and the Department of Computer Science at Brandeis, is the principal investigator on the four-year project. Xue’s team is focused on data preparation.

“The goal is to develop the technologies and in order to get those technologies we have to create resources annotated with linguistic structures,” says Xue. “We will then use machine-learning technologies to produce annotations automatically.”

Collaborators and other participants in the BOLT Program will use the products of the Brandeis team to develop their own systems.

Xue has been involved in this line of research for years, beginning with Translingual Information Detection, Extraction and Summarization (TIDES) and continuing with Global Autonomous Language Exploitation (GALE), a five-year program that he worked on while at the University of Colorado. When GALE grant funding ended last year, DARPA continued the work with BOLT.

“When you translate one language into another, there are many challenges,” Xue says. “Chinese words, for example, have no spaces between them. One of the first things that must be done is enter spacing that indicate word boundaries, a process called ‘word segmentation.’”
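
As a rough illustration of the word segmentation step Xue describes, the sketch below uses forward maximum matching, a simple dictionary-based baseline. It is not the method used by the Brandeis team, which relies on annotated data and machine learning; the function name, the `max_len` parameter and the tiny lexicon are all hypothetical and chosen only for the example.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest lexicon word.

    `lexicon` is a set of known words (a toy assumption here); any character
    that starts no known word is emitted on its own.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon: "we", "like", "machine", "translation", "machine translation".
TOY_LEXICON = {"我们", "喜欢", "机器", "翻译", "机器翻译"}

# Insert spaces at the word boundaries the segmenter finds.
print(" ".join(segment("我们喜欢机器翻译", TOY_LEXICON)))
```

A greedy matcher like this cannot resolve genuinely ambiguous character sequences, which is one reason hand-annotated data and learned models are needed.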

Each word must also be assigned a category, such as verb or noun. Next, sentence structure must be recognized: for example, words that belong together are grouped into a phrase. Phrases are then combined to form sentences.
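
The category and phrase steps described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the tag table, the tag names and the rule that a run of determiners, adjectives and nouns forms a noun phrase are all hypothetical, and an English sentence is used for readability. Real annotation projects use much richer tag sets and full syntactic trees.

```python
# Hypothetical word-to-category table; a real tagger is learned from annotated text.
TOY_TAGS = {
    "the": "DET",
    "team": "NOUN",
    "annotates": "VERB",
    "chinese": "ADJ",
    "text": "NOUN",
}

NOUN_PHRASE_TAGS = {"DET", "ADJ", "NOUN"}


def tag(words):
    """Assign each word a category by table lookup; unknown words get 'UNK'."""
    return [(w, TOY_TAGS.get(w.lower(), "UNK")) for w in words]


def chunk_noun_phrases(tagged):
    """Group each maximal run of DET/ADJ/NOUN words into one noun phrase (NP)."""
    phrases, current = [], []
    for word, pos in tagged:
        if pos in NOUN_PHRASE_TAGS:
            current.append(word)
            continue
        if current:
            # A non-NP word closes the phrase collected so far.
            phrases.append(("NP", current))
            current = []
        phrases.append((pos, [word]))
    if current:
        phrases.append(("NP", current))
    return phrases


tagged = tag("The team annotates Chinese text".split())
for label, words in chunk_noun_phrases(tagged):
    print(label, " ".join(words))
```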

With structure identified on the Chinese side and on the English side, the remaining challenge is to learn the mapping between the two, which is what machine translation systems do. But Xue quickly points out that one cannot translate Chinese into English word for word, for a number of reasons. Among them: Chinese has no determiners and no “morphological inflections” to mark tense (such as present and past) or number (such as singular and plural).

According to DARPA, BOLT is part of a broader effort to provide language translation in support of defense and national security requirements, ranging from phrase translation to scanning and translation of large data sets.

“Security needs around the world dictate that the United States has access to reliable information that could impact national security or deployed military personnel,” said DARPA in a press release. “Given the vast amount of information in multiple languages and formats, it can be difficult to analyze and determine what’s important. Additionally, there’s a need to be able to readily communicate with local populations of foreign countries and non-English-speaking allies.”

According to DARPA’s website, Joe Olive, a former program manager at the agency, says DARPA achieved significant success with GALE, which developed software capable of translating formal Arabic more accurately on first pass than human translators. Now, with BOLT, the agency is working to translate Mandarin Chinese and multiple dialects of Arabic into English from all types of media, with a particular focus on the challenging task of handling informal conversational speech, email text and instant messaging.

“BOLT also aims to allow users to conduct English-language queries that retrieve targeted information from multi-lingual sources,” Olive said in the release. “BOLT would give users the ability to conduct robust searches that yield the most relevant results.”

