Scientific report: A Wiki-style Platform for Creation of Parallel Data

Document information:

In this demo, we present a wiki-style platform – WikiBABEL – that enables easy collaborative creation of multilingual content in many non-English Wikipedias, by leveraging the relatively larger and more stable content in the English Wikipedia. The platform provides an intuitive user interface that maintains the user focus on the multilingual Wikipedia content creation, by engaging search tools for easy discoverability of related English source material, and a set of linguistic and collaborative tools to make the content translation simple. ...

Content extracted from the document:
WikiBABEL: A Wiki-style Platform for Creation of Parallel Data

A Kumaran† K Saravanan† Naren Datha* B Ashok* Vikram Dendi‡

† Multilingual Systems Research, Microsoft Research India
* Advanced Development & Prototyping, Microsoft Research India
‡ Machine Translation Incubation, Microsoft Research

Abstract

In this demo, we present a wiki-style platform – WikiBABEL – that enables easy collaborative creation of multilingual content in many non-English Wikipedias, by leveraging the relatively larger and more stable content in the English Wikipedia. The platform provides an intuitive user interface that maintains the user focus on the multilingual Wikipedia content creation, by engaging search tools for easy discoverability of related English source material, and a set of linguistic and collaborative tools to make the content translation simple. We present two different usage scenarios and discuss our experience in testing them with real users. Such an integrated content creation platform in Wikipedia may yield, as a by-product, parallel corpora that are critical for research in statistical machine translation systems in many languages of the world.

1 Introduction

Parallel corpora are critical for research in many natural language processing systems, especially Statistical Machine Translation (SMT) and Crosslingual Information Retrieval (CLIR) systems, as the state-of-the-art systems are based on statistical learning principles; a typical SMT system for a pair of languages requires large parallel corpora, on the order of a few million parallel sentences. Parallel corpora are traditionally created by professionals (in most cases, for business or governmental needs) and are available only in a few languages of the world. The prohibitive cost associated with creating new parallel data has meant that SMT research was restricted to only a handful of languages of the ...

... Wikipedia. WikiBABEL leverages two significant facts with respect to Wikipedia data: First, there is a large skew between the content of English and non-English Wikipedias. Second, while the original content creation requires subject matter experts, subsequent translations may be effectively created by people who are fluent in English and the target language. In general, we do expect the large English Wikipedia to provide source material for multilingual Wikipedias; however, on specific topics a specific multilingual Wikipedia may provide the source material (http://ja.wikipedia.org/wiki/俳句 may be better than http://en.wikipedia.org/wiki/haiku). We leverage these facts in the WikiBABEL framework, enabling a community of interested native speakers of a language to create content in their respective language Wikipedias. We make such content creation easy by integrating linguistic tools and resources for translation, and a collaborative mechanism for storing and sharing knowledge among the users. Such a methodology is expected to generate comparable data (similar, but not the same content), from which parallel data may be mined subsequently (Munteanu et al, 2005) (Quirk et al, 2007).

We present here the WikiBABEL platform, and trace its evolution through two distinct usage versions: first, as a standalone deployment providing a community of users a translation platform on hosted Wikipedia data to generate parallel corpora, and second, as a transparent edit layer on top of Wikipedias to generate comparable corpora. Both paradigms were used for user testing, to gauge the usability of the tool and the viability of the approach for content creation in ...