Scientific report: A Wiki-style Platform for Creation of Parallel Data
Pages: 4
File type: pdf
Size: 771.70 KB
Document information:
In this demo, we present a wiki-style platform – WikiBABEL – that enables easy collaborative creation of multilingual content in many non-English Wikipedias, by leveraging the relatively larger and more stable content in the English Wikipedia. The platform provides an intuitive user interface that maintains the user focus on the multilingual Wikipedia content creation, by engaging search tools for easy discoverability of related English source material, and a set of linguistic and collaborative tools to make the content translation simple. ...
Content extracted from the document:
WikiBABEL: A Wiki-style Platform for Creation of Parallel Data

A Kumaran†, K Saravanan†, Naren Datha*, B Ashok*, Vikram Dendi‡
† Multilingual Systems Research, Microsoft Research India
* Advanced Development & Prototyping, Microsoft Research India
‡ Machine Translation Incubation, Microsoft Research

Abstract

In this demo, we present a wiki-style platform – WikiBABEL – that enables easy collaborative creation of multilingual content in many non-English Wikipedias, by leveraging the relatively larger and more stable content in the English Wikipedia. The platform provides an intuitive user interface that maintains the user focus on the multilingual Wikipedia content creation, by engaging search tools for easy discoverability of related English source material, and a set of linguistic and collaborative tools to make the content translation simple. We present two different usage scenarios and discuss our experience in testing them with real users. Such integrated content creation platform in Wikipedia may yield as a by-product, parallel corpora that are critical for research in statistical machine translation systems in many languages of the world.

1 Introduction

Parallel corpora are critical for research in many natural language processing systems, especially, the Statistical Machine Translation (SMT) and Crosslingual Information Retrieval (CLIR) systems, as the state-of-the-art systems are based on statistical learning principles; a typical SMT system in a pair of language requires large parallel corpora, in the order of a few million parallel sentences. Parallel corpora are traditionally created by professionals (in most cases, for business or governmental needs) and are available only in a few languages of the world. The prohibitive cost associated with creating new parallel data implied that the SMT research was restricted to only a handful of languages of the ...

... Wikipedia. WikiBABEL leverages two significant facts with respect to Wikipedia data: First, there is a large skew between the content of English and non-English Wikipedias. Second, while the original content creation requires subject matter experts, subsequent translations may be effectively created by people who are fluent in English and the target language. In general, we do expect the large English Wikipedia to provide source material for multilingual Wikipedias; however on specific topics specific multilingual Wikipedia may provide the source material (http://ja.wikipedia.org/wiki/俳句 may be better than http://en.wikipedia.org/wiki/haiku). We leverage these facts in the WikiBABEL framework, enabling a community of interested native speakers of a language, to create content in their respective language Wikipedias. We make such content creation easy by integrating linguistic tools and resources for translation, and collaborative mechanism for storing and sharing knowledge among the users. Such methodology is expected to generate comparable data (similar, but not the same content), from which parallel data may be mined subsequently (Munteanu et al, 2005) (Quirk et al, 2007).

We present here the WikiBABEL platform, and trace its evolution through two distinct usage versions: First, as a standalone deployment providing a community of users a translation platform on hosted Wikipedia data to generate parallel corpora, and second, as a transparent edit layer on top of Wikipedias to generate comparable corpora. Both paradigms were used for user testing, to gauge the usability of the tool and the viability of the approach for content creation in ...