Summary of Computer science master thesis: Enhancing the quality of machine translation system using cross lingual word embedding models

Document information:

This thesis proposes two models that use cross-lingual word embeddings to address two impediments in machine translation: the difficulty of obtaining a rich phrase-table for less-common, low-resource languages, and the handling of unknown words. The first model enhances the quality of the phrase-table in SMT, and the second tackles the unknown-word problem in NMT.
Content extracted from the document:
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN MINH THUAN

ENHANCING THE QUALITY OF MACHINE TRANSLATION SYSTEM USING CROSS-LINGUAL WORD EMBEDDING MODELS

Major: Computer Science
Code: 8480101.01

SUMMARY OF COMPUTER SCIENCE MASTER THESIS

SUPERVISOR: Associate Professor Nguyen Phuong Thai

Publication: Minh-Thuan Nguyen, Van-Tan Bui, Huy-Hien Vu, Phuong-Thai Nguyen, Chi-Mai Luong, Enhancing the quality of Phrase-table in Statistical Machine Translation for Less-Common and Low-Resource Languages, in the 2018 International Conference on Asian Language Processing (IALP 2018).

Hanoi, 10/2018

Chapter 1: Introduction

This chapter introduces the motivation of the thesis, related work, and our proposed models.

Machine translation (MT) systems are now successful in practice, and the two most widely used approaches are phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT). In PBSMT, a good phrase-table can improve translation quality. However, obtaining a rich phrase-table is a challenge, because the phrase-table is extracted and trained from large bilingual corpora, which require much effort and financial support to build, especially for less-common languages such as Vietnamese and Lao. In NMT, to reduce computational complexity, conventional systems often limit their vocabularies to the top 30K-80K most frequent words in the source and target languages, and all words outside the vocabulary, called unknown words, are replaced with a single unk symbol. As a result, the system cannot generate proper translations for these unknown words during testing.

Recently, several approaches have been proposed to address these impediments.
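As a rough illustration of the NMT vocabulary limit described above, the following minimal Python sketch keeps only the most frequent words of a corpus and maps every other word to a single unk symbol. The function names, the `<unk>` spelling, the toy corpus, and the tiny vocabulary size are all assumptions made for illustration; they are not taken from the thesis, where the vocabulary holds 30K-80K words.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown-word symbol


def build_vocab(sentences, max_size):
    """Return the set of the max_size most frequent whitespace tokens."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(max_size)}


def replace_unknown(sentence, vocab):
    """Replace every token outside the vocabulary with the unk symbol."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())


# Toy corpus (hypothetical); a real system would use a large training corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]
vocab = build_vocab(corpus, max_size=5)
print(replace_unknown("the cat sat on the sofa", vocab))
```

Once a word such as "sofa" has been mapped to the unk symbol, its identity is lost to the model, which is why the system cannot produce a proper translation for it.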
Among these approaches, techniques using word embeddings have received much interest from the natural language processing community. A word embedding is a vector representation of a word that preserves semantic information about the word and its context words. Additionally, embeddings can represent words in diverse, distinct spaces. Cross-lingual word embedding models are also receiving a lot of interest; they learn cross-lingual representations of words in a joint embedding space in order to represent meaning and transfer knowledge in cross-lingual scenarios. Inspired by the advantages of cross-lingual embedding models, we propose two models: one enhances the quality of a phrase-table by recomputing the phrase weights and generating new phrase pairs, and the other addresses the unknown-word problem in NMT by replacing unknown words with the most appropriate in-vocabulary words.

The rest of this thesis is organized as follows. Chapter 2 gives an overview of the related background. Chapter 3 describes our two proposed models: one enhances the quality of the phrase-table in SMT, and the other tackles the unknown-word problem in NMT. Chapter 4 presents the settings and results of our experiments. Chapter 5 gives our conclusions and future work.

Chapter 2: Literature review

2.1 Machine Translation

This section covers the history, approaches, evaluation, and open-source software in MT.

2.1.1 History

In the mid-1930s, Georges Artsrouni attempted to build “translation machines” by using paper tape to create an automatic dictionary. After that, Peter Troyanskii proposed a model including a bilingual dictionary and a method for handling grammatical issues between languages based on the grammatical system of Esperanto. During the 2000s, research in MT saw major changes. Much research focused on example-based machine translation and statistical machine translation (SMT).
In addition, researchers showed growing interest in hybridization, incorporating morphological and syntactic knowledge into statistical systems and combining statistics with existing rule-based systems. Recently, the dominant trend in MT has been the use of large artificial neural networks, called Neural Machine Translation (NMT). In 2014, (Cho et al., 2014) published the first paper on using neural networks in MT, followed by a lot of research in the following few years.

2.1.2 Approaches

This section outlines the typical approaches to MT, which are based on linguistic rules, statistics, and neural networks: Rule-based Machine Translation (RBMT), Statistical Machine Translation (SMT), Example-based Machine Translation (EBMT), and Neural Machine Translation (NMT). ...