Scientific report: Data Cleaning for Word Alignment

Document information:

Parallel corpora are made by human beings. However, because an MT system is an aggregation of state-of-the-art NLP technologies that runs without any human intervention, it is unavoidable that quite a few sentence pairs are beyond its analysis and will therefore not contribute to the system. Furthermore, such pairs may in turn work against our objectives and make the overall performance worse. Possible unfavorable items are n : m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner under the...
Content extracted from the document:
Data Cleaning for Word Alignment

Tsuyoshi Okita
CNGL / School of Computing
Dublin City University, Glasnevin, Dublin 9
tokita@computing.dcu.ie

Abstract

Parallel corpora are made by human beings. However, because an MT system is an aggregation of state-of-the-art NLP technologies that runs without any human intervention, it is unavoidable that quite a few sentence pairs are beyond its analysis and will therefore not contribute to the system. Furthermore, such pairs may in turn work against our objectives and make the overall performance worse. Possible unfavorable items are n : m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner, under the assumption that their frequency is low, such as below 5 percent. We show an improvement in BLEU score from 28.0 to 31.4 in English-Spanish and from 16.9 to 22.1 in German-English.

1 Introduction

Phrase alignment (Marcu and Wong, 02) has recently attracted researchers for its theory, although its practice remains in its infancy. However, a phrase extraction heuristic such as grow-diag-final (Koehn et al., 05; Och and Ney, 03) is nowadays considered to be a key process which leads to an overall improvement of MT systems. This heuristic is the single difference between word-based SMT (Brown et al., 93) and phrase-based SMT (Koehn et al., 03), where we construct word-based SMT by bi-directional word alignment. Technically, however, this phrase extraction process after word alignment is known to have at least two limitations: 1) the objective of uni-directional word alignment is limited to 1 : n mappings, and 2) the atomic unit of a phrase pair used by phrase extraction is thus basically restricted to 1 : n or n : 1, with small exceptions.

Firstly, the posterior-based approach (Liang, 06) looks at the posterior probability and partially delays the alignment decision. However, this approach does not extend beyond the 1 : n uni-directional mappings in its word alignment. Secondly, the aforementioned phrase alignment (Marcu and Wong, 02) considers the n : m mapping directly, as generated bilingually by some concepts without word alignment. However, this approach has severe computational complexity problems. Thirdly, linguistically motivated approaches, such as a tree aligner (Tinsley et al., 06), provide n : m mappings using some information from parsing results. However, as that approach runs in a somewhat reverse direction to ours, we omit it from the discussion. Hence, this paper seeks methods that differ from those approaches and whose computational cost is low.

The n : m mappings in our discussion include paraphrases (Callison-Burch, 07; Lin and Pantel, 01), non-literal translations (Imamura et al., 03), multiword expressions (Lambert and Banchs, 05), and some other noise on one side of a translation pair (from now on, we call these ‘outliers’, meaning that they are not systematic noise). One common characteristic of these n : m mappings is that they tend to be so flexible that even an exhaustive list compiled by human beings tends to be incomplete (Lin and Pantel, 01). There are two cases which we would like to distinguish: when we use external resources and when we do not. For example, Quirk et al. employ external resources by drawing pairs of English sentences from a comparable corpus (Quirk et al., 04), while Bannard and Callison-Burch (Bannard and Callison-Burch, 05) identified English paraphrases by pivoting through phrases in another language. However, in this paper our interest is rather the case where our resources are limited to our parallel corpus. ...
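The limitation on uni-directional word alignment described in the introduction can be illustrated with a minimal Python sketch. It is not taken from the paper: the toy alignments and the names `src_to_tgt`, `tgt_to_src`, and `max_fan_out` are hypothetical, and only the plain intersection and union of the two directions are computed, not the grow-diag-final heuristic itself. The sketch assumes that a uni-directional aligner links every word on one side to exactly one word on the other, so each direction taken alone has a fan-out of one on one side.

```python
from collections import Counter

# Toy, made-up alignments for a 4-word source and a 4-word target sentence
# (hypothetical data, not taken from the paper).
# Source-to-target direction: every source word picks exactly one target word.
src_to_tgt = {0: 0, 1: 1, 2: 1, 3: 3}   # source index -> target index
# Target-to-source direction: every target word picks exactly one source word.
tgt_to_src = {0: 0, 1: 1, 2: 3, 3: 3}   # target index -> source index

# Express both directions as sets of (source, target) links.
forward = {(s, t) for s, t in src_to_tgt.items()}
backward = {(s, t) for t, s in tgt_to_src.items()}


def max_fan_out(links):
    """Return (max links per source word, max links per target word)."""
    per_src = Counter(s for s, _ in links)
    per_tgt = Counter(t for _, t in links)
    return max(per_src.values()), max(per_tgt.values())


intersection = forward & backward   # links found in both directions
union = forward | backward          # links found in either direction

# Print each link set together with its fan-out on the two sides.
for name, links in [("forward", forward), ("backward", backward),
                    ("intersection", intersection), ("union", union)]:
    print(name, sorted(links), max_fan_out(links))
```

In this toy case each single direction keeps a fan-out of one on one side, the intersection is strictly 1 : 1, and only the union lets a word on either side carry more than one link. Symmetrization heuristics such as grow-diag-final work between these two extremes.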