Scientific report: Data Cleaning for Word Alignment
Pages: 9
Document information:
Parallel corpora are made by human beings. However, as an MT system is an aggregation of state-of-the-art NLP technologies without any intervention of human beings, it is unavoidable that quite a few sentence pairs are beyond its analysis and that will therefore not contribute to the system. Furthermore, they in turn may act against our objectives to make the overall performance worse. Possible unfavorable items are n : m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner under the...
Content extracted from the document:
Data Cleaning for Word Alignment

Tsuyoshi Okita
CNGL / School of Computing
Dublin City University, Glasnevin, Dublin 9
tokita@computing.dcu.ie

Abstract

Parallel corpora are made by human beings. However, as an MT system is an aggregation of state-of-the-art NLP technologies without any intervention of human beings, it is unavoidable that quite a few sentence pairs are beyond its analysis and that will therefore not contribute to the system. Furthermore, they in turn may act against our objectives to make the overall performance worse. Possible unfavorable items are n : m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner under the assumption that their frequency is low, such as below 5 percent. We show an improvement of Bleu score from 28.0 to 31.4 in English-Spanish and from 16.9 to 22.1 in German-English.

1 Introduction

Phrase alignment (Marcu and Wong, 02) has recently attracted researchers in its theory, although it remains in infancy in its practice. However, a phrase extraction heuristic such as grow-diag-final (Koehn et al., 05; Och and Ney, 03), which is a single difference between word-based SMT (Brown et al., 93) and phrase-based SMT (Koehn et al., 03) where we construct word-based SMT by bi-directional word alignment, is nowadays considered to be a key process which leads to an overall improvement of MT systems. However, technically, this phrase extraction process after word alignment is known to have at least two limitations: 1) the objectives of uni-directional word alignment is limited only in 1 : n mappings and 2) an atomic unit of phrase pair used by phrase extraction is thus basically restricted in 1 : n or n : 1 with small exceptions.

Firstly, the posterior-based approach (Liang, 06) looks at the posterior probability and partially delays the alignment decision. However, this approach does not have any extension in its 1 : n uni-directional mappings in its word alignment. Secondly, the aforementioned phrase alignment (Marcu and Wong, 02) considers the n : m mapping directly bilingually generated by some concepts without word alignment. However, this approach has severe computational complexity problems. Thirdly, linguistic motivated phrases, such as a tree aligner (Tinsley et al., 06), provides n : m mappings using some information of parsing results. However, as the approach runs somewhat in a reverse direction to ours, we omit it from the discussion. Hence, this paper will seek for the methods that are different from those approaches and whose computational cost is cheap.

n : m mappings in our discussion include paraphrases (Callison-Burch, 07; Lin and Pantel, 01), non-literal translations (Imamura et al., 03), multiword expressions (Lambert and Banchs, 05), and some other noise in one side of a translation pair (from now on, we call these ‘outliers’, meaning that these are not systematic noise). One common characteristic of these n : m mappings is that they tend to be so flexible that even an exhaustive list by human beings tends to be incomplete (Lin and Pantel, 01). There are two cases which we should like to distinguish: when we use external resources and when we do not. For example, Quirk et al. employ external resources by drawing pairs of English sentences from a comparable corpus (Quirk et al., 04), while Bannard and Callison-Burch (Bannard and Callison-Burch, 05) identified English paraphrases by pivoting through phrases in another language. However, in this paper our interest is rather the case when our resources are limited within our parallel corpus. ...
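The two limitations named in the introduction can be made concrete with a small sketch. It is only an illustration under stated assumptions: the toy sentence pair, the link sets, and the helper names `is_one_to_n`, `flip`, and `symmetrize` are hypothetical and do not come from the paper. The intersection and union computed here are only the starting points that symmetrization heuristics such as grow-diag-final work from, not the full heuristic.

```python
from collections import Counter

# Toy data (assumed for illustration): a non-literal translation pair.
src = ["he", "kicked", "the", "bucket"]
tgt = ["murió"]

def is_one_to_n(links):
    """True if every target index has at most one source index, i.e. the
    (src_idx, tgt_idx) links can be written as a function target -> source."""
    per_target = Counter(t for _, t in links)
    return all(count <= 1 for count in per_target.values())

def flip(links):
    """Swap the roles of source and target in a set of links."""
    return {(t, s) for s, t in links}

def symmetrize(forward, backward):
    """Return (intersection, union) of two uni-directional link sets."""
    return forward & backward, forward | backward

# Direction 1: each target word picks one source word (1 : n).
forward = {(1, 0)}
# Direction 2: each source word picks one target word (n : 1).
backward = {(0, 0), (1, 0), (2, 0), (3, 0)}

# A hypothetical 3 x 3 phrase block: every word of a three-word source
# phrase linked to every word of a three-word target phrase (n : m).
block = {(s, t) for s in range(3) for t in range(3)}

intersection, union = symmetrize(forward, backward)

print(is_one_to_n(forward))         # expressible as 1 : n
print(is_one_to_n(flip(backward)))  # expressible as n : 1
print(is_one_to_n(block), is_one_to_n(flip(block)))  # neither direction fits
print(sorted(intersection), sorted(union))
```

In this toy case each uni-directional link set fits one of the two restricted shapes, while the 3 x 3 block fits neither, which is the kind of n : m object the paper proposes to detect before word alignment.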