Scientific report: Demonstration of Joshua: An Open Source Toolkit for Parsing-based Machine Translation

Document information:

We describe Joshua (Li et al., 2009a), an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for translation via synchronous context free grammars (SCFGs): chart-parsing, n-gram language model integration, beam- and cube-pruning, and k-best extraction. The toolkit also implements suffix-array grammar extraction and minimum error rate training. It uses parallel and distributed computing techniques for scalability.
Content extracted from the document:
Demonstration of Joshua: An Open Source Toolkit for Parsing-based Machine Translation

Zhifei Li, Chris Callison-Burch, Chris Dyer†, Juri Ganitkevitch+, Sanjeev Khudanpur, Lane Schwartz, Wren N. G. Thornton, Jonathan Weese, and Omar F. Zaidan

Center for Language and Speech Processing, Johns Hopkins University
† Computational Linguistics and Information Processing Lab, University of Maryland
+ Human Language Technology and Pattern Recognition Group, RWTH Aachen University
Natural Language Processing Lab, University of Minnesota

Abstract

We describe Joshua (Li et al., 2009a), an open source toolkit for statistical machine translation. Joshua implements all of the algorithms required for translation via synchronous context free grammars (SCFGs): chart-parsing, n-gram language model integration, beam- and cube-pruning, and k-best extraction. The toolkit also implements suffix-array grammar extraction and minimum error rate training. It uses parallel and distributed computing techniques for scalability. We also provide a demonstration outline for illustrating the toolkit's features to potential users, whether they be newcomers to the field or power users interested in extending the toolkit.

1 Introduction

Large scale parsing-based statistical machine translation (e.g., Chiang (2007), Quirk et al. (2005), Galley et al. (2006), and Liu et al. (2006)) has made remarkable progress in the last few years. However, most of the systems mentioned above employ tailor-made, dedicated software that is not open source. This results in a high barrier to entry for other researchers, and makes experiments difficult to duplicate and compare. In this paper, we describe Joshua, a Java-based general-purpose open source toolkit for parsing-based machine translation, serving the same role as Moses (Koehn et al., 2007) does for regular phrase-based machine translat ...

2 Joshua Toolkit

When designing our toolkit, we applied general principles of software engineering to achieve three major goals: Extensibility, end-to-end coherence, and scalability.

Extensibility: Joshua's codebase consists of a separate Java package for each major aspect of functionality. This way, researchers can focus on a single package of their choosing. Furthermore, extensible components are defined by Java interfaces to minimize unintended interactions and unseen dependencies, a common hindrance to extensibility in large projects. Where there is a clear point of departure for research, a basic implementation of each interface is provided as an abstract class to minimize the work necessary for extensions.

End-to-end Cohesion: An MT pipeline consists of many diverse components, often designed by separate groups that have different file formats and interaction requirements. This leads to a large number of scripts for format conversion and to facilitate interaction between the components, resulting in untenable and non-portable projects, and hindering repeatability of experiments. Joshua, on the other hand, integrates the critical components of an MT pipeline seamlessly. Still, each component can be used as a stand-alone tool that does not rely on the rest of the toolkit.

Scalability: Joshua, especially the decoder, is scalable to large models and data sets. For example, the parsing and pruning algorithms are implemented with dynamic programming strategies ...
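
The design described under Extensibility (a component defined by a Java interface, with a basic implementation supplied as an abstract class) can be illustrated with a minimal sketch. Every name below (ExtensibilityDemo, FeatureFunction, AbstractFeatureFunction, WordPenalty) is hypothetical and chosen for illustration only; none is taken from Joshua's actual codebase, and the scoring rule is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: all type and method names here are hypothetical,
// not Joshua's real API.
class ExtensibilityDemo {

    /** The extensible component is defined by an interface. */
    interface FeatureFunction {
        String name();
        double weight();
        /** Unweighted cost of a candidate translation's target-side words. */
        double cost(List<String> targetWords);
        /** Weighted contribution of this feature to the model score. */
        double score(List<String> targetWords);
    }

    /**
     * Basic implementation provided as an abstract class: it handles the
     * bookkeeping (name, weight, weighting) so an extension only supplies cost().
     */
    static abstract class AbstractFeatureFunction implements FeatureFunction {
        private final String name;
        private final double weight;

        protected AbstractFeatureFunction(String name, double weight) {
            this.name = name;
            this.weight = weight;
        }

        @Override public String name() { return name; }
        @Override public double weight() { return weight; }

        @Override public final double score(List<String> targetWords) {
            return weight * cost(targetWords);
        }
    }

    /** A small extension: charges one unit of cost per target word. */
    static final class WordPenalty extends AbstractFeatureFunction {
        WordPenalty(double weight) { super("word-penalty", weight); }

        @Override public double cost(List<String> targetWords) {
            return targetWords.size();
        }
    }

    /** Client code depends only on the interface, never on a concrete class. */
    static double totalScore(List<FeatureFunction> features, List<String> words) {
        double total = 0.0;
        for (FeatureFunction f : features) {
            total += f.score(words);
        }
        return total;
    }

    public static void main(String[] args) {
        List<FeatureFunction> features = new ArrayList<>();
        features.add(new WordPenalty(-0.5));
        List<String> words = Arrays.asList("the", "house", "is", "small");
        System.out.println(totalScore(features, words)); // prints -2.0
    }
}
```

In this sketch, a new feature requires only a subclass that overrides cost(); callers such as totalScore see nothing but the interface, which is one way to limit the unintended interactions the paragraph mentions.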