Scientific report: Development of a Stemming Algorithm

[Mechanical Translation and Computational Linguistics, vol. 11, nos. 1 and 2, March and June 1968]

Development of a Stemming Algorithm*

by Julie Beth Lovins,† Electronic Systems Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139

A stemming algorithm, a procedure to reduce all words with the same stem to a common form, is useful in many areas of computational linguistics and information-retrieval work. While the form of the algorithm varies with its application, certain linguistic problems are common to any stemming procedure. As a basis for evaluation of previous attempts to deal with these problems, this paper first discusses the theoretical and practical attributes of stemming algorithms. Then a new version of a context-sensitive, longest-match stemming algorithm for English is proposed; though developed for use in a library information transfer system, it is of general application. A major linguistic problem in stemming, variation in spelling of stems, is discussed in some detail and several feasible programmed solutions are outlined, along with sample results of one of these methods.

I. Introduction

A stemming algorithm is a computational procedure which reduces all words with the same root (or, if prefixes are left untouched, the same stem) to a common form, usually by stripping each word of its derivational and inflectional suffixes. Researchers in many areas of computational linguistics and information retrieval find this a desirable step, but for varying reasons. In automated morphological analysis, the root of a word may be of less immediate interest than its suffixes, which can be used as clues to grammatical structure. (See, e.g., Earl [2, 3] and Resnikoff and Dolby [6]. This field has also been reported on by S. Silver and M. Lott, Machine Translation Project, University of California, Berkeley [personal communication].) At the other extreme, what suffixes are found may be subsidiary to the problem of removing them consistently enough to obtain sets of exactly matching stems. Word-frequency counts using stems, for stylistic (as described by S. Y. Sedelow [personal communication]) or mathematical analysis of a body of language, often require matched stems. (So does stemming as part of an information-retrieval system, the specific application which motivated this paper.) But certain linguistic problems are common to any ... variety of applications are considered in evaluating the theoretical and practical attributes of several previous algorithms.

As a major part of its information transfer experiments, Project Intrex [5] is developing an integrated retrieval system in which a library user, through a remote computer terminal, can first obtain extensive information from a central digital store about documents that are available on a specific subject, and then obtain the full text of the documents. A prototype retrieval system is being assembled in order to permit experimentation with its various components. The experimental system will use a specially compiled augmented library catalogue containing information on approximately 10,000 documents in the field of materials science and engineering, including not only author, title, and other basic data about each document but also an abstract, bibliography, and a list of subject terms indicating the content of the document. Each subject term is a phrase of one or more English words.

A stemming algorithm will be used to maximize the usefulness of the subject terms. In many cases, the information which is semantically significant to the user of the system is contained in the stems of the lexical words in the subject terms, and suffix ...
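As an illustration of the general idea described in the abstract and introduction (stripping the longest matching suffix so that related words reduce to a common form, then counting word frequencies by stem), the following is a minimal sketch in Python. It is not the algorithm proposed in the paper: the suffix list `SUFFIXES`, the minimum stem length `MIN_STEM_LENGTH`, and the function names `stem` and `stem_counts` are all assumptions made for this example, and the only "context" condition applied is the minimum stem length.

```python
from collections import Counter

# Hypothetical toy suffix list for illustration only; this is NOT the
# list of endings used in the paper.
SUFFIXES = ["ational", "ations", "ation", "ings", "ing", "ness",
            "ers", "er", "ed", "es", "s"]

# Assumed rule: never strip a suffix if fewer than this many letters remain.
MIN_STEM_LENGTH = 2


def stem(word: str) -> str:
    """Strip the longest suffix from SUFFIXES that leaves a long enough stem."""
    word = word.lower()
    # Longest-match: try candidate suffixes from longest to shortest.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word  # no suffix applies; the word is returned unchanged


def stem_counts(words):
    """Count word frequencies by stem rather than by surface form."""
    return Counter(stem(w) for w in words)


if __name__ == "__main__":
    for w in ["computational", "computers", "computing", "stems", "stemming"]:
        print(w, "->", stem(w))
    print(stem_counts(["computing", "computed", "stems"]))
```

With this toy list, "computational", "computers", and "computing" all reduce to "comput", but "stems" yields "stem" while "stemming" yields "stemm". That mismatch is a small instance of the variation in spelling of stems that the abstract names as a major linguistic problem; the sketch makes no attempt to resolve it.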