An Encyclopaedia of Language_07

336 LANGUAGE AND COMPUTATION

... of observations or experimental subjects in which the members are more like each other than they are like members of other clusters. In some types of cluster analysis, a tree-like representation shows how tighter clusters combine to form looser aggregates, until at the topmost level all the observations belong to a single cluster. A further useful technique is multidimensional scaling, which aims to produce a pictorial representation of the relationships implicit in the (dis)similarity matrix. In factor analysis, a large number of variables can be reduced to just a few composite variables or ‘factors’. Discussion of various types of multivariate analysis, together with accounts of linguistic studies involving the use of such techniques, can be found in Woods et al. (1986). The rather complex mathematics required by multivariate analysis means that such work is heavily dependent on the computer.

A number of package programs are available for statistical analysis. Of these, almost certainly the most widely used is SPSS (Statistical Package for the Social Sciences), an extremely comprehensive suite of programs available, in various forms, for both mainframe and personal computers. An introductory guide to the system can be found in Norušis (1982), and a description of a version for the IBM PC in Frude (1987). The package will produce graphical representations of frequency distributions (the number of cases with particular values of certain variables), and a wide range of descriptive statistics. It will cross-tabulate data according to the values of particular variables, and perform chi-square tests of independence or association. A range of other non-parametric and parametric tests can also be requested, and multivariate analyses can be performed. Another statistical package which is useful for linguists is MINITAB (Ryan, Joiner and Ryan 1976).
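
As a rough illustration of the cross-tabulation and chi-square test of independence that such packages perform, the following minimal Python sketch computes the statistic by hand. It is not SPSS or MINITAB output: the function name `chi_square_independence` and the counts are invented for illustration, and no continuity correction is applied.

```python
from typing import List, Tuple


def chi_square_independence(table: List[List[float]]) -> Tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented cross-tabulation: rows are two text samples,
# columns are counts of active and passive clauses.
observed = [[30, 10],
            [20, 40]]
statistic, dof = chi_square_independence(observed)
print(f"chi-square = {statistic:.3f}, degrees of freedom = {dof}")
```

The resulting statistic would then be compared with the tabulated critical value for the given degrees of freedom, which is the step a statistical package carries out automatically.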
Although not as comprehensive as SPSS, MINITAB is rather easier to use, and the most recent version offers a range of basic statistical facilities which is likely to meet the requirements of much linguistic research. Examples of SPSS and MINITAB analyses of linguistic data can be found in Butler (1985b:155–65), and MINITAB examples also in Woods et al. (1986:309–13). Specific packages for multivariate analysis, such as MDS(X) and CLUSTAN, are also available.

3. THE COMPUTATIONAL ANALYSIS OF NATURAL LANGUAGE: METHODS AND PROBLEMS

3.1 The textual material

Text for analysis by the computer may be of various kinds, according to the application concerned. For an artificial intelligence researcher building a system which will allow users to interrogate a database, the text for analysis will consist only of questions typed in by the user. Stylisticians and lexicographers, however, may wish to analyse large bodies of literary or non-literary text, and those involved in machine translation are often concerned with the processing of scientific, legal or other technical material, again often in large quantities. For these and other applications the problem of getting large amounts of text into a form suitable for computational analysis is a very real one.

As was pointed out in section 1.1, most textual materials have been prepared for automatic analysis by typing them in at a keyboard linked to a VDU. It is advisable to include as much information as is practically possible when encoding texts: arbitrary symbols can be used to indicate, for example, various functions of capitalisation, changes of typeface and layout, and foreign words. To facilitate retrieval of locational information during later processing, references to important units (pages, chapters, acts and scenes of a play, and so on) should be included. Many word processing programs now allow the direct entry of characters with accents and other diacritics, in languages such as French or Italian.
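
The value of embedding references to acts, scenes and similar units can be seen in a small Python sketch. The marker convention used here (`<A n>` for an act, `<S n>` for a scene) and the sample lines are hypothetical, chosen only for illustration; real encoding schemes differ in detail.

```python
import re
from collections import defaultdict

# Hypothetical marker convention: <A n> starts an act, <S n> starts a scene.
MARKER = re.compile(r"<([AS])\s+(\d+)>")
WORD = re.compile(r"[A-Za-z']+")


def index_words(lines):
    """Map each lower-cased word to the (act, scene, line) places where it occurs."""
    location = {"A": 0, "S": 0}
    line_no = 0
    index = defaultdict(list)
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        marker = MARKER.fullmatch(text)
        if marker:
            kind, number = marker.group(1), int(marker.group(2))
            location[kind] = number
            if kind == "A":
                location["S"] = 0  # a new act restarts scene numbering
            line_no = 0            # line numbers restart at every marker
            continue
        line_no += 1
        for word in WORD.findall(text):
            index[word.lower()].append((location["A"], location["S"], line_no))
    return index


# Invented sample text with embedded reference markers.
sample = [
    "<A 1>",
    "<S 1>",
    "The night is cold",
    "and the watch is long",
    "<S 2>",
    "The king is come",
]
index = index_words(sample)
print(index["king"])
```

Because the markers travel with the text, any later program (a concordance or word index, for instance) can report where each word occurs without consulting the printed edition.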
Languages written in non-Roman scripts may need to be transliterated before coding. Increasingly, use is being made of OCR machines such as the KDEM (see section 1.1), which will incorporate markers for font changes, though text references must be edited in during or after the input phase.

Archives of textual materials are kept at various centres, and many of the texts can be made available to researchers at minimal cost. A number of important corpora of English texts have been assembled: the Brown Corpus (Kucera and Francis 1967) consists of approximately 1 million words of written American English made up of 500 text samples from a wide range of material published in 1961; the Lancaster-Oslo-Bergen (LOB) Corpus (see e.g. Johansson 1980) was designed as a British English near-equivalent of the Brown Corpus, again consisting of 500 2000-word tex ...