Summary of Mathematics doctoral thesis: Bioinformatics XML documents index method based on R-Tree method

Document information:

The thesis studies an indexing method based on the R-tree to make XPath queries on XML data more efficient. The method works on intermediate data in which tags are converted into numerical coordinates: the structured XML text is transformed into numeric data that can be represented in a 2-dimensional space (and extended to more dimensions).
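As a rough illustration of how tags can become 2-dimensional numeric data, the sketch below uses the well-known interval (start, end) numbering of an XML tree. This is an assumption made for illustration only; the summary does not state the thesis's exact coordinate scheme. The function names `assign_coordinates` and `descendants` and the sample document are hypothetical. Under this encoding every tag is a point (start, end), and the descendants of a tag are exactly the points inside a rectangle, which is the kind of range query an R-tree is built to answer. The linear scan here stands in for the R-tree lookup.

```python
import xml.etree.ElementTree as ET


def assign_coordinates(root):
    """Give every element a (tag, start, end) triple from a depth-first counter.

    Assumed scheme (interval numbering), not necessarily the thesis's own.
    """
    coords = []
    counter = 0

    def visit(elem):
        nonlocal counter
        counter += 1
        start = counter          # number assigned when the tag opens
        for child in elem:
            visit(child)
        counter += 1
        end = counter            # number assigned when the tag closes
        coords.append((elem.tag, start, end))

    visit(root)
    return coords


def descendants(coords, ancestor):
    """Return the points that fall strictly inside the ancestor's interval.

    In 2-D terms this is a rectangle query: a_start < start and end < a_end.
    A spatial index such as an R-tree would replace this linear scan.
    """
    _, a_start, a_end = ancestor
    return [c for c in coords if a_start < c[1] and c[2] < a_end]


# Hypothetical sample document.
doc = ET.fromstring("<species><pig><breed/><gene/></pig><mouse/></species>")
coords = assign_coordinates(doc)
pig = next(c for c in coords if c[0] == "pig")
result = sorted(c[0] for c in descendants(coords, pig))
```

The recursive traversal keeps the sketch short; a very deep document would need an iterative traversal instead.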
Content extracted from the document:
MINISTRY OF EDUCATION AND TRAINING
VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY
GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
-------------------------------
DINH DUC LUONG
BIOINFORMATICS XML DOCUMENTS INDEX METHOD BASED ON R-TREE METHOD
Major: Mathematical Foundations for Computer Science
Code: 9 46 01 10
SUMMARY OF MATHEMATICS DOCTORAL THESIS
Hanoi, 2019

List of works of author
1. Dinh Duc Luong, Hoang Do Thanh Tung, “A Survey on Indexing for Gene Database”, International Clustering Workshop: Teaching, Research, Business, December 27-29, 2014, pp. 50-54.
2. Hoang Do Thanh Tung, Dinh Duc Luong, “A proposed Indexing Method for Treefarm database”, International Conference on Information and Convergence Technology for Smart Society, Vol. 2, No. 1, January 19-21, 2016, Ho Chi Minh City, Vietnam, pp. 79-81.
3. Vuong Quang Phuong, Le Thi Thuy Giang, Dinh Duc Luong, Ngo Van Binh, Hoang Do Thanh Tung, “Technology solution of managing pig breed”, Proceedings of the XXI National Conference: Some selected issues of Information Technology and Communications, Thanh Hoa, July 27-28, 2018, pp. 110-116.
4. Hoang Do Thanh Tung, Dinh Duc Luong, “An Improved Indexing Method for Xpath Queries”, Indian Journal of Science and Technology, Vol. 9(31), DOI:10.17485/ijst/2016/v9i31/92731, August 2016, pp. 1-7 (SCOPUS).
5. Dinh Duc Luong, Vuong Quang Phuong, Hoang Do Thanh Tung, “A new Indexing technique XR+tree for Bioinformatic XML data compression”, International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958 (Online), Volume 8, Issue 5, June 2019, pp. 1-7 (SCOPUS).

INTRODUCTION
XML documents are structured, or semi-structured, text data. The format has been popular for decades because it stores data flexibly and is easy to share and use over the internet.
In the past, XML documents were usually not very large. In recent years, however, the rapid development of biotechnology has produced bioinformatics XML documents that can reach gigabytes or even terabytes. Such data is available from reputable sources such as SRA (decoded sequences), NCBI Genome (sequenced species), and ensembl.org (which aggregates a lot of data into BioMart). A bioinformatics XML document consists of two parts: the biological data (DNA, protein, subspecies, etc.) and the data that describes it. The data structures are defined by tags; they are often flexible and may differ from one source to another because individuals and organizations in biology customize them.

Because they are so large, these documents must be stored and accessed on hard disk, or in a distributed storage system, and only a small portion is loaded into main memory (RAM) when further analysis is needed. Hard disk access is sequential and far more time-consuming than RAM access. Query methods that touch the disk therefore try to minimize the number of disk accesses and make the most of main memory, for example through caches and buffers. Practical queries rely on algorithms tailored to each query type so that results matching the query are returned in a short time. For example:
1. XPath query on an XML document (exact search): extract all data whose tags share the same origin with, or are siblings of, a White Mouse type, or extract all data that descends from the African pig.
2. Homologous query on DNA data fragments (approximate search): find all genes homologous to a sample gene of a new species.
The traditional solution for such queries is to select and install indexing methods suited to certain types of data and specific queries. Such methods already exist, but they are of limited use on text data of this size.
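The two XPath-style examples in this introduction (siblings of a White Mouse entry, descendants of the African pig) can be sketched on a tiny in-memory document. Everything in the sample is hypothetical: the element names, the `breed` and `type` attributes, and the data are made up for illustration. The sketch uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and has no sibling axis, so the sibling lookup goes through the parent element. It scans the whole tree in memory, which is exactly what an index is meant to avoid on large documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; tag and attribute names are assumptions.
xml_text = """
<animals>
  <pig breed="African">
    <gene id="g1"/>
    <litter><piglet id="p1"/></litter>
  </pig>
  <mouse type="White"><gene id="g2"/></mouse>
  <mouse type="Grey"><gene id="g3"/></mouse>
</animals>
"""
root = ET.fromstring(xml_text)

# Descendant query: every element below the African pig, in document order.
african_descendants = [
    e.tag for e in root.findall(".//pig[@breed='African']//*")
]

# Sibling query: locate the White mouse, find its parent by scanning the tree,
# then keep the parent's other children.
white = root.find(".//mouse[@type='White']")
parent = next(p for p in root.iter() if any(child is white for child in p))
white_siblings = [
    (e.tag, dict(e.attrib)) for e in parent if e is not white
]
```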
With text data, the index is often very large, even much larger than the original data, which causes two problems: (1) storing the index data is difficult; (2) compressing the data and exploiting it at the same time is inefficient. Moreover, if the index itself is text data, query speed remains hard to improve. Recent studies on indexing XML documents therefore tend to:
- Separate the XML document into two parts and apply a different indexing method to each, suited to its data type and query type. In detail:
  1. Methods for indexing structured data (tag data) that support specific queries such as XPath.
  2. Methods for indexing biological data (such as DNA fragments) that support specific queries such as searching for homologous DNA sequences.
- Convert the original text data into numeric format in order to:
  1. Reduce the original data size.
  2. Apply appropriate indexing methods.
  3. Improve query speed.
The problems to be solved are broad and span both informatics and biology, so the thesis focuses on indexing methods that speed up specific queries by reducing the number of hard disk accesses while still returning the expected results. The thesis delivers a method for indexing structured data (tag data) that supports XPath queries. For indexing biological data (such as DNA fragments) in support of queries such as homologous DNA sequence search, the thesis surveys existing methods and sets directions for further research.
Objectives and resul ...