Feature selection for indexing protein structures

Document information:

Proteins are composed of amino acids, which are in turn made up mostly of carbon, hydrogen, oxygen and nitrogen. A protein structure consists of the coordinates of thousands of atoms. Building structure index tables of proteins (often organized as suffix trees or suffix arrays) is an important step toward quickly searching or classifying protein structures.
Content extracted from the document:
JOURNAL OF SCIENCE OF HNUE
Natural Sci., 2011, Vol. 56, No. 7, pp. 32-43

FEATURE SELECTION FOR INDEXING PROTEIN STRUCTURES

Luong Van Hieu
Hanoi Vocational College for Electro-Mechanics
Pham Tho Hoan(∗)
Hanoi National University of Education
(∗) E-mail: hoanpt@hnue.edu.vn

Abstract. Proteins are composed of amino acids, which are in turn made up mostly of carbon, hydrogen, oxygen and nitrogen. A protein structure consists of the coordinates of thousands of atoms. Building structure index tables of proteins (often organized as suffix trees or suffix arrays) is an important step toward quickly searching or classifying protein structures. Most previous studies use only structural features to build the index tables, so the search and classification performance obtained with these tables is limited. In this paper, we propose two methods of feature selection that create index tables containing not only structural features but also sequential features. Experiments on a protein classification dataset (called SCOP) showed that the proposed feature selection methods considerably improve search and classification performance compared with previous feature selection methods.

Keywords: Protein structure, indexing, feature selection.

1. Introduction

The goal of the life sciences is to understand the function of biological molecules such as proteins, DNA and RNA. While current biological technologies can easily determine the sequence or 3D structure of biological molecules, it remains difficult to discover their functions. However, the structures of sequenced proteins, especially their 3D structures, can provide important information for predicting their functions from the previously known functions of other proteins with similar structures.

In general, the problem of structural search and comparison can be solved in two main phases: first, extracting informative feature vectors from the 3D structures; second, representing and organizing these feature vectors in an appropriate data structure (for example, a suffix tree or suffix array) so that they can be searched or classified quickly [9, 10]. Of the two phases, only the first (i.e. extracting informative structural features) influences the accuracy of the search and classification results.

In recent years many different approaches to extracting and indexing feature vectors for searching and classifying protein structures have been developed. For example, ProGreSS [1] extracts feature vectors from both the structure and the sequence of the proteins. These feature vectors are then combined and indexed using a multi-dimensional indexing method. A better algorithm is PSIST (Protein Structure Indexing using Suffix Trees) [9]. It converts a 3D structure into a sequence (string) in a few steps: it extracts feature vectors consisting of the distance between each pair of residues and the angle between their planes, and then converts each component of a vector into a unique symbol. The resulting structural feature string can be inserted into a suffix tree to speed up the search. Another algorithm, PSISA [10], extracts feature vectors in the same way. However, instead of a suffix tree, PSISA uses a suffix array to store the feature sequences. The experimental results for PSISA showed that indexing with a suffix array improves memory utilization by more than 35% over the suffix tree used in PSIST [9].

In this paper, we propose adding other kinds of features to improve structure search and classification performance. We propose two methods of extracting feature vectors. The first (called SFAS - Structure Feature Attached Side-Chain) focuses on protein 3D structures. The second (called SSF - Structure and Sequence Feature) combines the tertiary structure with the primary structure. Our initial experimental results on SCOP classification showed that the proposed feature types considerably improve search and classification performance.

2. Content

2.1. Methods

2.1.1. The approach of protein structure indexing

A protein is a macromolecule that consists of many monomers (called amino acids) linked together in a sequence. An amino acid has a central carbon (Cα) linked to an amino group (-NH2), a carboxyl group (-COOH), a hydrogen atom, and a side chain R that characterizes the properties of each amino acid. Only 20 different kinds of amino acids are found in the proteins of living organisms [5], and they are denoted by 20 letters of the alphabet.

If we consider a protein only as a sequence of amino acids, we have its primary structure. The properties (functions) of a protein are often determined by its 3D structure, that is, the relative positions, in some reference system, of the atoms (C, H, O, N) that make up the amino acid sequence. The 3D structure of a protein is usually called the tertiary structure. In the 3D structure of a protein there are usually three types of sub-structure: Alpha-Helices (H), Beta-Strands (E), and Loops (C). If we view the structure of a protein as a sequence of these three sub-structure types (H, E, C), we have the secondary structure of the protein. For studying proteins, analysing the 3D (tertiary) structure is always the best option. However, 3D structure data is usually very large.

Figure 1. Structure of an amino acid

Currently there are ab ...
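
To make the indexing approach from the Introduction concrete (structural features converted to symbols, then the symbol string indexed with a suffix array as in PSISA [10]), here is a minimal Python sketch. It is not the authors' implementation and not the PSIST or PSISA code. Several simplifications are assumed: only the distance between Cα atoms is used (the angle between residue planes is omitted), the `gap`, `bin_width` and `alphabet` parameters are hypothetical, and the suffix array is built by naive sorting rather than by an efficient construction algorithm.

```python
import math


def encode_structure(ca_coords, gap=1, bin_width=1.0,
                     alphabet="ABCDEFGHIJKLMNOPQRST"):
    """Turn a list of C-alpha (x, y, z) coordinates into a symbol string.

    Assumption for illustration: the only feature is the distance between
    residue i and residue i + gap, discretized into bins of bin_width.
    Distances beyond the last bin are clamped to the last symbol.
    """
    symbols = []
    for i in range(len(ca_coords) - gap):
        d = math.dist(ca_coords[i], ca_coords[i + gap])
        bin_index = min(int(d // bin_width), len(alphabet) - 1)
        symbols.append(alphabet[bin_index])
    return "".join(symbols)


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, in increasing order.

    Two binary searches locate the block of suffixes whose prefix of
    length len(pattern) equals pattern.
    """
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])
```

In such a sketch, each protein in a database would be encoded once with `encode_structure`, and a query structure would be encoded the same way and looked up with `find_occurrences`. Adding sequence-derived symbols to the string, in the spirit of the proposed SSF features, would change only the encoding step and not the index.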