Two Methods for Analyzing DNA Sequences: Predicting Coding Regions and Clustering Homologous DNA
With the exponential growth in the number of DNA sequences over the past twenty years, it has become impractical to analyze DNA sequences through traditional biological experiments alone. Various mathematical methods and computer algorithms have been applied to sequence analysis and related research areas, helping biological study move from manual operation to automated computation. Two research areas in bioinformatics are especially important for the study of DNA sequences: one is to predict the coding regions of DNA sequences, and the other is to determine evolutionary relationships based on DNA sequences. In this thesis, we introduce two mathematical methods that present our achievements in these two areas, respectively. In chapter two, we introduce a simple parameter called TICOR (Threshold to Identify Coding Region) to distinguish coding regions from non-coding regions. The method requires only linear computation time, which is much better than that of the Fourier transform and other methods. Moreover, we are able to estimate the proportion of coding regions relative to the length of the whole DNA sequence based simply on the parameter TICOR. Finally, we develop a novel method, which we call TICORSCAN, to predict coding regions in DNA sequences. We test TICORSCAN and other popular methods, such as GENSCAN and TWINSCAN, on the ROSETTA dataset. The prediction accuracy shows that TICORSCAN is able to predict coding regions more efficiently. In chapter three, we report a novel mathematical method that transforms DNA sequences into distribution vectors. The distribution vectors correspond to points in sixty-dimensional Euclidean space. Each component of a distribution vector represents the distribution of one kind of nucleotide in k segments of the DNA sequence. The statistical properties of the distribution vectors are demonstrated and examined with large datasets of human DNA sequences and random sequences.
The determined expectation and standard deviation make the mapping stable and practicable. Moreover, we apply the distribution vectors to the clustering of the complete mitochondrial genomes of 80 placental mammals and the haemagglutinin (HA) gene of 60 H1N1 viruses from human, swine and avian hosts. The 80 mammals and 60 H1N1 viruses are classified accurately and rapidly compared with multiple sequence alignment methods. The results indicate that the distribution vectors can reveal the similarity and evolutionary relationships among homologous DNA sequences based on the pairwise distances between the vectors. The advantage of fast computation allows the distribution vectors to handle large amounts of DNA sequence data efficiently.
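
As a rough illustration of the alignment-free idea described above, the following sketch maps each sequence to a fixed-length numeric vector and compares sequences by Euclidean distance. It is not the distribution vector defined in chapter three: the feature used here (the fraction of each nucleotide in each of k equal-length segments) is a simplified stand-in, and the choice k = 15, which happens to give 4 x 15 = 60 components, is an assumption made only for this example. The function names are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def segment_frequency_vector(seq, k=15):
    """Illustrative stand-in feature (NOT the thesis's distribution vector):
    the fraction of each nucleotide within each of k near-equal segments.
    The result has 4 * k components (60 when k = 15, an assumed value)."""
    seq = seq.upper()
    n = len(seq)
    if n < k:
        raise ValueError("sequence must contain at least k nucleotides")
    vec = []
    for i in range(k):
        # Integer arithmetic keeps segments contiguous and non-empty.
        start = i * n // k
        end = (i + 1) * n // k
        segment = seq[start:end]
        for base in NUCLEOTIDES:
            vec.append(segment.count(base) / len(segment))
    return vec


def euclidean_distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(seqs, k=15):
    """Pairwise distances between the feature vectors of several sequences.
    Each sequence is scanned once, so building the vectors takes time linear
    in the total sequence length; no alignment is performed."""
    vecs = [segment_frequency_vector(s, k) for s in seqs]
    return [[euclidean_distance(u, v) for v in vecs] for u in vecs]
```

A distance matrix of this kind could then be passed to any standard clustering or tree-building procedure; which procedure the thesis uses is not stated in this abstract.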