Finding sequencing errors on DNA coding regions: a combination of statistical and neural network methods

Hatzigeorgiou A., Sanida P., Papakostas E., Reczko M.

Synaptic Ltd., P.O. Box 51340, Athens 14510, Greece. E-mail: [email protected]

In this work we describe an improved method for finding sequencing errors in coding regions, based only on the nucleotide sequence information and therefore useful when no close homologue is known. To achieve this, basic coding measures are compared, and the one giving the best in-frame prediction on a test set of coding sequences is selected. The best results are obtained by combining codon usage statistics, which transform the sequence into a frequency vector, with an artificial neural network, which separates the coding from the non-coding vectors. On an independent set, the coding frame in the cDNA is predicted correctly for 90% of the nucleotides. The results of this prediction are processed with a dynamic programming algorithm to find the optimal assignment of frames and to determine the exact position of sequencing errors. Frameshifts of length at least 40 bases can be located exactly.
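As an illustration of the first step, the following minimal sketch turns a sequence window into a codon frequency vector for each of the three reading frames. The 64-codon ordering, the function name `codon_frequency_vector` and the example window are assumptions made for this sketch; the abstract does not specify the window size, the exact statistics or the network architecture, and the neural network itself is not shown.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order; the ordering is an assumption.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def codon_frequency_vector(seq, frame=0):
    """Return 64 relative codon frequencies for seq read in the given frame."""
    seq = seq.upper()
    counts = [0] * 64
    # Step through the sequence in non-overlapping triplets starting at `frame`.
    for i in range(frame, len(seq) - 2, 3):
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:  # skip triplets containing ambiguous bases such as N
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


# Hypothetical example window; one vector per reading frame.
window = "ATGGCCAAGATGGCC"
vectors = [codon_frequency_vector(window, frame) for frame in range(3)]
```

Each of the three vectors could then be passed to a classifier that scores how "coding" the window looks in that frame.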

Both the predicted frame and the reliability of the prediction vary along the sequence. For this reason the output of the program includes a very detailed prediction: for each nucleotide a score is given, with a high score indicating that the given nucleotide is the first one of a codon. If the prediction is "0 9 0 0 9 0 0 9", the user gets an accurate prediction for the particular frame. The sequence "0 6 3 4 7 5 0" indicates a region with very low prediction accuracy. If the prediction is "0 9 0 0 9 0 0 8 1 7 1 1 8 0 0 9 0 0 9", the probability of a sequencing error is very high.
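The score strings above can be used to illustrate how a dynamic programming step might assign frames and locate a frameshift. The sketch below is a Viterbi-style assignment under stated assumptions, not the program's actual algorithm: a position supports only the frame in which it would start a codon, and every change of frame costs a fixed `switch_penalty` (a hypothetical parameter, set to 10 here).

```python
def parse_scores(text):
    """Convert a string such as "0 9 0 0 9" into a list of integer scores."""
    return [int(token) for token in text.split()]


def assign_frames(scores, switch_penalty=10):
    """Assign a frame (0, 1 or 2) to every position, maximising total score."""
    n = len(scores)
    if n == 0:
        return []

    def emit(i, frame):
        # Position i starts a codon in frame `frame` only if i % 3 == frame.
        return scores[i] if i % 3 == frame else 0

    best = [[0] * 3 for _ in range(n)]   # best[i][f]: best total ending in frame f
    back = [[0] * 3 for _ in range(n)]   # back[i][f]: frame chosen at position i-1
    for f in range(3):
        best[0][f] = emit(0, f)
    for i in range(1, n):
        for f in range(3):
            # Either stay in frame f or switch from another frame at a cost.
            prev = max(
                range(3),
                key=lambda g: best[i - 1][g] - (switch_penalty if g != f else 0),
            )
            cost = switch_penalty if prev != f else 0
            best[i][f] = best[i - 1][prev] - cost + emit(i, f)
            back[i][f] = prev

    # Trace back from the best final frame.
    f = max(range(3), key=lambda g: best[n - 1][g])
    path = [f]
    for i in range(n - 1, 0, -1):
        f = back[i][f]
        path.append(f)
    path.reverse()
    return path


def frame_changes(path):
    """Return the positions at which the assigned frame differs from the previous one."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


clean = assign_frames(parse_scores("0 9 0 0 9 0 0 9"))
shifted = assign_frames(parse_scores("0 9 0 0 9 0 0 8 1 7 1 1 8 0 0 9 0 0 9"))
print(frame_changes(clean), frame_changes(shifted))
```

With this penalty, the first string yields a single frame throughout, while the third string yields one frame change between the peaks at positions 7 and 9 (0-based), which is where the spacing of the high scores drops from three to two. A larger penalty suppresses short, weakly supported shifts, which is one plausible reason a minimum frameshift length is needed for exact localisation.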

The output contains:

The program is available at: http://www.imbb.forth.gr/seqerr.html