Analysis of symbolic sequences from the point of view of bioinformatics

1. Introduction

1.1. Aims of the class. Relations to other algorithmic classes of SAD.
1.2. Sources of sequences: linguistics, speech recognition, bioinformatics, Internet.
1.3. General problems of sequence analysis: alignment (here and below: pairwise and multiple); scoring of sequence similarity; search for local similarities; search for the given pattern or pattern cluster; segmentation of a sequence; determination of statistical significance of comparison/search results.
1.4. Problems specific for different areas. Examples.

2. Dynamic programming on graphs.

2.1. Search for minimal path in the weighted directed acyclic graph (WDAG). Bellman equation.
2.2. Calculation of partition function. Relation to minimal path problem.
2.3. WDAGs with weights in a semiring. “Joining” operation (multiplication) and “target” operation (addition). Calculation of partition function for a WDAG with known topological ordering of vertexes. (Non)commutativity of addition and multiplication.
2.4. Special partition functions and their calculation.
2.5. Algorithm with accumulation of results. Comparison with Bellman’s and Dijkstra’s algorithms. *Optimistic search.
2.6. Complexities.
2.7. Vector weights.

3. Global pairwise sequence alignment

3.1. Informal problem statement. Role of evolution hypothesis.
3.2. Alignment weight. Match weight. Indels (gaps) and indel penalty.
3.3. Algorithm for 1-symbol indel penalty. Reduction to graph problem.
3.4. Other problem statements: affine gap penalties; general gap penalty function;
matching significance based on special partition fubction. Reduction to graph
problems. Role of gap penalties for ending fragments.
3.5. Role of etalon (“golden standard”) alignments. Quality of algorithmic alignment (accuracy and confidence). Verification of problem statement. Recourses for improvement of alignment quality, usage of extra-sequential information.

4. Dynamic programming on hypergraphs.

4.1. Examples of algorithmic problems that cannot be reduced to graphs.
4.2. Hypergrahs, hyperedges and hyperpathes. Weighted directed hypergraphs. Acyclic hypergraphs.
4.3. Semirings. Bellman’s algorithm for calculation of partition function of a given weighted directed acyclic hypergraph (WDAHG) with known topological ordering of vertexes. Complexity.
4.4. Calculation of special partition functions. Additional restrictions: commutativity of multiplication; strong aciclicity. Complexity.
4.5. Analogs of algorithm with accumulation of results and Dijkstra’s algorithms.

5. Search for local similarities.

5.1. Global alignment as a chain of local similarities. Hierarchical approach to global alignment. Area of application. Complexity.
5.2. Filtration approach to local similarity search. Seeds and seed similarities. Contiguous matching seeds. Sensitivity and selectivity of a seed w.r.t. a probability model and a set of target similarity.
5.3. Improvements of contiguous matching seeds: gapped seeds; subset seeds; vector seeds; multy-seeds.
5.4. Seed design.

6. Probability models: revisited.

6.1. Examples of probability models: Bernoulli model, Markov model, Hidden Markov model (HMM).
6.2. HMMs and graphs. Involute of HMM graph. Trajectory graph of an HMM.
6.3. Sequence of states corresponding to a given sequence. Viterbi algorithm and forward-backward algorithm.
6.4. HMMs and dynamic programming. Viterbi algorithm and optimal path in trajectory graph. Forward-backward algorithm and special partition function on trajectory graph.
6.5. How to estimate quality of state assignment made by an algorithm?
6.6. Estimation of parameters of a probability model. Problem statement.
6.7. EM-algorithm.
6.8. Baum-Welch algorithm.

7. Examples of HMM implementation

7.1. HMMs and finite automata. Computation of probability of set of sequences. General construction and examples.
7.2. Seed sensitivity.
7.3. Statistical significance of cluster of pattern entries.
7.4. Sequence segmentation. Example: finding coding regions in DNA sequences. Types of HMMs describing procariotic genome. Problem of boundaries.

8. Multiple sequence comparison.

8.1. Multiple sequence alignment. Dynamic programming without and with restriction on gap size.
8.2. Profiles (Position Specific Weight Matrixes, PSWMs) . Alignment of a sequence and a profile; alignment of profiles. Progressive alignment.
8.3. Pattern recognition. Description of patterns with PSWMs and consensus words.
8.4. Search for multiple similarities of unknown pattern. Algorithm Gibbs-sampler.
8.5. Phylogeny.
8.6. Database search. Statistical significance of search results. Karlin-Altschul theory