Algebra, statistics and computational biology

Graduate course

Dept. of Mathematics

Spring 2006

Anders Nedergaard Jensen and Niels Lauritzen





Link to the official course home page.

The human genome can be viewed mathematically as a string of around 3 billion letters drawn from A, C, G and T, denoting the base pairs in a DNA molecule. Around 5% of the genome represents 25,000 actual genes, i.e., functional words coding for proteins in the human body. In analyzing genes and similarities between DNA sequences, one usually resorts to statistical models of joint distributions of discrete random variables, such as hidden Markov models. Surprisingly, the statistical models used can be seen as solutions of a system of highly structured polynomial equations (an algebraic variety). This framework has been termed algebraic statistics. The inference algorithms used in computational biology fall under the heading of the new field of tropical algebraic geometry, where the usual operations + and ⋅ are replaced by minimum and +, respectively.
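As a minimal sketch of what replacing + and ⋅ by minimum and + means in practice, the Python snippet below evaluates the same hidden Markov model recursion over two different arithmetics. The two-state model, its parameter values, the observation sequence and the function names (chain_value, neglog) are all made up for illustration and are not taken from the course material. With ordinary arithmetic the recursion sums the probabilities of all hidden paths. With tropical arithmetic applied to negative log-parameters it returns the smallest negative log-probability over all hidden paths, which is the quantity computed by the Viterbi algorithm.

```python
import math
import operator


def chain_value(init, trans, emit, obs, add, mul):
    """Evaluate the HMM recursion over the semiring given by (add, mul)."""
    n_states = len(init)
    # value[s]: semiring "sum" over all hidden paths that end in state s
    value = [mul(init[s], emit[s][obs[0]]) for s in range(n_states)]
    for symbol in obs[1:]:
        new_value = []
        for t in range(n_states):
            acc = None
            for s in range(n_states):
                term = mul(mul(value[s], trans[s][t]), emit[t][symbol])
                acc = term if acc is None else add(acc, term)
            new_value.append(acc)
        value = new_value
    total = value[0]
    for v in value[1:]:
        total = add(total, v)
    return total


def neglog(values):
    """Turn probabilities into tropical weights (negative logarithms)."""
    return [-math.log(p) for p in values]


# Hypothetical toy model: 2 hidden states, 2 output symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 1]

# Ordinary arithmetic (+, *): total probability of the observations.
prob = chain_value(init, trans, emit, obs, add=operator.add, mul=operator.mul)

# Tropical arithmetic (min, +) on negative logs: cost of the best hidden path.
best = chain_value(
    neglog(init),
    [neglog(row) for row in trans],
    [neglog(row) for row in emit],
    obs,
    add=min,
    mul=operator.add,
)

print("probability of the observations:", prob)
print("probability of the most likely hidden path:", math.exp(-best))
```

The recursion itself is written once; only the pair of operations passed in changes. The sketch returns only the optimal value; recovering the optimal hidden path would additionally require back-pointers, which are omitted here.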

Based on the book Algebraic Statistics for Computational Biology, we will go through statistical models in computational biology in the context of algebra and polyhedral geometry. The book is based on a highly successful seminar series at Berkeley in 2004-2005, organized by Pachter and Sturmfels. Pachter, Sturmfels and Sullivant are organizing a workshop based on the book at the Sophus Lie Conference Center in Nordfjordeid, Norway, in June 2006. Since this is a math course related to biology, we plan a field trip to Nordfjordeid in June.

Prerequisites

Algebra and basic statistics.

Literature

Algebraic Statistics for Computational Biology, L. Pachter, B. Sturmfels et al., Cambridge University Press, 2005.