Computational Genomics

10-810 / MSCBIO2070 (co-listed as 02710, 03715), Spring 2008

Ziv Bar-Joseph, Takis Benos

School of Computer Science, Carnegie Mellon University
Department of Computational Biology, University of Pittsburgh


Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done individually or in teams of two to three students. Each project will also be assigned a course instructor as a project consultant/mentor. The mentor will consult with you on your ideas, but the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 30% of your final class grade, and will have two final deliverables:

1.      a writeup in the form of an IEEE paper (8 pages maximum in IEEE format, including references), due April 30, worth 60% of the project grade, and

2.      a poster presenting your work for a special class session at the end of the semester, on April 30, worth 40% of the project grade. 

 

Project Proposal:
 

You must turn in a brief project proposal (1-page maximum) by Feb 20th. 

You are encouraged to come up with a topic directly related to your own current research, or to a research topic in graphical models that interests you, as long as it has a non-trivial technical component (either theoretical or application-oriented). The proposed work must be new and must not be copied from your previous published or unpublished work. For example, research on graphical models that you did this summer does not count as a class project.

You may use the list of available data sets provided below and pick a "less adventurous" project from the following list of potential project ideas. These data sets have been used successfully for machine learning in the past, and you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list using the provided data sets.

Project proposal format:  Proposals should be one page maximum.  Include the following information:

·         Project title

·         Project idea.  This should be approximately two paragraphs.

·         Software you will need to write.

·         Papers to read.  Include 1-3 relevant papers.  You will probably want to read at least one of them before submitting your proposal.

·         Teammate(s): will you have teammate(s)?  If so, whom?  Maximum team size is three students.


Project suggestions: 

·        Ideally, you will want to pick a problem in a domain of your interest, e.g., DNA sequence analysis, genetic polymorphisms, regulatory networks, etc., and formulate your problem using a statistical machine learning formalism. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.

You can also find some project ideas below.


 



Project E: (please contact Ziv for more details): Dynamic Bayesian networks from time series datasets.

Time series expression data measure the expression levels of genes following a specific treatment. For example, following pathogen infection, such data can provide insight into the set of genes that respond to the infection and into the immune response system. Using time series data, we would like to learn a graphical model that represents the set of interactions employed as part of the response. In this project you will explore ways to use time series datasets to determine the structure and parameters of the regulatory network underlying the observed responses.
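
As a starting point only, the idea of inferring interactions from time series can be sketched as follows. This is not a dynamic Bayesian network learner: it uses lag-1 Pearson correlation as a simple stand-in for scoring a candidate edge from one gene at time t to another gene at time t+1. The gene names, the toy values, the function names, and the 0.9 threshold are all assumptions made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def lagged_edges(series, threshold=0.9):
    """Score candidate edges parent(t) -> child(t+1) by lag-1 correlation.

    series: dict mapping a gene name to its expression values over time.
    Returns (parent, child, correlation) triples with |correlation| >= threshold.
    """
    edges = []
    for parent, xs in series.items():
        for child, ys in series.items():
            if parent == child:
                continue
            # Compare the parent at times 0..T-2 with the child at times 1..T-1.
            r = pearson(xs[:-1], ys[1:])
            if abs(r) >= threshold:
                edges.append((parent, child, r))
    return edges

# Toy data (made up): geneB repeats geneA's profile one time step later.
toy = {
    "geneA": [0.0, 1.0, 3.0, 2.0, 0.5, 0.0],
    "geneB": [0.0, 0.0, 1.0, 3.0, 2.0, 0.5],
    "geneC": [1.0, 1.2, 0.9, 1.1, 1.0, 0.8],
}
edges = lagged_edges(toy)
```

A real project would replace the correlation score with a proper model score (for example, a likelihood-based score over candidate parent sets) and would need to account for the small number of time points typical of expression experiments.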



Project F: (please contact Ziv for more details): Classification using time series expression data.

It has been shown that the type of cancer, and in some cases the right treatment option, can be determined by looking at the expression profile of a patient. Many well-known classification algorithms have been suggested for this task, including SVMs, Naïve Bayes, and statistical tests. More recently, measurements that follow patients over time are becoming available. This project will explore ways to develop classifiers that are appropriate for time series data.
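
A very simple baseline for this kind of task, given here only as a sketch, is a nearest-centroid classifier that treats each patient's time series as a fixed-length vector. The class labels, the toy profiles, and the function names below are hypothetical. Note that this baseline assumes all patients are sampled at the same time points and ignores the temporal ordering of the measurements, which is exactly the limitation a project on time series classifiers would try to address.

```python
from math import dist

def centroid(profiles):
    """Pointwise mean of equal-length time series profiles."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]

def fit_centroids(training):
    """training: dict mapping a class label to a list of profiles."""
    return {label: centroid(profiles) for label, profiles in training.items()}

def classify(centroids, profile):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], profile))

# Toy data (made up): expression of one gene at four time points per patient.
training = {
    "responder": [[1.0, 2.0, 3.0, 4.0], [1.2, 2.1, 3.3, 3.9]],
    "non_responder": [[1.0, 1.1, 0.9, 1.0], [1.1, 0.9, 1.0, 1.2]],
}
centroids = fit_centroids(training)
label = classify(centroids, [0.9, 1.8, 2.9, 4.2])
```

Comparing a time-aware classifier against a baseline like this one is one way to show whether modeling the temporal structure actually helps.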



Project G: (please contact Ziv for more details): Protein interaction networks

Recent experiments have identified many new protein-protein interactions. While the quality of these data is limited, they do serve as a useful source for integration with other available datasets. In this project you will explore the relationship between the interacting proteins and other types of high-throughput data (such as expression or binding). Specifically, it is interesting to see whether aspects that cannot be inferred from the current interaction data (such as pathways) can be determined by using these complementary data sources.