Computational Techniques in Biological Sequence Analysis
Project Guidelines
The course project is an individual research project that will account for 25% of the final grade. The project consists of:
- Project Proposal (at most 2 pages) – due February 24 (11:59 PM)
- Final Paper (at most 8 pages, in a conference/journal format of your choice) – due April 17 (11:59 PM)
Finding Relevant Literature
To search for research papers, use Google Scholar. Most papers available online can be found there, and the majority of links are freely accessible. If you encounter paywalls, you should be able to access the papers through the University’s library website or while connected to the University’s network.
Project Options
Students may choose from the following five research directions, all within the intersection of Computational Biology and Deep Learning:
Option A: Literature Survey
- Pick a computational biology problem that interests you.
- Conduct a survey of deep learning approaches to tackle this problem.
- Compare and contrast different methods, discussing their strengths and weaknesses.
- Example: Reviewing Transformer-based models for DNA sequence analysis.
Proposal Requirements:
- Clearly define the problem.
- Cite 8 to 12 relevant papers you plan to review.
Final Paper Structure:
- Introduction – Define the problem and its significance.
- Survey – Summarize the literature and compare techniques.
- Analysis – Discuss open problems and current research gaps.
- Conclusion – Summarize insights and propose future directions.
Option B: Empirical Evaluation
- Pick a computational biology problem that interests you.
- Implement and experiment with several deep learning techniques.
- Compare models based on performance, computational efficiency, and generalization.
- Example: Comparing Graph Neural Networks vs. CNNs for protein structure prediction.
Proposal Requirements:
- Clearly define the problem.
- Identify which deep learning methods you plan to experiment with.
- Cite 4 to 8 related papers.
Final Paper Structure:
- Introduction – Define the problem and its importance.
- Review of Prior Work – Summarize related studies.
- Experimental Setup – Explain datasets, models, and evaluation metrics.
- Results & Analysis – Compare techniques and their effectiveness.
- Conclusion – Identify the best-performing approach and suggest improvements.
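As a rough illustration of the comparison Option B asks for (performance, computational efficiency, and generalization), the sketch below evaluates two stand-in classifiers on a toy DNA task. Everything in it is an assumption made for illustration: the synthetic GC-rich vs. AT-rich dataset, the function names, and the use of simple rule-based callables in place of trained deep learning models. In a real project, each entry in `models` would wrap a trained network and the datasets would come from your chosen benchmark.

```python
import random
import time


def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def make_toy_dataset(n, length, rng):
    """Synthetic data (illustrative only): label 1 is GC-rich, label 0 is AT-rich."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Sampling weights in the order A, C, G, T.
        weights = [1, 3, 3, 1] if label else [3, 1, 1, 3]
        seq = "".join(rng.choices("ACGT", weights=weights, k=length))
        data.append((seq, label))
    return data


def gc_threshold_model(seq):
    """Stand-in 'model': predicts 1 when more than half the bases are G or C."""
    return int(gc_content(seq) > 0.5)


def always_one_model(seq):
    """Trivial baseline: always predicts class 1."""
    return 1


def evaluate(model, dataset):
    """Return accuracy and wall-clock inference time for one model on one dataset."""
    start = time.perf_counter()
    correct = sum(model(seq) == label for seq, label in dataset)
    elapsed = time.perf_counter() - start
    return {"accuracy": correct / len(dataset), "seconds": elapsed}


rng = random.Random(0)  # fixed seed so the comparison is reproducible
in_dist = make_toy_dataset(200, 100, rng)
# Much shorter sequences serve as a crude proxy for a generalization test.
shifted = make_toy_dataset(200, 20, rng)

models = {"gc_threshold": gc_threshold_model, "always_one": always_one_model}
results = {
    name: {"in_dist": evaluate(model, in_dist), "shifted": evaluate(model, shifted)}
    for name, model in models.items()
}

for name, per_split in results.items():
    for split, metrics in per_split.items():
        print(f"{name:>12} {split:>8} "
              f"acc={metrics['accuracy']:.3f} time={metrics['seconds']:.5f}s")
```

Reporting every model on the same splits, with the same metrics and a fixed seed, keeps the comparison fair. A trivial baseline such as `always_one` helps show whether a more complex model is learning anything useful.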
Option C: Algorithm Design
- Identify a problem where existing deep learning approaches are insufficient.
- Develop a new method or modification to tackle the problem.
- Provide theoretical and/or empirical justification for your method.
- Example: Designing a self-supervised pretraining task for genomic sequence embeddings.
Proposal Requirements:
- Define the problem and explain why current methods fail.
- Describe the intuition behind your new technique.
- Cite 4 to 8 relevant papers.
Final Paper Structure:
- Introduction – Define the problem and explain why existing methods fall short.
- Background – Summarize related techniques.
- Proposed Approach – Detail your new model or method.
- Evaluation – Compare performance against existing methods.
- Conclusion – Summarize findings and discuss open challenges.
Option D: Dataset/Simulator/Benchmark Design
- Identify a problem where datasets, benchmarks, or simulators are lacking.
- Collect a dataset, build a simulator, or create a benchmark for evaluation.
- Example: Developing a new dataset for metagenomic binning with taxonomic labels.
Proposal Requirements:
- Define the problem and its significance.
- Describe the dataset/simulator/benchmark you plan to create.
- Cite 4 to 8 relevant papers.
Final Paper Structure:
- Introduction – Explain why current datasets/benchmarks are insufficient.
- Proposed Dataset/Benchmark – Describe its key properties.
- Evaluation – Show results using baseline models.
- Conclusion – Discuss key insights and future improvements.
Option E: Theoretical Analysis
- Identify a problem or deep learning technique that lacks theoretical understanding.
- Conduct a mathematical analysis of its properties (e.g., generalization bounds, optimization stability).
- Example: Studying the expressive power of Transformer models for DNA sequences.
Final Paper Guidelines
- Format: Use a conference/journal format of your choice (e.g., ICML, PLOS ONE, Bioinformatics).
- Length: Maximum 8 pages (excluding references).
- Submission: Upload electronically via LEARN by April 17 (11:59 PM).
Recommended Report Structure
- Abstract (max 500 words) – Summarize the problem, key findings, and impact.
- Background (max 1000 words) – Review prior work and state how your research differs.
- Methods & Experiments – Explain datasets, techniques, and evaluation setup.
- Results & Discussion – Present findings, analyze results, and suggest improvements.
- Future Work – Describe next steps and publication plans.
- Comparison with Proposal – Reflect on deviations from the initial plan.
- Personal Reflection – Share lessons learned and challenges encountered.
Potential Research Topics at the Intersection of Computational Biology & Deep Learning
Genomics & Sequence Analysis
- Self-supervised learning for DNA sequence representations
- Transformer-based models for genomic sequence analysis
- Efficient BPE tokenization strategies for genomic data
Structural & Functional Biology
- Protein structure prediction using deep learning
- Graph Neural Networks for protein-protein interactions
- Drug-target interaction prediction with deep learning
Biomedical Image & Multi-Modal Learning
- Deep learning for microscopy image analysis
- Multi-modal fusion of genomic and imaging data
- Generative models for synthetic biomedical images
Evolution & Metagenomics
- Deep clustering for metagenomic species binning
- Contrastive learning for evolutionary relationship discovery
- Variational Autoencoders for phylogenetic inference
Health & Disease Prediction
- Deep survival analysis for genetic diseases
- Personalized medicine using genomic deep learning
- ML for microbiome-based health predictions
Choose a problem that excites you, allows you to explore the literature, and lets you contribute meaningfully to Computational Biology and Deep Learning. (I am biased towards ML applications, but I am open to students doing projects at the intersection of CompBio and other areas of CS, such as HCI, Algorithms and Complexity, and Computer Graphics.)
For any questions, reach out during office hours or via email.