Computational Techniques in Biological Sequence Analysis

Project Guidelines

The course project is an individual research project that will account for 25% of the final grade. The project consists of:

Project Proposal (at most 2 pages) – due February 24 (11:59 PM)
Final Paper (at most 8 pages, in a conference/journal format of your choice) – due April 17 (11:59 PM)

Finding Relevant Literature

To search for research papers, use Google Scholar. Most papers available online can be found there, and the majority of links are freely accessible. If you encounter paywalls, you should be able to access the papers through the University’s library website or while connected to the University’s network.

Project Options

Students may choose from the following five research directions, all within the intersection of Computational Biology and Deep Learning:

Option A: Literature Survey

Pick a computational biology problem that interests you.
Conduct a survey of deep learning approaches to tackle this problem.
Compare and contrast different methods, discussing their strengths and weaknesses.
Example: Reviewing Transformer-based models for DNA sequence analysis.

Proposal Requirements:
- Clearly define the problem.
- Cite 8 to 12 relevant papers you plan to review.
Final Paper Structure:
1. Introduction – Define the problem and its significance.
2. Survey – Summarize the literature and compare techniques.
3. Analysis – Discuss open problems and current research gaps.
4. Conclusion – Summarize insights and propose future directions.

Option B: Empirical Evaluation

Pick a computational biology problem that interests you.
Implement and experiment with several deep learning techniques.
Compare models based on performance, computational efficiency, and generalization.
Example: Comparing Graph Neural Networks and CNNs for protein structure prediction.

Proposal Requirements:
- Clearly define the problem.
- Identify which deep learning methods you plan to experiment with.
- Cite 4 to 8 related papers.
Final Paper Structure:
1. Introduction – Define the problem and its importance.
2. Review of Prior Work – Summarize related studies.
3. Experimental Setup – Explain datasets, models, and evaluation metrics.
4. Results & Analysis – Compare techniques and their effectiveness.
5. Conclusion – Identify the best-performing approach and suggest improvements.
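
For Option B, a comparison along the three axes listed (performance, computational efficiency, generalization) might be organized roughly as in the minimal Python sketch below. Everything in it is a hypothetical placeholder chosen for illustration: the two baseline classes, the toy sequences and labels, and the `compare` helper are not course-provided code, and a real project would substitute deep learning models and a proper dataset.

```python
import time
from collections import Counter


class MajorityBaseline:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, seqs, labels):
        self.label_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, seqs):
        return [self.label_ for _ in seqs]


class GCThreshold:
    """Hypothetical baseline: predicts 1 when GC content exceeds a learned cutoff."""

    @staticmethod
    def gc(seq):
        return sum(base in "GC" for base in seq) / len(seq)

    def fit(self, seqs, labels):
        pos = [self.gc(s) for s, y in zip(seqs, labels) if y == 1]
        neg = [self.gc(s) for s, y in zip(seqs, labels) if y == 0]
        # Cutoff is the midpoint between the two class means.
        self.cut_ = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, seqs):
        return [int(self.gc(s) > self.cut_) for s in seqs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def compare(models, train, test):
    """Return train/test accuracy, their gap, and fit time for each model."""
    (x_tr, y_tr), (x_te, y_te) = train, test
    rows = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(x_tr, y_tr)
        fit_seconds = time.perf_counter() - start  # computational efficiency
        tr_acc = accuracy(y_tr, model.predict(x_tr))
        te_acc = accuracy(y_te, model.predict(x_te))  # held-out performance
        rows[name] = {
            "train_acc": tr_acc,
            "test_acc": te_acc,
            "gap": tr_acc - te_acc,  # crude generalization indicator
            "fit_seconds": fit_seconds,
        }
    return rows


# Toy data, invented purely for illustration.
train = (["GGCC", "GCGC", "ATAT", "AATT", "TTAA"], [1, 1, 0, 0, 0])
test = (["GGGC", "ATTA", "CCGG", "TATA"], [1, 0, 1, 0])
models = {"majority": MajorityBaseline(), "gc_threshold": GCThreshold()}
results = compare(models, train, test)
for name, row in results.items():
    print(name, row)
```

Keeping every model behind the same `fit`/`predict` interface means each one is timed and scored by identical code, which makes the resulting table easier to defend in the Results & Analysis section.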

Option C: Algorithm Design

Identify a problem where existing deep learning approaches are insufficient.
Develop a new method or modification to tackle the problem.
Provide theoretical and/or empirical justification for your method.
Example: Designing a self-supervised pretraining task for genomic sequence embeddings.

Proposal Requirements:
- Define the problem and explain why current methods fail.
- Describe the intuition behind your new technique.
- Cite 4 to 8 relevant papers.
Final Paper Structure:
1. Introduction – Define the problem and explain why existing methods fall short.
2. Background – Summarize related techniques.
3. Proposed Approach – Detail your new model or method.
4. Evaluation – Compare performance against existing methods.
5. Conclusion – Summarize findings and discuss open challenges.

Option D: Dataset/Simulator/Benchmark Design

Identify a problem where datasets, benchmarks, or simulators are lacking.
Collect a dataset, build a simulator, or create a benchmark for evaluation.
Example: Developing a new dataset for metagenomic binning with taxonomic labels.

Proposal Requirements:
- Define the problem and its significance.
- Describe the dataset/simulator/benchmark you plan to create.
- Cite 4 to 8 relevant papers.
Final Paper Structure:
1. Introduction – Explain why current datasets/benchmarks are insufficient.
2. Proposed Dataset/Benchmark – Describe its key properties.
3. Evaluation – Show results using baseline models.
4. Conclusion – Discuss key insights and future improvements.

Option E: Theoretical Analysis

Identify a problem or deep learning technique that lacks theoretical understanding.
Conduct a mathematical analysis of its properties (e.g., generalization bounds, optimization stability).
Example: Studying the expressive power of Transformer models for DNA sequences.

Final Paper Guidelines

Format: Use a conference/journal format of your choice (e.g., ICML, PLOS ONE, Bioinformatics).
Length: Maximum 8 pages (excluding references).
Submission: Upload electronically via LEARN by April 17 (11:59 PM).

Recommended Report Structure

Abstract (max 500 words) – Summarize the problem, key findings, and impact.
Background (max 1000 words) – Review prior work and state how your research differs.
Methods & Experiments – Explain datasets, techniques, and evaluation setup.
Results & Discussion – Present findings, analyze results, and suggest improvements.
Future Work – Describe next steps and publication plans.
Comparison with Proposal – Reflect on deviations from the initial plan.
Personal Reflection – Share lessons learned and challenges encountered.

Potential Research Topics at the Intersection of Computational Biology & Deep Learning

Genomics & Sequence Analysis

Self-supervised learning for DNA sequence representations
Transformer-based models for genomic sequence analysis
Efficient BPE tokenization strategies for genomic data

Structural & Functional Biology

Protein structure prediction using deep learning
Graph Neural Networks for protein-protein interactions
Drug-target interaction prediction with deep learning

Deep learning for microscopy image analysis
Multi-modal fusion of genomic and imaging data
Generative models for synthetic biomedical images

Evolution & Metagenomics

Deep clustering for metagenomic species binning
Contrastive learning for evolutionary relationship discovery
Variational Autoencoders for phylogenetic inference

Health & Disease Prediction

Deep survival analysis for genetic diseases
Personalized medicine using genomic deep learning
ML for microbiome-based health predictions

Choose a problem that excites you, allows you to explore the literature, and lets you contribute meaningfully to Computational Biology and Deep Learning. (I am biased towards ML applications, but I am open to students doing projects at the intersection of CompBio and other areas of CS, such as HCI, Algorithms and Complexity, and Computer Graphics.)

For any questions, reach out during office hours or via email.