Careers in Genomics: Computational biologist

An Everyday DNA blog article

Written by: Sarah Sharman, PhD, Science writer
Illustrated by: Cathleen Shaw

The ability to sequence and analyze an organism’s DNA has revolutionized the field of biology, allowing researchers to understand how the genome functions and how changes in the genome affect all life on Earth. Careers in the fields of genetics and genomics are booming. In this new Everyday DNA blog series, Careers in Genomics, we will learn about different career paths in genetics and genomics.

Genomic researchers create and analyze staggering amounts of data in their day-to-day work. The data from a single human genome sequence take up around 200 gigabytes, roughly the capacity of 40 full-length movie DVDs. By 2025, we will need an estimated 40 exabytes to store the genome-sequence data generated worldwide. That’s roughly eight billion DVDs, a stack tall enough to reach outer space.
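For readers who like to check the math, here is a rough back-of-the-envelope sketch of those numbers in Python. The 200 gigabytes, 40 DVDs, and 40 exabytes come from the paragraph above. The disc thickness, the decimal definition of an exabyte, and the 100-kilometer "edge of space" are assumptions added purely for illustration.

```python
# Figures quoted in the article.
GENOME_SIZE_GB = 200      # data from one human genome sequence
DVDS_PER_GENOME = 40      # DVDs needed to hold that genome
WORLDWIDE_DATA_EB = 40    # estimated worldwide storage need

# Assumptions for illustration only (not stated in the article).
GB_PER_EXABYTE = 10**9    # decimal units: 1 exabyte = one billion gigabytes
DVD_THICKNESS_MM = 1.2    # approximate thickness of a single disc
EDGE_OF_SPACE_KM = 100    # a commonly used altitude for where space begins

# Capacity implied per DVD, then the number of DVDs for 40 exabytes.
gb_per_dvd = GENOME_SIZE_GB // DVDS_PER_GENOME
dvd_count = WORLDWIDE_DATA_EB * GB_PER_EXABYTE // gb_per_dvd

# Height of the stack: millimeters converted to kilometers.
stack_height_km = dvd_count * DVD_THICKNESS_MM / 1_000_000

print(f"DVDs needed: {dvd_count:,}")
print(f"Stack height: {stack_height_km:,.0f} km")
print(f"Reaches space: {stack_height_km > EDGE_OF_SPACE_KM}")
```

Under these assumptions the stack works out to thousands of kilometers, far above the altitude where space is usually said to begin.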

In addition to storing all the genomic data, researchers must also figure out how to extract meaning from the linear DNA sequences composed of billions of base pairs. Computational biologists are critical members of a research team that can tackle this task using computational solutions. For our first career spotlight, let’s learn more about computational biology’s role in genomics.

What is computational biology?

Computational biology is an interdisciplinary science that uses computer tools, statistics, and mathematics to answer complex biological questions. The discipline covers many areas, from analyzing genomes to modeling population movement over time to predicting protein function for drug development. The computational methods applied to each biological puzzle vary widely, and computational biologists must tailor them to the problem at hand.

Although it has grown exponentially over the past decade, the field of computational biology is not new. It originated in the 1960s in the field of protein analysis. Early computational solutions included de novo assembly on IBM computers, a dynamic programming algorithm for pairwise protein sequence alignments, and a mathematical framework for amino acid substitutions. After scientists discovered that DNA encodes RNA, which encodes protein, computational biology expanded into the field of genomics.
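To give a flavor of what pairwise sequence alignment involves, here is a minimal sketch of the classic dynamic programming approach to global alignment (often called Needleman-Wunsch). It only computes the best alignment score, and it assumes a simple scoring scheme of +1 for a match and -1 for a mismatch or gap. Real protein alignments use amino acid substitution matrices instead, so treat the function name and scores here as illustrative choices, not a standard tool.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score between two sequences.

    Scoring values are illustrative assumptions, not a real
    amino acid substitution matrix.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per letter.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is the best of: align two letters, or insert a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]


# Two short, made-up peptide strings that differ by one missing letter.
print(global_alignment_score("MKTAY", "MKAY"))
```

The table is filled one cell at a time, so the work grows with the product of the two sequence lengths. That is manageable for a pair of proteins and is one reason faster heuristics are needed at genome scale.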

Today, computational biology is a critical component of genomic research, especially given how quickly genomic data is being produced. Computational biologists use computer tools and statistical analysis to test practical hypotheses using large-scale datasets, like genomic sequencing data. They develop and apply analytical methods and mathematical modeling techniques to study genomic systems. Computational biologists also use computational tools and approaches to organize, analyze, and visualize data from across the genomic sciences and genomic medicine research.

One example highlighting the power of computational approaches in genomics is the sequencing of the first human genome. Because the human genome is so large, it was divided into individual chromosomes during the Human Genome Project. Each chromosome was split into shorter pieces, which were sequenced using Sanger sequencing. The resulting sequencing reads were then overlapped with their neighbors and assembled into larger, contiguous stretches of DNA.

Teams of computational biologists developed algorithms to perform efficient computational alignment and scaffolding to assemble the DNA fragments produced during Sanger sequencing. As DNA sequencing technology advances, computational biologists continually create new algorithms and computer programs to assemble the cleanest, most complete genomes. 
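The idea of overlapping reads with their neighbors can be illustrated with a toy example. The sketch below repeatedly merges the two fragments that share the longest overlap until nothing more can be joined. This is a simplified greedy strategy on made-up reads, not the algorithms used in the Human Genome Project, and the function names are hypothetical. Real assemblers must also handle sequencing errors, repeats, and both DNA strands.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that matches a prefix of `right`."""
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge reads pairwise by best overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the pair, writing the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads taken from the toy sequence ATGCGTACGTTAGC, in scrambled order.
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
print(greedy_assemble(reads))
```

Each pass compares every fragment against every other one, which is fine for three reads but far too slow for the millions of reads in a genome project. That gap is part of why efficient assembly algorithms matter.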

Sequencing and assembling a genome is only the first step in the process. For the sequence to be useful, it needs to be processed and analyzed. Computational biologists use powerful computational and statistical methods to decode the functional information hidden within the billions of DNA base pairs. Computational algorithms align and compare genome sequences to a reference sequence to identify changes, or variants, in DNA. Other programs compare those variants with genome databases to discover information about each variant, including whether it has been linked with a disease before.
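In miniature, the two steps described above look something like the sketch below: compare a sample to a reference to find single-letter differences, then look each difference up in a database. It assumes the two sequences are already aligned and the same length. The sequences and the tiny "database" are invented for illustration and do not correspond to any real gene or variant.

```python
def find_variants(reference, sample):
    """List single-letter differences as (position, reference_base, sample_base).

    Positions are 1-based. Assumes both sequences are already aligned
    and have equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical stand-in for a variant database; the entry is made up.
KNOWN_VARIANTS = {
    (5, "G", "A"): "previously reported (made-up annotation)",
}

reference = "ACGTGCATTA"
sample = "ACGTACATCA"

for variant in find_variants(reference, sample):
    note = KNOWN_VARIANTS.get(variant, "not found in database")
    print(variant, "->", note)
```

Real pipelines work on billions of base pairs, handle insertions and deletions as well as single-letter changes, and weigh the quality of each sequencing read before calling a variant.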

How do you become a computational biologist?

So, computational biology sounds like the field for you. But now you’re probably wondering how to start a career in computational biology. As with many of the careers discussed in this blog series, there is no single path to becoming a computational biologist. 

Computational biologists typically need a bachelor’s degree in biology, computer science, mathematics, or a related field. Regardless of your major, success in computational biology requires a good understanding of biological systems. Taking life sciences classes in addition to computer classes is important. Some universities offer undergraduate degrees in computational biology. A master’s or doctoral degree is not usually required but can help computational biologists advance their careers or pursue a specific area of research.

As a computational biologist, you will spend much of your time designing analyses and analyzing and interpreting data from experiments and research. This includes using data analysis to identify patterns in biological data, such as gene expression or protein levels, or to identify anomalies in data, which can help pinpoint problems with experiments or research. Computational biologists often work in teams to solve complex problems, so effective communication is important for collaborating with other scientists and conveying the results of experiments. Computational biologists also often present their research at conferences and write articles for scientific journals.

Selecting and using appropriate software is key to becoming a successful computational biologist. This means understanding the underlying methods and algorithms to determine if they are suited for the dataset you are working with. Working knowledge of scripting and programming languages like R, Python, and Java is also paramount. 

Spotlight: Computational biologist John Lovell and GENESPACE  

John Lovell, PhD, is a computational biologist at the HudsonAlpha Institute for Biotechnology. He is the Evolutionary Analysis Group Lead at the HudsonAlpha Genome Sequencing Center (GSC). The GSC team specializes in high-quality genome sequencing, assembly, and analysis. The team is at the cutting edge of advanced genomic sequencing technology, continually improving computational tools, assembly, and annotation. Members of the GSC team have generated and publicly released reference genomes for more than 180 plants.

As part of his role at the GSC, Dr. Lovell performs detailed genome comparisons both within and between species, a sub-field of computational biology called “comparative genomics.” Identifying DNA sequences that have been conserved in many different organisms over evolutionary time is an important step in understanding the genome itself. Comparative genomics provides a powerful tool for understanding evolution and how living things have adapted over time. This helps build knowledge around genes and how they influence the health and survival of organisms. In the case of agricultural genomics, comparative studies let breeders connect DNA sequences to genes with known functions, which can accelerate crop improvement in the face of climate change, drought, and disease.

Although comparing closely related species has become commonplace, comparing more distantly related organisms remains difficult. One major challenge is that genes are often lost, duplicated, or repurposed during evolution. This makes it hard to know if distantly related species use similar genes for similar roles.

Dr. Lovell and other members of the GSC team recently released a new computational tool called GENESPACE that helps address these problems. GENESPACE is software that links similarities between DNA sequences to the order of genes in a genome, allowing researchers to visualize and explore related DNA sequences and determine whether genes have been lost or duplicated. The team demonstrated the value of GENESPACE by looking at evolution in vertebrates and flowering plants. The software highlighted shared sequences between unique sex chromosomes in birds and mammals. It also tracked the positions of genes important in the evolution of grass crops like maize, wheat, and rice.
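To make the idea of combining gene similarity with gene order more concrete, here is a toy sketch. It is not GENESPACE's method or its interface. It assumes each genome has already been reduced to an ordered list of gene family labels. The labels, genomes, and function names are all hypothetical.

```python
from collections import Counter


def compare_gene_content(genome_a, genome_b):
    """Report copy numbers per gene family and flag losses or duplications."""
    counts_a, counts_b = Counter(genome_a), Counter(genome_b)
    report = {}
    for family in sorted(set(counts_a) | set(counts_b)):
        a, b = counts_a[family], counts_b[family]
        if a == b:
            status = "conserved"
        elif b == 0:
            status = "absent in B"
        elif a == 0:
            status = "absent in A"
        else:
            status = "copy number differs"
        report[family] = (a, b, status)
    return report


def shared_order_runs(genome_a, genome_b):
    """Find runs of single-copy shared genes that keep the same relative order."""
    counts_a, counts_b = Counter(genome_a), Counter(genome_b)

    def is_single_copy(gene):
        return counts_a[gene] == 1 and counts_b[gene] == 1

    order_a = [g for g in genome_a if is_single_copy(g)]
    rank_b = {g: i for i, g in enumerate(g for g in genome_b if is_single_copy(g))}

    runs, current = [], []
    for gene in order_a:
        if current and rank_b[gene] == rank_b[current[-1]] + 1:
            current.append(gene)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [gene]
    if len(current) > 1:
        runs.append(current)
    return runs


# Made-up genomes: g3 is missing from B, g2 is duplicated, g5 and g6 are swapped.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g1", "g2", "g2", "g4", "g6", "g5"]

for family, details in compare_gene_content(genome_a, genome_b).items():
    print(family, details)
print(shared_order_runs(genome_a, genome_b))
```

Counting copies shows which genes were lost or duplicated, while the order check shows where neighboring genes have stayed together. Looking at both at once is the general intuition behind tying sequence similarity to gene position.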

GENESPACE is an important tool that can help researchers better understand the evolution of important sections of the genome and find target genes for applications like crop improvement. The software is easy to use, even for researchers with limited programming experience. It is freely available here.