All of Us: creating the largest genomic dataset

An Everyday DNA blog article

Written by: Sarah Sharman, PhD, Science writer
Illustrated by: Cathleen Shaw

For my birthday a few years ago, I visited New York City for the first time. As I marveled at the vast skyscrapers interspersed with beautiful historic buildings, I couldn’t help but also notice the boundless diversity of people surrounding me. People from all walks of life, from different ethnic backgrounds, and from many generations were all gathered in the same space, admiring the same buildings I was.

The United States is often referred to as a ‘melting pot’ or ‘salad bowl’ to describe the many immigrants who came to our country and continue diversifying it today. When I entered the science field, I naively assumed that our genetic databases reflected this diversity. However, I learned I was sorely mistaken. Genome-wide association studies (GWAS) primarily contain information from people of European descent, even though they make up only a fraction of our world’s population.

This lack of diversity in genomic data has consequences for everyone, because we cannot have a complete picture of human health unless everyone is included. Also, critical health-related discoveries and breakthroughs could be ineffective or even unsafe for people not of European descent. Many research initiatives are trying to increase the diversity in genetic research programs and data sets. One such program, called All of Us, was launched by the National Institutes of Health (NIH) in 2015 to enroll and collect health data from an ambitious number of Americans. Let’s learn how programs like All of Us are helping to ensure that everyone in our diverse country can benefit from the promise of genomics-informed healthcare.

The road to one million participants

Historically, health care has been approached with a one-size-fits-all mentality. Treatments are developed based on the anticipated response of an ‘average’ patient, yet expected to benefit our entire diverse population. Unfortunately for many patients, this strategy doesn’t work: drugs either flat-out fail to work for them or cause disruptive side effects. The more experts learn about human diseases, disorders, and conditions, the more they realize that where we live, how we live, where we come from, and our genetic makeup can all influence our health and our risk of developing diseases.

The All of Us Research Program will create one of the largest, most diverse public resources for biomedical research in human history. It began enrolling participants in May 2018 with the ultimate goal of enrolling more than one million Americans during the program. The objective is for the participants to reflect the rich diversity of America, including individuals of many races and ethnicities, age groups, geographic regions, gender identities, and health statuses. Increased diversity in research data will help accelerate medical breakthroughs that personalize prevention, treatment, and care for all Americans and beyond.

Individuals wishing to participate in All of Us enroll and submit much of their data through an online portal. While registration can be done from anywhere with a computer or tablet, All of Us has brick-and-mortar enrollment sites where people without access to a personal computer can complete their registration. Once they give their consent to participate in the program, individuals are asked to share their electronic health records and answer several health surveys related to basic health history, information about access to healthcare, and more.

Some participants will be invited to visit one of the All of Us partner centers to have physical measurements like height, weight, and blood pressure taken, and some even share their Fitbit fitness data. These measurements give researchers information on health conditions such as obesity and heart disease. Samples of blood and urine are also collected at the partner center so that scientists can analyze them for biomarkers of disease and use them for DNA sequencing. From these samples, researchers can look at things that naturally occur in our bodies, like cholesterol, and also at external factors that affect health, like environmental toxins and drugs.

Combining all of the information from health history, survey responses, physical measurements, blood and urine biomarkers, and DNA sequencing helps researchers gain a complete picture of an individual’s health status. By analyzing these measurements in hundreds of thousands of individuals, researchers hope to spot patterns in what makes people healthy or sick.

All of Us participants have full access to the information that the program gains from them, including results from the DNA sequencing portion. From their DNA, participants might gain information about genetic ancestry, whether they have a greater risk of developing certain hereditary diseases or health conditions, and how their bodies will react to certain medications.

The initial plan for the program spans ten years, and participants may be asked to update their health and lifestyle information from time to time over the years. Working with participants over the long term means the program can gather more information that will help researchers find out how health and disease change over time.

How is All of Us unique?

There have been other large-scale research programs over the past few decades, but they pale in comparison to All of Us in the breadth of data collected. The All of Us dataset will be the most diverse dataset, both in terms of the individuals included and the types of data collected from each participant.

Another unique feature of All of Us is its focus on being participant-centered. Participants are invited to take part in the program’s governance and help decide which direction the program should go. They can weigh in on the science and the health conditions that matter to their communities. Since all of the data goes back to the participants, they can take an active role in their own health journey.

The datasets and samples collected via All of Us will save researchers time and resources, helping to accelerate scientific breakthroughs. Although biomedical research has improved by leaps and bounds thanks to advances in technology, researchers still spend a great deal of time building up IT infrastructure, computing protocols, and data security policies for the analysis and storage of large datasets. They also invest countless hours recruiting participants for such studies. All of Us will hand scientists the largest, most diverse cohort in the world in a data format that has been cleaned and curated for ease of use.

Privacy and security are baked into the core values of All of Us. Many underrepresented groups have good reasons to be suspicious of research because of past breaches of trust by researchers, such as the Tuskegee syphilis study and the Havasupai genetic study. Building trust with communities is essential for All of Us, considering the program hopes to recruit one million diverse Americans. Accordingly, All of Us enlisted the help of privacy, security, and ethics experts to ensure clear privacy and security principles for the program.

Diverse data to benefit us all

In March 2022, All of Us released its first genomic dataset, making nearly 100,000 highly diverse whole genome sequences available through the All of Us Researcher Workbench. About half of the data is from individuals who identify with racial or ethnic groups that have historically been underrepresented in research. In addition to the genomic data, researchers can also access electronic health records and survey results through the Workbench, as well as information about the communities where participants live.

By taking into account individual differences in lifestyle, socioeconomics, environment, and biology, researchers using the All of Us dataset will be able to address questions about health and disease that have so far been unanswerable, leading to breakthroughs and discoveries that reduce persistent health disparities. This includes the discovery of environmental, genetic, biochemical, and other factors predictive of disease risk, response to therapy, and disease outcomes.

All of Us is also committed to using the most cutting-edge techniques to collect and analyze the program’s data. One example of this is using long-read sequencing technology to sequence DNA samples. The first set of long-read sequencing data is made possible by the HudsonAlpha Institute for Biotechnology in collaboration with Discovery Life Sciences (DLS), a company located on the HudsonAlpha campus. HudsonAlpha Discovery, a division of DLS, began performing long-read sequencing work with the All of Us program in 2019 under the direction of HudsonAlpha faculty investigator Dr. Shawn Levy. After an initial pilot study to select a long-read sequencing platform, HudsonAlpha is on track to complete a long-read genome for more than 2,000 All of Us participants.

The samples are being processed by the DLS lab from DNA quality control through PacBio circular consensus sequencing (CCS), and HudsonAlpha researchers are performing primary analysis for quality control. As this is one of the first and largest long-read sequencing projects of its kind, the information gained from prepping and sequencing such a large cohort is invaluable to the All of Us program. The resulting protocols have been shared with other All of Us sequencing centers to enrich the current knowledge available for PacBio CCS sequencing and to prepare other genome sequencing centers to begin processing more All of Us samples. HudsonAlpha will deliver the resulting sequencing data to the bioinformatics team at the Broad Institute for analysis, alignment, and eventual release into the All of Us Researcher Workbench, with the first release being made available this November.