NEWBORN SCREENING
Beyond Newborn Screening: A Fellow’s Bona Fide
Approach to COVID-19 with Machine Learning
By Bryce Asay, PhD, Newborn Screening Bioinformatics and Data Analytics Fellow, Utah Public Health Laboratory and Hiral Desai, MS, specialist,
bioinformatics, Newborn Screening & Genetics
With the emergence of COVID-19,
challenges on a scale not seen since the
1918 H1N1 influenza pandemic have
impacted the world. 1 But many fields
within the scientific community have
come together to develop solutions
ranging from vaccine development to
contact tracing.
At the start of the pandemic, several
members of the data science community
gathered to assist with COVID-19
bioinformatics research. After analyzing
publicly available resources, the group
determined that there is a general lack of
curated data that machine learning can
use to tackle problems such as identifying
COVID-19 vaccine targets. Machine learning
allows a computer algorithm to analyze and
make data-driven recommendations from
an immense amount of data. One solution
is to develop a viral vaccine dataset that
combines information from several
databases that have been created
and carefully vetted by infectious disease
experts and bioinformaticians.
Datasets used in machine learning require
careful curation of relevant information
in a format that is easily accessible
to the algorithms. The curation of the
dataset can impact infectious disease
research by accelerating the discovery of
new antibiotics and novel vaccine targets. 2
Even though it is a time-consuming
and tedious task, it is one of the most
important factors in determining the
success of training predictive machine
learning algorithms. 3
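To illustrate what "a format that is easily accessible to the algorithms" can mean in practice, here is a minimal sketch that turns a protein sequence into a fixed-length numeric vector (its amino acid composition). This is one common, simple encoding chosen for illustration; the article does not say which encoding the team uses, and the function name composition_vector is hypothetical.

```python
from typing import List

# One-letter codes for the 20 standard amino acids, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> List[float]:
    """Return the fraction of each standard amino acid in the sequence.

    The output always has 20 entries, so sequences of different lengths
    become rows of equal width that a learning algorithm can consume.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        # Curation step: refuse to encode entries that need manual review.
        raise ValueError(f"unexpected characters: {sorted(unknown)}")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
```

Rejecting unexpected characters instead of silently dropping them reflects the point made above: errors that slip through curation end up in the trained model.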
The Goal
The purpose of the open source machine
learning dataset is to let researchers
build vaccine machine learning models
from a dataset that is in the
correct format, is accurate and contains
only the most relevant information.
The Team
Due to the interdisciplinary nature
of the project, the expertise of
virologists, machine learning engineers,
mathematicians, bioinformaticians,
immunologists, data scientists and
software engineers is required. After
discussing the project idea, individuals
from APHL, Colorado State University
and the computer science industry came
forward to create a group of over 20
volunteers.
The Approach
Individuals with a computer science
background will extract information—also
called “web scraping”—from web sources
like the Vaccine Investigation and Online
Network (VIOLIN), 4 the Vaccine Adverse
Event Reporting System (VAERS) 5 and
PubMed. 6 The retrieved data will be
verified by virologists and immunologists
to prevent false data from being entered.
The dataset will include, but is not
limited to, target pathogen, vaccine
antigen (e.g., protein sequence), efficacy
data when available, type (e.g., conjugate)
and vaccine name. Once the dataset has
been generated, it will be made freely
available to the scientific community
with the goal of quickly beginning
model generation for potential vaccine
candidates.
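The fields listed above could be represented as a simple record with basic checks applied before release. The following is a minimal sketch under stated assumptions, not the team's actual schema: the class name, field names, the expert_verified flag and the choice of efficacy as a fraction between 0 and 1 are all hypothetical.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List, Optional

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class VaccineRecord:
    """One row of a hypothetical curated vaccine dataset."""
    vaccine_name: str
    target_pathogen: str
    vaccine_type: str                 # e.g., "conjugate"
    antigen_sequence: str             # protein sequence, one-letter codes
    efficacy: Optional[float] = None  # assumed fraction in [0, 1]; None if unavailable
    source: str = ""                  # e.g., "VIOLIN", "VAERS", "PubMed"
    expert_verified: bool = False     # set after review by a virologist or immunologist


def find_problems(record: VaccineRecord) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not record.vaccine_name.strip():
        problems.append("missing vaccine name")
    if not record.target_pathogen.strip():
        problems.append("missing target pathogen")
    seq = record.antigen_sequence.strip().upper()
    if not seq:
        problems.append("missing antigen sequence")
    elif set(seq) - AMINO_ACIDS:
        problems.append("antigen sequence has non-standard characters")
    if record.efficacy is not None and not 0.0 <= record.efficacy <= 1.0:
        problems.append("efficacy outside [0, 1]")
    return problems


def to_csv(records: List[VaccineRecord]) -> str:
    """Serialize only expert-verified, problem-free records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(VaccineRecord)])
    writer.writeheader()
    for record in records:
        if record.expert_verified and not find_problems(record):
            writer.writerow(asdict(record))
    return buffer.getvalue()
```

Automated checks like these only catch formatting errors; the expert review described above is still what decides whether a scraped entry is scientifically correct, which is why unverified records are excluded from the output.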
Challenges
As the number of volunteers keeps
growing, funding to purchase a domain
to host central communications could be
helpful in driving this project forward.
For now, the collaboration hub Slack has
been the mode of communication and
collaboration between team members.
In addition, obtaining data involves
filing Freedom of Information Act
(FOIA) requests, which can take a long
time to process. Lastly, cloud computing
resources would support the development of
machine learning tools that let biologists
train their own algorithms and
visualize the data.

During his PhD, Asay studied Mycobacterium
tuberculosis using computational modeling
and machine learning algorithms before
making the transition to newborn screening.
He is currently completing his fellowship in
newborn screening bioinformatics with the
Utah Public Health Laboratory.

Next Steps
As of publication, the team has web
scraped and cleaned all publicly available
resources and is waiting to hear back
on FOIA requests for access to federal
databases.
Machine learning is an amazing tool, but
it requires an extensive amount of data
to generate the algorithms. Databases
provide a valuable resource to machine
learning researchers to collect and explore
important information. Data scientists
who may not be as familiar with the
source material may have a difficult
time parsing out which data points are
the most relevant to their analysis.
Curated machine learning vaccine
datasets can be applied not just to the
COVID-19 pandemic but also to other
healthcare fields. •
References
1. History of 1918 Flu Pandemic | Pandemic Influenza (Flu) | CDC [Internet]. [cited 2020 June 1] Available from: https://www.cdc.gov/flu/pandemic-resources/1918-commemoration/1918-pandemic-history.htm (2019).
2. HealthITAnalytics. New Initiative Uses Artificial Intelligence for Vaccine Development. HealthITAnalytics [Internet]. [cited 2020 June 1] Available from: https://healthitanalytics.com/news/new-initiative-uses-artificial-intelligence-for-vaccine-development (2020).
3. Hemedan A. A. et al. Prediction of the Vaccine-derived Poliovirus Outbreak Incidence: A Hybrid Machine Learning Approach. Sci. Rep. 10, 5058 (2020).
4. He Y. et al. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Res. 42, D1124–D1132 (2014).
5. Vaccine Adverse Event Reporting System (VAERS) [Internet]. Available from: https://vaers.hhs.gov/.
6. Coronavirus (COVID-19) [Internet]. National Institutes of Health (NIH). [cited 2020 June 1] Available from: https://www.nih.gov/coronavirus (2020).
26 LAB MATTERS Summer 2020