Lab Matters Summer 2020 | Page 28

NEWBORN SCREENING

Beyond Newborn Screening: A Fellow’s Bona Fide Approach to COVID-19 with Machine Learning

By Bryce Asay, PhD, Newborn Screening Bioinformatics and Data Analytics Fellow, Utah Public Health Laboratory and Hiral Desai, MS, specialist, bioinformatics, Newborn Screening & Genetics

With the emergence of COVID-19, challenges on a scale not seen since the 1918 H1N1 influenza pandemic have impacted the world.¹ But many fields within the scientific community have come together to develop solutions ranging from vaccine development to contact tracing. At the start of the pandemic, several members of the data science community gathered to assist with COVID-19 bioinformatics research. After analyzing publicly available resources, the group determined that there is a general lack of curated data suitable for applying machine learning to problems such as identifying COVID-19 vaccine targets.

Machine learning allows a computer algorithm to analyze an immense amount of data and make data-driven recommendations from it. One solution is to develop a viral vaccine dataset that combines information from a variety of databases that have been created and carefully vetted by infectious disease experts and bioinformaticians.

Datasets used in machine learning require careful curation of relevant information in a format that the algorithms can easily access. Curating such a dataset can impact infectious disease research by accelerating the discovery of new antibiotics and novel vaccine targets.² Even though curation is a time-consuming and tedious task, it is one of the most important factors in determining the success of training predictive machine learning algorithms.³

The Goal

The purpose of the open source machine learning dataset is to give researchers a basis for vaccine machine learning modeling: a dataset that is in the correct format, is accurate and contains only the most relevant information.
The Team

Due to the interdisciplinary nature of the project, the expertise of virologists, machine learning engineers, mathematicians, bioinformaticians, immunologists, data scientists and software engineers is required. After discussing the project idea, individuals from APHL, Colorado State University and the computer science industry came forward to create a group of over 20 volunteers.

The Approach

Individuals with a computer science background will extract information, a process also called “web scraping,” from web sources like the Vaccine Investigation and Online Network (VIOLIN),⁴ the Vaccine Adverse Event Reporting System (VAERS)⁵ and PubMed.⁶ The retrieved data will be verified by virologists and immunologists to prevent false data from being entered. The dataset will include, but is not limited to, target pathogen, vaccine antigen (e.g., protein sequence), efficacy data when available, type (e.g., conjugate) and vaccine name. Once the dataset has been generated, it will be made freely available to the scientific community with the goal of quickly beginning model generation for potential vaccine candidates.

During his PhD, Asay studied Mycobacterium tuberculosis using computational modeling and machine learning algorithms before making the transition to newborn screening. He is currently completing his fellowship in newborn screening bioinformatics with the Utah Public Health Laboratory.

Challenges

As the number of volunteers keeps growing, funding to purchase a domain to host central communications could be helpful in driving this project forward. For now, the collaboration hub Slack has been the mode of communication and collaboration between team members. In addition, obtaining data involves generating Freedom of Information Act (FOIA) requests, which can take a long time to process. Lastly, cloud computing resources would support the development of machine learning tools that let biologists train their own algorithms and visualize the data.
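To make the record format described under The Approach more concrete, the sketch below shows one way a single dataset entry and a basic automated check might look in Python. It is only an illustration: the class name `VaccineRecord`, the field names, the helper functions, and the choice to store efficacy as a fraction between 0 and 1 are all assumptions, not the team's actual schema. Checks like these could at most complement the review by virologists and immunologists, not replace it. The sample values are placeholders, not real vaccine data.

```python
from dataclasses import dataclass
from typing import List, Optional

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass(frozen=True)
class VaccineRecord:
    """One entry of the curated dataset (hypothetical field names)."""
    vaccine_name: str
    target_pathogen: str
    vaccine_type: str                 # e.g. "conjugate"
    antigen_sequence: str             # protein sequence, one-letter codes
    efficacy: Optional[float] = None  # assumed fraction in [0, 1], if available
    source: str = "unknown"           # e.g. "VIOLIN", "VAERS", "PubMed"


def validation_errors(record: VaccineRecord) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    if not record.vaccine_name.strip():
        errors.append("missing vaccine name")
    if not record.target_pathogen.strip():
        errors.append("missing target pathogen")
    unexpected = set(record.antigen_sequence.upper()) - AMINO_ACIDS
    if not record.antigen_sequence or unexpected:
        errors.append("antigen is not a valid protein sequence")
    if record.efficacy is not None and not 0.0 <= record.efficacy <= 1.0:
        errors.append("efficacy outside [0, 1]")
    return errors


def merge_sources(*sources: List[VaccineRecord]) -> List[VaccineRecord]:
    """Combine records, keeping the first valid entry per (name, pathogen)."""
    merged = {}
    for records in sources:
        for rec in records:
            key = (rec.vaccine_name.lower(), rec.target_pathogen.lower())
            if key not in merged and not validation_errors(rec):
                merged[key] = rec
    return list(merged.values())


if __name__ == "__main__":
    # Placeholder records only; names and sequences are made up.
    from_violin = [VaccineRecord("Example-Vax", "Example virus", "conjugate",
                                 "MKTAYIAK", 0.9, "VIOLIN")]
    from_pubmed = [VaccineRecord("example-vax", "example virus", "conjugate",
                                 "MKTAYIAK", source="PubMed"),
                   VaccineRecord("Bad-Vax", "Example virus", "subunit", "MKT123")]
    for rec in merge_sources(from_violin, from_pubmed):
        print(rec)
```

Keeping each record immutable (`frozen=True`) and tagging it with its source is one possible design choice: it would let reviewers trace an entry back to the database it came from when they verify it.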
Next Steps

As of publication, the team has web scraped and cleaned all publicly available resources and is waiting to hear back on FOIA requests for access to federal databases. Machine learning is an amazing tool, but it requires an extensive amount of data to generate the algorithms. Databases provide a valuable resource for machine learning researchers to collect and explore important information. Data scientists who may not be as familiar with the source material may have a difficult time parsing out which data points are the most relevant to their analysis. In particular, machine learning vaccine datasets can be applied not just to the COVID-19 pandemic but also to other healthcare fields.

References

1. History of 1918 Flu Pandemic | Pandemic Influenza (Flu) | CDC [Internet]. [cited 2020 June 1] Available from: https://www.cdc.gov/flu/pandemic-resources/1918-commemoration/1918-pandemic-history.htm (2019).
2. HealthITAnalytics. New Initiative Uses Artificial Intelligence for Vaccine Development. HealthITAnalytics [Internet]. [cited 2020 June 1] Available from: https://healthitanalytics.com/news/new-initiative-uses-artificial-intelligence-for-vaccine-development (2020).
3. Hemedan A. A. et al. Prediction of the Vaccine-derived Poliovirus Outbreak Incidence: A Hybrid Machine Learning Approach. Sci. Rep. 10, 5058 (2020).
4. He Y. et al. Updates on the web-based VIOLIN vaccine database and analysis system. Nucleic Acids Res. 42, D1124–D1132 (2014).
5. Vaccine Adverse Event Reporting System (VAERS) [Internet]. Available from: https://vaers.hhs.gov/.
6. Coronavirus (COVID-19) [Internet]. National Institutes of Health (NIH). [cited 2020 June 1] Available from: https://www.nih.gov/coronavirus (2020).