A big problem for sick kids: a case study in predicting features for yet-to-be-hypothesized projects in molecularly based biomedical research on pediatric brain tumors

Abstract: As our ability to analyze the molecular underpinnings of cancer increases and its cost falls, we are seeing advocacy for research and precise treatments for those suffering from some of the rarest and deadliest cancers[1],[2]. As we discussed in our presentation at ODSC East in 2016, researchers at the Children’s Brain Tumor Tissue Consortium (www.cbttc.org), a multi-national and multi-institutional pediatric brain tumor biobank, are making a significant effort in the specific case of rare pediatric brain tumors to break down silos and share data for this severely scientifically underserved population. The CBTTC provides not only biospecimens but also ongoing, temporal sequence-based clinical data to researchers on request, in real time, as patients are going through their treatment or even after they succumb to the disease[2]. As the resource grows, so does the human data entry burden. Today, the CBTTC includes over 2,700 research subjects and has accessioned and/or derived over 40,000 specimens of blood, tumor tissue, DNA, RNA and other biological samples from over 40 distinct brain tumor types. This rich resource fuels over 30 biomedical research projects and 31 data science projects around children’s brain tumors. Two years later, the consortium is running into another problem: the expected research projects are yet to be hypothesized, and the consortium needs to stay one step ahead of data requests for data science projects or face a backlog of thousands of medical records to research. In short, we do not know how these specimens will be used or what information will be needed in the future. We cannot predict medical technology trends even three years into the future.

For this case study, we will demonstrate that more sophisticated deep learning techniques are needed to predict what information can be used to automatically supplement (annotate) biological samples extracted from participants with rich temporal clinical (phenotypic) data. These data are pulled and associated automatically, without the aid of human chart abstraction, from public insurance records that are rarely used in cancer research. We look at a specific completed study in which Children’s Hospital of Philadelphia (CHOP) and Seattle Children’s Hospital (SCH) collaborated to inform more precise molecular diagnosis of pediatric supratentorial malignant cortical brain tumors and to correlate the findings with clinical outcomes, including overall survival, and we find the features used to predict the disease of interest. Through traditional exploratory approaches, we found the data to be highly variable, difficult to enumerate, and of a cardinality that lends itself more to text mining, even though the fields are actually discrete data fields. We will demonstrate how we reverse the logic of machine learning practice to use the predictor features as data. We hypothesize that we can devise a machine learning approach that automatically pulls significant data from large insurance-based medical records, which are rarely used in this type of research, to get at least a head start on scientific projects proposed to the organization. This case study will illustrate the problems in medical data and why such data does not lend itself to the usual machine learning techniques once it is prepared in a way that does not lose any information that could be relevant to future research.

1. Graham, C., Dawkins, H., Baynam, G., Lockmuller, H., Bushby, K., Monaco, L., … Molster, C. (2014). Current trends in biobanking for rare diseases: a review. Journal of Biorepository Science for Applied Medicine, 49. http://doi.org/10.2147/BSAM.S46707
2. David Stokes, PhD. (11/01/2018). Better Biobanking a Best Bet for Rare Disease Research | Cornerstone - CHOP Research Institute Blog. Retrieved November 1, 2018, from https://blog.research.chop.edu/guest-blog-better-biobanking-a-best-bet-for-rare-disease-research

Bio: Alex S. Felmeister is a Supervisor of Data Integration at The Children’s Hospital of Philadelphia’s (CHOP) Department of Biomedical and Health Informatics, where he and a small group of developers and data integration analysts apply new techniques to integrate complex health data elements for scientific projects, usually involving imaging, genomics and massive amounts of clinical data in diverse domains. Most of these projects center on rare disease research, where scientists need access to data as soon as it is available. He is also a doctoral candidate in Information Science at Drexel University in Philadelphia, PA, and will be defending his research in April of 2019. He holds a Master’s in Information Systems from Drexel and is currently studying the derivation and use of validated, data-driven phenotypes from large sets of observational data from Clinical Data Research Networks as annotation for rare brain tumors, using predictive analytics to reduce a time-consuming, mostly human process. He hopes to expand this research to other rare diseases where there is an increasing need for complex data in real time to supplement the newest molecular analysis techniques, all while protecting research participant privacy. Alex’s research is funded by the Mellon Foundation through a fellowship program at the Council on Library and Information Resources.