Machine Learning Across Multiple Imaging and Biomarker Modalities in the UK Biobank Improves Genetic Discovery for Liver Fat Accumulation


Metabolic Dysfunction-Associated Steatotic Liver Disease (MASLD), a condition where the liver contains more than 5.5% fat, is a major risk factor for chronic liver disease, affecting an estimated 30% of people worldwide. Although MASLD is a genetically complex disease, large-scale case-control cohort studies based on MASLD diagnosis have shown only limited success in discovering genes responsible for MASLD. This is largely due to the challenges in measuring disease characteristics accurately and efficiently: measurement is often expensive, time-consuming, and inconsistent.

In this study, we showcase the power of machine learning (ML) in addressing these challenges. We used ML to predict the amount of fat in the liver from three types of UK Biobank data: body composition scans from dual-energy X-ray absorptiometry (DXA), plasma metabolites, and a combination of anthropometric and blood-based biochemical markers (biomarkers). For DXA-based predictions, we used a deep learning model, EfficientNet-B0, to predict fat content from the DXA scans. For predictions based on metabolites and biomarkers, we used a gradient boosting model, XGBoost. Our ML models estimated that up to 29% of UK Biobank participants met the criteria for MASLD, while fewer than 10% had received a clinical diagnosis. We then used these estimates to identify regions of the genome associated with liver fat, finding a total of 321 unique regions, including 312 new ones, which significantly expands our understanding of the genetic determinants of liver fat accumulation. Our ML-based genetic findings showed a high genetic correlation with clinically diagnosed MASLD, suggesting that the regions we identified are also likely to be relevant for understanding and diagnosing the disease in a clinical setting. This strong correlation underscores the potential of our approach to contribute to real-world medical applications. Together, these findings highlight the value of ML in identifying disease-related genes and predicting disease risk for complex diseases like MASLD, and the potential of data science to help transform healthcare research and improve patient outcomes.
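As a rough illustration of how model predictions translate into a prevalence estimate, the sketch below applies the 5.5% liver fat threshold quoted above to a list of predicted values. It is a minimal sketch, not the study's pipeline: the function name `estimated_prevalence` and the toy predictions are hypothetical, and the numbers are made up rather than drawn from the UK Biobank.

```python
from typing import Sequence

# Threshold quoted in the abstract: liver fat above 5.5% defines MASLD.
MASLD_FAT_THRESHOLD_PCT = 5.5


def estimated_prevalence(predicted_fat_pct: Sequence[float],
                         threshold: float = MASLD_FAT_THRESHOLD_PCT) -> float:
    """Return the fraction of predictions strictly above the threshold."""
    if not predicted_fat_pct:
        raise ValueError("need at least one prediction")
    n_above = sum(1 for fat in predicted_fat_pct if fat > threshold)
    return n_above / len(predicted_fat_pct)


# Made-up predictions for ten hypothetical participants (not UK Biobank data).
toy_predictions = [2.1, 3.4, 6.0, 4.9, 12.3, 5.5, 1.8, 7.2, 3.0, 4.4]
prevalence = estimated_prevalence(toy_predictions)
print(f"{prevalence:.0%} of toy participants exceed "
      f"{MASLD_FAT_THRESHOLD_PCT}% liver fat")
```

In practice the predicted values would come from a trained model such as the EfficientNet-B0 or XGBoost models mentioned above; here they are hard-coded so the thresholding step can be read in isolation.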


Sumit Mukherjee is a Staff Machine Learning Scientist at insitro. He holds a Ph.D. in Electrical & Computer Engineering from the University of Washington. At insitro, he develops machine learning models that derive disease-relevant traits from clinical data, as well as tools to evaluate the utility of such traits for drug discovery. Previously, he was a Senior Applied Scientist at Microsoft's AI for Good Research Lab, where he developed novel generative AI tools to enable privacy-preserving data sharing in healthcare.

Open Data Science



