Prediction of Polygenic Risk Score by Machine Learning and Deep Learning Methods in Genome-wide Association Studies: Methodological Study

Ragıp Onur ÖZTORNACI^a , Erdal COŞGUN^b , Cemil ÇOLAK^c , Bahar TAŞDELEN^d
^aUniversity of Bristol, Department of Population Health Sciences, The MRC Integrative Epidemiology Unit, Bristol, UK
^bMicrosoft Research, USA
^cİnönü University Faculty of Medicine, Department of Biostatistics, Malatya, Türkiye
^dMersin University Faculty of Medicine, Department of Biostatistics, Mersin, Türkiye

Turkiye Klinikleri J Biostat. 2025;17(1):28-39

doi: 10.5336/biostatic.2025-108361

ABSTRACT
Objective: We aimed to investigate whether machine learning (ML) and deep learning (DL) methods, applied to individual-level data from genome-wide association studies (GWAS), could serve as a viable alternative to traditional polygenic risk score (PRS) calculation methods, which rely on odds ratios as weights. PRS is widely used to estimate genetic susceptibility to diseases, but its accuracy and generalizability can be affected by variations in allele frequencies and sample sizes. Given the advancements in ML and DL techniques, we explored their potential for improving risk prediction.
Material and Methods: We generated GWAS datasets using the PLINK program, simulating genetic data under various conditions by varying allele frequencies and sample sizes. This process was repeated 100 times to assess the robustness of the approaches. We applied two ML algorithms, Support Vector Machine and Random Forest, alongside a DL approach. The predictive performance of these methods was compared with that of the traditional PRS calculation, which uses odds ratios as weights.
Results: Our findings showed that ML and DL methods provided more consistent case-control separation than the classical approach. Additionally, they exhibited reduced bias and greater stability across different genetic conditions.
Conclusion: ML and DL approaches present a promising alternative to odds ratio-based PRS calculations, offering enhanced reliability and consistency in genetic risk prediction.
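For readers unfamiliar with the classical approach used as the baseline, an odds ratio-weighted PRS is conventionally a weighted sum of each individual's risk-allele counts. The following is a minimal sketch, not the authors' implementation. It assumes the common convention of using the natural logarithm of each variant's odds ratio as the weight (the abstract does not state whether raw or log-transformed odds ratios were used), and the function name, individual IDs, odds ratios, and genotypes are all hypothetical illustration values.

```python
import math


def polygenic_risk_score(dosages, odds_ratios):
    """Return the sum over variants of (risk-allele count * log odds ratio).

    dosages: risk-allele counts (0, 1, or 2) for one individual.
    odds_ratios: per-variant odds ratios, in the same variant order.
    Assumption: log(OR) weights, a common convention that the abstract
    does not confirm.
    """
    if len(dosages) != len(odds_ratios):
        raise ValueError("one odds ratio is required per variant")
    return sum(d * math.log(or_j) for d, or_j in zip(dosages, odds_ratios))


# Hypothetical toy data: three variants, two individuals.
odds_ratios = [1.20, 0.90, 1.05]
genotypes = {
    "case_1": [2, 0, 1],
    "control_1": [0, 2, 1],
}

scores = {
    iid: polygenic_risk_score(g, odds_ratios) for iid, g in genotypes.items()
}
for iid, score in scores.items():
    print(iid, round(score, 4))
```

In this toy example the individual carrying two copies of the allele with an odds ratio above 1 receives the higher score. The ML and DL methods described in the abstract instead learn the mapping from genotypes to case-control status directly from individual-level data.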

Keywords: Genome-wide association studies; polygenic risk score; deep learning; machine learning; precision medicine

ABSTRACT (translated from the Turkish ÖZET)
Objective: In this study, we aimed to investigate the applicability of machine learning (ML) and deep learning (DL) methods, using individual-level information obtained from genome-wide association study (GWAS) data, as an alternative to traditional approaches that use odds ratios as weights in polygenic risk score (PRS) calculation. PRS is widely used to estimate genetic susceptibility to diseases; however, its accuracy and generalizability can be affected by differences in allele frequencies and sample sizes. Considering recent advances in ML and DL techniques, we evaluated whether these methods could improve risk prediction.
Material and Methods: GWAS datasets were generated with the PLINK program under different allele frequencies and sample sizes, with 100 repetitions. Two ML algorithms (Support Vector Machine and Random Forest) and one DL approach were then applied to these datasets. The performance of these methods was compared with the classical PRS calculation method, which uses odds ratios as weights.
Results: The ML and DL approaches produced more consistent results in case-control separation than the classical method. They were also observed to show less bias and greater stability under different allele frequencies and sample sizes.
Conclusion: ML- and DL-based methods stand out as an alternative for PRS calculation, offering more reliable and consistent risk prediction than traditional odds ratio-based approaches.

Keywords: Genome-wide association studies; polygenic risk score; deep learning; machine learning; precision medicine
