Objective: This study investigates the impact of class imbalance on the performance of Cox-based survival models, an important issue in clinical research where event rates (e.g., death or disease recurrence) are typically low. Unlike previous studies that apply resampling techniques to correct imbalance, we preserved the original data structure to evaluate model robustness under realistic conditions. Material and Methods: Six modeling approaches were compared: the Cox proportional hazards model, the weighted Cox model, three regularized Cox models [least absolute shrinkage and selection operator (LASSO-Cox), Ridge-Cox, and Elastic Net-Cox], and a Bayesian Cox model. Simulations were conducted across varying sample sizes (n = 50, 100, 250, and 500) and imbalance ratios (r = 0.1, 0.2, 0.3, 0.4, and 0.5) to evaluate each model's statistical power and estimation accuracy. Results: The Bayesian Cox model consistently achieved the highest statistical power and estimation precision under severe imbalance and small sample sizes. However, its advantage diminished as sample size increased, with its power converging to that of the standard Cox model. Among the regularized Cox models, Ridge-Cox produced the most stable estimates, with narrower confidence intervals than LASSO-Cox and Elastic Net-Cox. In contrast, the weighted Cox model consistently underperformed, showing lower power and unstable estimates across all scenarios. Conclusion: These findings emphasize the importance of choosing a modeling strategy that suits the event rate and sample size. In scenarios with few observed events, it is generally more effective to apply model-based adjustments than to alter the original data distribution, which may distort event prevalence and compromise generalizability. The performance of Cox-based models improves as sample size increases; in small-sample, high-imbalance settings, however, the use of inherently robust models becomes more critical.
Keywords: Survival analysis; class imbalance; Cox models; simulation
Objective: This study examines the effect of group imbalance, a common problem in clinical research where the rate of the event of interest (e.g., death or disease recurrence) is usually low, on the performance of Cox-based survival models. Unlike previous studies, no resampling methods were applied to correct the imbalance; instead, the original data structure was preserved and the robustness of the models under realistic conditions was evaluated. Material and Methods: The performance of six Cox-based models was compared: the Cox proportional hazards model, the weighted Cox model, penalized Cox regression models [least absolute shrinkage and selection operator (LASSO-Cox), Ridge-Cox, and Elastic Net-Cox], and the Bayesian Cox model. A simulation study was conducted across different sample sizes (n = 50, 100, 250, 500) and imbalance ratios (r = 0.1, 0.2, 0.3, 0.4, 0.5). The results were used to evaluate each model's statistical power and estimation accuracy. Results: The Bayesian Cox model performed best in terms of statistical power and estimation accuracy, particularly under severe imbalance and small sample sizes. However, as the sample size increased, its performance became similar to that of the classical Cox model. Among the penalized Cox models, Ridge regression provided the most stable estimates, yielding narrower confidence intervals than LASSO-Cox and Elastic Net-Cox. In contrast, the weighted Cox model was the weakest-performing method in all scenarios. Conclusion: This study highlights the importance of the choice of modeling strategy. When the event of interest has few observations, altering the original data distribution changes the prevalence of the event and reduces generalizability, so model-based methods are recommended in this setting. The power of Cox-based models increases with sample size; however, robust models should be used for data with few observations and a high imbalance ratio.
Keywords: Survival analysis; class imbalance; Cox models; simulation
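The simulation design summarized in the abstract (varying n and r, then measuring power) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that are not in the source: r is read as the proportion of subjects with an observed event, there is a single binary covariate with a hypothetical true log hazard ratio `beta=0.7`, event times are exponential, and censoring is administrative at a fixed cutoff. The function names `simulate`, `fit_cox`, and `power` are illustrative only. Setting `lam > 0` adds a ridge penalty, which stands in for the Ridge-Cox idea in its simplest form.

```python
import math
import random


def simulate(n, r, beta=0.7, seed=0):
    """Return n tuples (time, event, x) with exactly max(1, round(r * n)) events.

    Assumption: r is the fraction of subjects whose event is observed.
    """
    rng = random.Random(seed)
    raw = []
    for _ in range(n):
        x = rng.randint(0, 1)                    # binary covariate
        t = rng.expovariate(math.exp(beta * x))  # hazard = exp(beta * x)
        raw.append((t, x))
    k = max(1, round(r * n))
    cutoff = sorted(t for t, _ in raw)[k - 1]    # k-th smallest event time
    # Administrative censoring: everything after the cutoff is censored.
    return [(min(t, cutoff), int(t <= cutoff), x) for t, x in raw]


def fit_cox(data, lam=0.0, iters=25):
    """Newton-Raphson for a one-covariate Cox partial likelihood.

    lam > 0 adds a ridge penalty lam * beta**2 / 2. Returns (beta_hat, se),
    where se comes from the (penalized) information at the last iterate.
    """
    # Descending time, censored before events at ties, so each event's
    # risk set contains every subject still under observation at that time.
    rows = sorted(data, key=lambda row: (-row[0], row[1]))
    beta, info = 0.0, 0.0
    for _ in range(iters):
        s0 = s1 = s2 = 0.0
        grad, info = -lam * beta, lam
        for _, event, x in rows:
            w = math.exp(beta * x)
            s0 += w
            s1 += w * x
            s2 += w * x * x
            if event:
                mean = s1 / s0                   # weighted mean of x in risk set
                grad += x - mean
                info += s2 / s0 - mean * mean    # weighted variance of x
        if info <= 0:
            break
        step = grad / info
        beta += step
        if abs(step) < 1e-8:
            break
    se = 1.0 / math.sqrt(info) if info > 0 else math.inf
    return beta, se


def power(n, r, reps=200, beta=0.7):
    """Share of replicates in which a two-sided Wald test rejects beta = 0."""
    hits = 0
    for seed in range(reps):
        b, se = fit_cox(simulate(n, r, beta, seed))
        if abs(b / se) > 1.96:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (50, 100, 250):
        print(n, power(n, 0.1, reps=100))
```

With very few events the unpenalized estimate can drift toward large absolute values, because all events may fall in one covariate group. This is the kind of instability that penalized or Bayesian variants are meant to dampen. The sketch covers only the plain and ridge-penalized fits; the weighted, LASSO, Elastic Net, and Bayesian models would need dedicated software.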