← all papers Β· overview

CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Abstract

arXiv:2502.03946v5 Announce Type: replace Abstract: Data preprocessing is often paid little attention in machine learning, despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like time-to-event models for censored data. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents CleanSurvival, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables. It builds upon Learn2Clean's Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The Python package is available on GitHub: https://github.com/datasciapps/CleanSurvival. Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing can improve predictive performance relative to simple baselines, while runtime behavior is condition-dependent and most clearly interpretable in the best-covered benchmark cells. Furthermore, a simulation study demonstrates effectiveness across different types and levels of missingness and noise. With an increase in the use of machine learning, it becomes important to generalise AutoML pipelines to a variety of models now present, including survival analysis. Tools like CleanSurvival, which integrate preprocessing for survival analysis, can make survival studies easier and quicker to perform, as well as make the results more robust.

Code

Related papers