| Abstract |
Despite the growing use of artificial intelligence, limited data availability and privacy concerns restrict its clinical application. This study aimed to develop a synthetic data generation model as a promising solution to these problems, enabling prediction of postoperative acute kidney injury (PO-AKI) even with a relatively small real-world dataset. We developed a synthetic model to generate virtual patient data, incorporating comorbidities, laboratory results, medication history, surgical details, and PO-AKI occurrence in patients who underwent major non-cardiac surgery. The model was built on the BERT architecture and trained using real-world data from data-rich hospitals. Privacy risks were evaluated through membership and attribute inference attacks (MIA and AIA). The similarity between synthetic and real-world data was assessed statistically, and clinical utility was evaluated by examining whether augmenting data-scarce cohorts with exact-matched synthetic data improved PO-AKI prediction using CatBoost. Real-world data from 335,687 patients were collected from six tertiary hospitals, including 275,727 from three data-rich hospitals and 59,960 from three data-scarce hospitals. The similarity between the real-world data from the data-rich hospitals, which served as the training set for the synthetic generation model, and the synthetic data from each hospital was analyzed (Table 1). At SNUH, 90.4% of variables showed no statistically significant difference between real-world and synthetic data, compared with 89.0% at SNUBH and 94.4% at AMC. The MIA and AIA confirmed the privacy protection of the synthesized data. The clinical utility of synthetic data in PO-AKI prediction was evaluated by augmenting real-world data-scarce cohorts (250–2,000 patients) with synthetic data. The benefit was most pronounced in smaller cohorts, peaking at 2,000–4,000 synthetic patients and plateauing beyond 16,000 (Figure 1). This is the first study to apply generative AI to PO-AKI prediction. 
We comprehensively demonstrate its clinical utility in data-scarce scenarios by enhancing prediction performance through synthetic data augmentation. |