Abstract

Drug-tolerant persister cells (DTPs) are cancer cells that evade therapy through a reversible, non-genetic mechanism, leading to therapeutic resistance and cancer recurrence. Single-cell RNA sequencing (scRNA-seq) data are a promising modality for studying the mechanisms behind DTP survival. The first step of such studies is to classify persister versus non-persister cells. Today, bioinformaticians identify persister cells manually from RNA sequencing data, a tedious and time-consuming task.

This paper introduces the first end-to-end, fully automatic pipeline for DTP classification using an adapted Transformer-based scGPT model. Our contributions are as follows:

(1) Persister cells are transient and reversible in nature. Even when a cell in our training data is labeled as a persister, it is better to tell the model that the cell has some probability of being a persister and some probability of being a non-persister. The default one-hot cross-entropy loss of the scGPT model assumes label certainty and can lead to overfitting. To overcome this, we fine-tuned a pre-trained scGPT model using a label-smoothing cross-entropy loss that encourages the model to assign probability to both classes.

(2) Single-cell data is sparse; many genes are not expressed, and non-expressed genes add noise to the model input. We used statistical analysis to identify significant genes across cell lines, yielding a threshold that our pipeline uses to keep only highly variable genes as model inputs.

(3) When assigning data points to cross-validation folds, the original scGPT pipeline stratifies only by cell type. This results in an imbalanced distribution of data points from different batches across folds, preventing the model from generalizing to unseen batches. We stratified the data by both cell type and batch, mitigating the batch effect.

We evaluated our pipeline on a proprietary dataset processed through the Persist-seq baseline scRNA-seq pipeline, comprising 41,071 data points from four non-small cell lung cancer cell lines. We performed 4-fold cross-validation, with each fold using data from three cell lines for training and data from the fourth cell line for testing. Averaged over the test folds, our model achieved 90% accuracy and an F1 score of 0.89. Our findings suggest that Transformer-based deep learning pipelines are effective alternatives to traditional manual approaches for persister cell classification. Future research will focus on developing a multi-step approach in which the initial step classifies cancer versus non-cancer cells, followed by persister cell classification.

Citation Format: Balaji Selvaraj, Abinaya Vairam, Pablo Andres Moreno Cortez, Ramya Tanikanti, Anca M. Farcas, Yi Wei, Martin L. Miller, Viia E. Valge-archer, Ultan McDermott. AI-Driven Identification of Drug-Tolerant Persister Cells in Lung Cancer [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 6312.
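
Illustrative note: the label-smoothing cross-entropy loss mentioned in the abstract can be sketched for a single cell as follows. This is a minimal sketch of the standard formulation, not the authors' implementation. The function names, the two-class logit layout, and the smoothing value `epsilon = 0.1` are assumptions; the abstract does not report the value used.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one cell's class logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def label_smoothing_cross_entropy(
    logits: Sequence[float], target: int, epsilon: float = 0.1
) -> float:
    """Cross-entropy against a smoothed target distribution.

    The one-hot label is mixed with a uniform distribution over the
    k classes: the labeled class gets (1 - epsilon) + epsilon / k and
    every other class gets epsilon / k. epsilon = 0.1 is an assumed
    value for illustration only.
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for j, log_p in enumerate(log_probs):
        q = epsilon / k + ((1.0 - epsilon) if j == target else 0.0)
        loss -= q * log_p
    return loss


# Hypothetical logits for [non-persister, persister]; target 1 = persister.
hard = label_smoothing_cross_entropy([-10.0, 10.0], target=1, epsilon=0.0)
soft = label_smoothing_cross_entropy([-10.0, 10.0], target=1, epsilon=0.1)
print(f"one-hot loss: {hard:.6f}, smoothed loss: {soft:.6f}")
```

With `epsilon = 0` the function reduces to ordinary one-hot cross-entropy. With `epsilon > 0`, an extremely confident prediction is penalized even when it matches the label, which is the mechanism that discourages the model from treating a persister label as certain.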