Objective: The purpose of the current study was to develop and validate a biomarker-based prediction model for metastasis in patients with colorectal cancer (CRC).

Methods: Two datasets, GSE68468 and GSE41568, were retrieved from the Gene Expression Omnibus (GEO) database. In the GSE68468 dataset, key biomarkers were identified through a screening process involving differential expression analysis, redundancy analysis, and recursive feature elimination. Subsequently, the prediction model was developed and internally validated using five machine learning (ML) algorithms: lasso and elastic-net regularized generalized linear model (glmnet), k-nearest neighbors (kNN), support vector machine (SVM) with a radial basis function kernel, random forest (RF), and eXtreme Gradient Boosting (XGBoost). The predictive performance of the algorithm with the highest accuracy was then externally validated on the GSE41568 dataset.

Results: Among 22,283 registered genes in the GSE68468 dataset, the screening process identified 16 key genes (MMP3, CCDC102B, CDH2, SCGB1A1, KRT7, CYP1B1, LAMC3, ALB, DIXDC1, VWF, MMP1, CYP4B1, NKX3-2, TMEM158, GADD45B, and SERPINA1), which were used to build the prediction model. On the internal validation dataset, the prediction performance of the five ML algorithms was as follows: RF (accuracy = 0.97, kappa = 0.91), XGBoost (0.93, 0.81), kNN (0.93, 0.81), glmnet (0.93, 0.82), and SVM (0.92, 0.80). The top five biomarkers were MMP3, CCDC102B, CDH2, VWF, and MMP1. The RF model exhibited an accuracy of 0.97, a kappa value of 0.92, and an area under the curve (AUC) of 0.99 in the external validation dataset.

Conclusion: Using ML algorithms, this study identified biomarkers that help to identify patients with CRC who are prone to metastasis.
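
The abstract reports both accuracy and Cohen's kappa for each algorithm. Kappa corrects the observed agreement for the agreement expected by chance, which is why it is lower than accuracy when the classes are imbalanced. The following is a minimal sketch of how the two metrics relate, not the study's own code: the function name `accuracy_and_kappa` and the toy labels are hypothetical and are not taken from GSE68468 or GSE41568.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement p_o: fraction of samples predicted correctly (accuracy).
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Expected chance agreement p_e, from the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    # Kappa is undefined when p_e == 1 (a single class everywhere);
    # returning 0.0 in that case is a convention assumed for this sketch.
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return p_o, kappa


# Hypothetical toy labels (1 = metastatic, 0 = non-metastatic); not study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
acc, kappa = accuracy_and_kappa(y_true, y_pred)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

On these toy labels, 8 of 10 predictions are correct, but about half of that agreement would be expected by chance given the label frequencies, so kappa is markedly lower than accuracy. The same reasoning applies when comparing the accuracy and kappa values reported for each algorithm.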