Intraoperative hypotension (IOH) is associated with an increased risk of cardiac and renal complications. Although AI tools aim to predict IOH, their real-world reliability is often overstated because of biased data selection. This study introduces a framework to improve reliability by (1) including borderline cases with a mean arterial pressure (MAP) of 65-75 mmHg (the "Gray Zone"), (2) benchmarking the AI model against a simple MAP-threshold baseline, and (3) validating across diverse surgical cohorts, centers, and demographics. Using datasets from Karolinska University Hospital (Sweden) and VitalDB (Korea), we found that the AI model outperforms the MAP-threshold method in more ambiguous cases, whereas the two methods performed similarly well when hypotensive and non-hypotensive cases had clearly separated MAP values. Cross-dataset validation revealed asymmetric generalizability: models trained on datasets containing more borderline (Gray Zone) cases generalized better to datasets with clearer class separation, whereas models trained on clearly separated data struggled in the reverse direction. To ensure fair model comparison and reduce dataset-specific bias, we standardized the MAP difference between positive (hypotension) and negative (non-hypotension) samples at the time of prediction. This virtually eliminated the class separation and demonstrated that inflated performance on some datasets reflects selection bias rather than true model generalizability. Age also influenced generalization: cross-age validation showed that models trained on older patients generalized better to younger cohorts, whereas differences in ASA (American Society of Anesthesiologists) classification had minimal effect. These findings highlight the need for realistic validation to bridge the gap between AI research and clinical practice.
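To make two of the abstract's methodological steps concrete, the sketch below illustrates (a) the simple MAP-threshold baseline and (b) one plausible way to standardize the MAP difference between classes at prediction time, namely histogram-based subsampling of negatives. This is a minimal sketch under stated assumptions: the function names, the 65 mmHg cut-off, the 2 mmHg bin width, and the synthetic numbers are illustrative and are not taken from the study's actual implementation.

```python
# Illustrative sketch only: names, thresholds, and the matching routine
# are assumptions, not the study's published procedure.
import numpy as np

rng = np.random.default_rng(0)

def map_threshold_baseline(map_at_prediction, threshold=65.0):
    """Baseline: flag impending hypotension whenever the MAP observed at
    prediction time is already at or below a fixed threshold (65 mmHg is
    the usual hypotension definition; the exact cut-off is an assumption)."""
    return np.asarray(map_at_prediction, dtype=float) <= threshold

def match_map_distributions(map_pos, map_neg, bin_width=2.0):
    """Subsample negatives so their MAP histogram at prediction time matches
    the positives', removing the gross class separation that a plain
    threshold could exploit. Returns the kept indices into map_neg."""
    map_pos = np.asarray(map_pos, dtype=float)
    map_neg = np.asarray(map_neg, dtype=float)
    edges = np.arange(min(map_pos.min(), map_neg.min()),
                      max(map_pos.max(), map_neg.max()) + 2 * bin_width,
                      bin_width)
    pos_counts, _ = np.histogram(map_pos, edges)
    # Assign each negative to a bin; clip guards the upper-edge case.
    neg_bin = np.clip(np.digitize(map_neg, edges) - 1, 0, len(pos_counts) - 1)
    kept = []
    for b, target in enumerate(pos_counts):
        candidates = np.where(neg_bin == b)[0]
        take = min(int(target), len(candidates))
        if take:
            kept.extend(rng.choice(candidates, size=take, replace=False))
    return np.array(kept)

# Toy illustration: positives cluster near the Gray Zone, negatives higher.
map_pos = rng.normal(68, 4, 300)    # hypotensive cases, MAP around 65-75 mmHg
map_neg = rng.normal(85, 8, 3000)   # clearly non-hypotensive cases
kept = match_map_distributions(map_pos, map_neg)
print(f"negatives kept: {len(kept)}; "
      f"mean MAP gap before: {map_neg.mean() - map_pos.mean():.1f} mmHg, "
      f"after: {map_neg[kept].mean() - map_pos.mean():.1f} mmHg")
```

After such matching, the two classes have near-identical MAP distributions at prediction time, so any remaining advantage of an AI model over the threshold baseline must come from information beyond the current MAP value, which is the point of the standardization step described above.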