Near-infrared (NIR) spectroscopy, a key tool in process analytical technology (PAT), offers significant potential for real-time monitoring of quality marker (Q-Marker) concentrations in traditional Chinese medicine (TCM) extracts to ensure batch-to-batch consistency. However, interference in industrial extraction environments, such as noise, mechanical vibration, and temperature fluctuation, can destabilize spectral data, reduce measurement repeatability, and cause baseline drift, thereby lowering the prediction accuracy of NIR spectroscopy models. To address these challenges, we propose a Multi-Source Cross-Scale Attention Fusion Network (MSCAF-Net), which fuses complementary spectral information from two NIR spectrometers (Bruker MATRIX-F II and Optosky ATP8000). This approach captures more effective features, improves the signal-to-noise ratio, and enhances the robustness and accuracy of NIR spectral predictions. The MSCAF-Net architecture uses a cross-scale feature extraction module to harmonize the spectral inputs, followed by a multi-head attention mechanism that selectively focuses on critical features. The fused features are then passed through a three-layer convolutional neural network with varying kernel sizes to perform regression. The model was validated on a dataset of 1008 NIR spectra and 1512 corresponding concentration measurements collected from a pilot-scale TCM production line for Xuefu Zhuyu Oral Liquid (XZOL). Experimental results show that MSCAF-Net achieves R² values of 0.9870, 0.9723, and 0.8953 for the quantitative prediction of three Q-Markers (naringin, paeoniflorin, and amygdalin, respectively), outperforming both single-spectrometer models and recent fusion-based approaches. These findings highlight the practical value of MSCAF-Net for real-time monitoring and quality control in TCM production.
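
For readers less familiar with the reported metric, the following is a minimal sketch of how a coefficient of determination is conventionally computed. It assumes the standard definition R² = 1 − SS_res / SS_tot; the abstract does not state which variant the authors used. The function name `r_squared` and the concentration values are hypothetical and purely illustrative; they are not data from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot.

    Assumes the standard definition; the study may use a different variant.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the reference values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all reference values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical reference and predicted concentrations (illustrative only).
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(reference, predicted)  # about 0.98
```

Under this definition, a value close to 1 means the predictions explain nearly all of the variance in the reference concentrations, while a model that always predicts the mean scores 0.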