Traffic accidents frequently result from driver inattention, drowsiness, and distraction, posing a substantial threat to road safety worldwide. Advances in computer vision and artificial intelligence (AI) have opened new prospects for real-time driver monitoring systems that mitigate these risks. In this paper, we evaluate four well-known deep learning models, MobileNetV2, DenseNet201, NASNetMobile, and VGG19, and propose a Hybrid CNN-Transformer architecture augmented with Efficient Channel Attention (ECA) for multi-class driver activity classification. The framework covers seven key driving behaviors: Closed Eye, Open Eye, Dangerous Driving, Distracted Driving, Drinking, Yawning, and Safe Driving. Among the baseline models, DenseNet201 (99.40%) and MobileNetV2 (99.31%) achieved the highest validation accuracies. The proposed Hybrid CNN-Transformer with ECA surpassed these baselines, attaining a near-perfect validation accuracy of 99.72% and further demonstrating 100% accuracy on the independent test set. Confusion matrix analyses reveal only a few misclassifications, confirming the model's strong generalization capacity. By combining CNN-based local feature extraction, attention-driven channel refinement, and Transformer-based global context modeling, the system offers both robustness and efficiency. These findings demonstrate the practicality of deploying the proposed approach in real-time intelligent transportation applications, offering a viable path toward reducing traffic accidents and improving overall road safety.
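
To make the described composition concrete, the following is a minimal Keras-style sketch of a hybrid model that chains a CNN stem (local feature extraction), an ECA block (channel-wise attention), and a Transformer encoder over spatial tokens (global context), ending in a seven-class softmax head. The stem depth, filter widths, head counts, and training settings are illustrative assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # Closed Eye, Open Eye, Dangerous, Distracted, Drinking, Yawning, Safe Driving

def eca_block(x, k_size=3):
    """Efficient Channel Attention: 1D conv over pooled channel descriptors."""
    channels = x.shape[-1]
    y = layers.GlobalAveragePooling2D()(x)          # (batch, channels)
    y = layers.Reshape((channels, 1))(y)            # treat channels as a sequence
    y = layers.Conv1D(1, kernel_size=k_size, padding="same", use_bias=False)(y)
    y = layers.Activation("sigmoid")(y)
    y = layers.Reshape((1, 1, channels))(y)
    return layers.Multiply()([x, y])                # channel-wise re-weighting

def transformer_encoder(x, num_heads=4, ff_dim=256, dropout=0.1):
    """One standard Transformer encoder block over spatial tokens."""
    dim = x.shape[-1]
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=dim // num_heads)(x, x)
    x = layers.LayerNormalization()(x + layers.Dropout(dropout)(attn))
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(dim)(ff)
    return layers.LayerNormalization()(x + layers.Dropout(dropout)(ff))

def build_hybrid_model(input_shape=(224, 224, 3)):
    inputs = layers.Input(shape=input_shape)
    # CNN stem for local feature extraction (placeholder depth and widths)
    x = inputs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
    # Attention-driven channel refinement
    x = eca_block(x)
    # Flatten the spatial grid into tokens for global context modeling
    h, w, c = x.shape[1], x.shape[2], x.shape[3]
    tokens = layers.Reshape((h * w, c))(x)
    tokens = transformer_encoder(tokens)
    # Classification head over the seven driver behaviors
    x = layers.GlobalAveragePooling1D()(tokens)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_hybrid_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

In this arrangement the ECA block re-weights feature channels with negligible parameter overhead before the Transformer attends across all spatial positions, which is one plausible way to realize the "local extraction, channel refinement, global context" pipeline summarized above.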