Cardiovascular implantable electronic devices (CIEDs) induce severe off-resonance artifacts in balanced steady-state free precession (bSSFP) cine MRI, limiting its diagnostic utility for a growing patient population. Supervised and unpaired learning methods have shown promise for artifact suppression, but their reliance on paired ground truth or on an artifact-free image domain makes them impractical for clinical CIED imaging. To address this, we propose a self-supervised framework that integrates Noise2Noise, physics-driven multi-instance contrastive learning, and an anisotropic spatiotemporal transformer, eliminating the need for clean data. Central to our approach is the linear combination property of bSSFP phase cycling: multiple artifact-corrupted acquisitions with incremental RF phase shifts serve as anatomically consistent "pseudo-pairs." A novel multi-instance contrastive loss enforces consistency between the artifact-suppressed outputs of these pseudo-pairs, compensating for the finite-sample bias and spatially correlated artifacts that violate conventional Noise2Noise assumptions. In addition, the anisotropic spatiotemporal transformer hierarchically models long-range dependencies using anisotropic spatial and spatiotemporal attention windows that align better with cardiac anatomy, preserving myocardial texture and dynamic motion. Experiments on simulated and real CIED datasets demonstrate improved performance relative to alternative methods. This work bridges the gap between idealized statistical learning and MRI physics, providing a feasible solution for real-world cardiac cine imaging when ground truth is inaccessible.
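
As a rough illustration of the pseudo-pair idea summarized above, the training objective can be sketched in generic Noise2Noise form. The additive artifact model, the symbols ($x$, $a_i$, $y_i$, $f_\theta$, $\lambda$, $N$), and the squared-error consistency term are all assumptions introduced here for illustration. In particular, the consistency term is only a simple stand-in for the multi-instance contrastive loss and should not be read as the authors' formulation.

```latex
% Assumed simplified model: N phase-cycled acquisitions y_i of the same
% anatomy x, each corrupted by a different artifact realization a_i.
\begin{align}
  y_i &= x + a_i, \qquad i = 1, \dots, N \\
  % Generic Noise2Noise term: predict one corrupted acquisition from another.
  \mathcal{L}_{\mathrm{N2N}}(\theta)
    &= \frac{1}{N(N-1)} \sum_{i \neq j}
       \left\| f_\theta(y_i) - y_j \right\|_2^2 \\
  % Illustrative consistency term (stand-in for the contrastive loss):
  % outputs from different acquisitions of the same anatomy should agree.
  \mathcal{L}_{\mathrm{cons}}(\theta)
    &= \frac{2}{N(N-1)} \sum_{i < j}
       \left\| f_\theta(y_i) - f_\theta(y_j) \right\|_2^2 \\
  % Combined objective with a hypothetical weighting \lambda \ge 0.
  \mathcal{L}(\theta)
    &= \mathcal{L}_{\mathrm{N2N}}(\theta)
       + \lambda \, \mathcal{L}_{\mathrm{cons}}(\theta)
\end{align}
```

Under the idealized Noise2Noise assumption that the artifact in the target is zero-mean given the input, $\mathbb{E}[a_j \mid y_i] = 0$, the first term equals, in expectation, a supervised loss against $x$ up to a constant that does not depend on $\theta$. The abstract notes that this assumption does not hold for CIED artifacts, which is the stated motivation for adding a consistency term across acquisitions.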