Accurate analysis of tumor morphology, vascularity, and tissue stiffness under multimodal ultrasound imaging plays a critical role in the diagnosis of breast cancer. However, manual interpretation across multiple modalities is time-consuming and heavily dependent on the radiologist's expertise. Computer-aided classification offers an efficient alternative, yet it remains challenging due to significant modality heterogeneity, inconsistent image quality, and redundant information across modalities. To address these issues, we propose a novel Multimodal Sparse Fusion Transformer Network (MSFT-Net). First, a Spatio-Temporal Decoupling Attention (STDA) architecture is introduced to disentangle and extract dynamic and static features from different modalities along the spatial and temporal dimensions, capturing modality-specific motion and morphological characteristics independently. Second, a Mixed-Scale Convolution Module (MSCM) extracts tumor features at multiple scales, enriching geometric detail representation and broadening receptive-field coverage. Third, a Sparse Cross-Attention Module (SCAM) adaptively retains only the most informative query-key interactions between modalities, thereby aggregating high-quality features for robust multimodal fusion. MSFT-Net is trained and tested on a curated dataset of multimodal breast tumor videos collected from 458 patients, comprising ultrasound (US), superb microvascular imaging (SMI), and strain elastography (SE), and its generalizability is further validated on the public BraTS'21 MRI dataset. Extensive experiments demonstrate that MSFT-Net outperforms state-of-the-art methods in multimodal breast tumor classification, providing fast and reliable support for radiologists in diagnostic tasks.
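To make the spatio-temporal decoupling idea concrete, the following is a minimal PyTorch sketch of a factorized attention block in the spirit of STDA: self-attention is applied separately over spatial tokens (static morphology within a frame) and over temporal tokens (dynamics across frames). The factorized layout, token shapes, and class name are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DecoupledSTAttention(nn.Module):
    """Illustrative sketch: spatial attention per frame, then temporal attention per location."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) video tokens -- T frames, N spatial tokens per frame
        b, t, n, d = x.shape
        # Static path: attend among spatial tokens within each frame
        s = x.reshape(b * t, n, d)
        s = self.spatial(s, s, s)[0].reshape(b, t, n, d)
        # Dynamic path: attend across frames at each spatial location
        m = s.permute(0, 2, 1, 3).reshape(b * n, t, d)
        m = self.temporal(m, m, m)[0].reshape(b, n, t, d).permute(0, 2, 1, 3)
        return m  # (B, T, N, D)
```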
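Similarly, a mixed-scale convolution block can be sketched as parallel branches with different kernel sizes whose outputs are concatenated, so the same input is seen at several receptive fields. The branch configuration below (kernel sizes 1/3/5/7, equal channel split) is an assumption for illustration, not the published MSCM.

```python
import torch
import torch.nn as nn

class MixedScaleConv(nn.Module):
    """Illustrative sketch: parallel convolutions at several kernel sizes, fused by concatenation."""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)  # assumes out_ch divisible by branch count
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch covers a different receptive field; channel-wise
        # concatenation mixes the scales into one feature map.
        return torch.cat([b(x) for b in self.branches], dim=1)

# Example: MixedScaleConv(64, 128)(torch.randn(2, 64, 56, 56)) -> (2, 128, 56, 56)
```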
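Finally, the sparse cross-attention mechanism can be sketched as cross-attention between two modalities in which only the top-k scores per query survive before the softmax, discarding weak or redundant interactions. The top-k retention rule, the projection layout, and all names are assumptions standing in for the paper's SCAM formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseCrossAttention(nn.Module):
    """Illustrative sketch: cross-modal attention keeping only top-k key interactions per query."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.k = k  # number of query-key interactions retained per query

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a: (B, N, D) tokens of modality A (queries)
        # x_b: (B, M, D) tokens of modality B (keys/values); assumes M >= k
        q, kk, v = self.q_proj(x_a), self.k_proj(x_b), self.v_proj(x_b)
        scores = q @ kk.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, N, M)
        # Mask everything below each query's k-th largest score to -inf,
        # so the softmax distributes weight only over the retained keys.
        kth = scores.topk(self.k, dim=-1).values[..., -1:]
        scores = scores.masked_fill(scores < kth, float("-inf"))
        return F.softmax(scores, dim=-1) @ v  # (B, N, D)
```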