Self-supervised learning autonomously extracts features from unlabeled data, supporting downstream segmentation tasks with limited annotations. However, variations in imaging devices, acquisition parameters, and other factors cause distribution shifts across medical images, resulting in poor model generalizability. Mainstream frameworks include Instance Discrimination, which learns features from different views of the same image but may miss fine-grained details, and Masked Image Modeling (MIM), which captures local features by predicting masked regions but struggles to capture global information. To enhance generalizability by combining global and local information, we introduce the Siamese Evolutionary Masking (SEM) framework, which employs a Siamese architecture composed of an online branch and a target branch. The online branch adopts an evolutionary masking strategy that transitions from grid masking to block masking during training, encouraging the model to develop more general visual features. Additionally, a module called the Switch Decoder aligns the online branch's predicted features with the target branch's features, overcoming the challenge of balancing global and local information capture. Experiments on six public datasets, comprising four skin datasets (SD-260, ISIC2019, ISIC2017, and PH2) and two chest X-ray datasets (Chest X-ray PD and Chest X-ray), demonstrate that SEM achieves strong performance among self-supervised methods. In cross-dataset experiments with differing distributions, SEM achieves the best segmentation and generalization performance, with Dice scores of 81.8% and 91.1%, Jaccard indices of 72.2% and 84.4%, and the best HD95 values of 13.1 and 10.5, respectively. Code is available at https://github.com/wsdl666/Siamese-Evolutionary-Masking.
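
To make the evolutionary masking strategy concrete, here is a minimal sketch of a grid-to-block mask schedule. The abstract does not specify the exact schedule, so the linear interpolation via `progress`, the masking ratio, and the helper names (`grid_mask`, `block_mask`, `evolutionary_mask`) are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grid_mask(h, w, ratio):
    """Grid masking: masked patches are spread uniformly over the patch grid."""
    n_mask = int(round(ratio * h * w))
    mask = np.zeros(h * w, dtype=bool)
    # evenly spaced patch indices approximate a regular grid pattern
    idx = np.linspace(0, h * w - 1, n_mask).astype(int)
    mask[idx] = True
    return mask.reshape(h, w)

def block_mask(h, w, ratio, rng):
    """Block masking: one contiguous rectangle of patches is hidden
    (covers approximately `ratio` of the grid)."""
    n_mask = int(round(ratio * h * w))
    bh = min(h, max(1, int(round(np.sqrt(n_mask)))))
    bw = min(w, int(np.ceil(n_mask / bh)))
    top = rng.integers(0, h - bh + 1)
    left = rng.integers(0, w - bw + 1)
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + bh, left:left + bw] = True
    return mask

def evolutionary_mask(h, w, ratio, progress, rng):
    """Sample a patch mask that evolves from grid to block masking.
    `progress` in [0, 1] is the fraction of training completed:
    early steps favor grid masks, late steps favor block masks."""
    if rng.random() < progress:
        return block_mask(h, w, ratio, rng)
    return grid_mask(h, w, ratio)

# Example: a 14x14 patch grid, 60% masking ratio, halfway through training.
rng = np.random.default_rng(0)
mask = evolutionary_mask(14, 14, ratio=0.6, progress=0.5, rng=rng)
```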
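The Siamese alignment between the two branches can likewise be sketched in the style of EMA-target masked-feature prediction. This is a heavily simplified assumption of how the online branch, the Switch Decoder, and the target branch might interact; the `encoder`/`decoder` modules, the EMA momentum, and the cosine loss are placeholders, and the paper's actual Switch Decoder may differ.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseMIM(nn.Module):
    """Minimal Siamese masked-modeling sketch: the online branch predicts
    features of masked patches, and a momentum (EMA) target branch
    supplies the alignment targets."""
    def __init__(self, encoder, decoder, momentum=0.996):
        super().__init__()
        self.online_encoder = encoder
        self.decoder = decoder              # stand-in for the Switch Decoder
        self.target_encoder = copy.deepcopy(encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)         # target branch gets no gradients
        self.m = momentum

    @torch.no_grad()
    def update_target(self):
        # EMA update of the target branch from the online branch
        for po, pt in zip(self.online_encoder.parameters(),
                          self.target_encoder.parameters()):
            pt.mul_(self.m).add_(po.detach(), alpha=1 - self.m)

    def forward(self, patches, mask):
        # patches: (B, N, D) patch embeddings; mask: (B, N) bool,
        # True where the online branch's input is hidden
        online_feats = self.online_encoder(patches * (~mask).unsqueeze(-1))
        pred = self.decoder(online_feats)   # predict features at masked positions
        with torch.no_grad():
            target_feats = self.target_encoder(patches)
        # align predictions with target features only on masked patches
        loss = 1 - F.cosine_similarity(pred[mask], target_feats[mask], dim=-1).mean()
        return loss
```

In this reading, restricting the loss to masked positions drives local feature prediction (as in MIM), while the unmasked target branch supplies globally contextualized targets, which is one plausible way the two branches could balance local and global information.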