Zero-shot sketch-based image retrieval (ZS-SBIR) is challenging due to the cross-domain nature of sketches and photos, as well as the semantic gap between seen and unseen classes. With the rapid advancement of modern large vision-language models (VLMs), traditional approaches that relied solely on small vision encoders have gradually been supplanted. However, both traditional vision-encoder-based methods and modern VLM-based methods have their limitations, and no unified approach effectively addresses both challenges. In this paper, we present an effective "Adaptation and Alignment (AdaAlign)" approach to tackle them. Specifically, we insert lightweight Adapter or LoRA modules to learn the new abstract concepts of sketches and improve cross-domain representation, which helps alleviate domain heterogeneity. We then align the learned image embeddings directly with the more semantically rich text embeddings within a distillation framework to bridge the semantic gap, enabling the model to learn more generalizable visual representations from linguistic semantic cues. We integrate these key innovations into both traditional small models (e.g., ResNet50 or DINO-S) and modern VLMs (e.g., SigLIP), achieving state-of-the-art performance. Extensive experiments on three benchmark datasets demonstrate the superiority of our method in terms of retrieval accuracy and flexibility.
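To make the two components concrete, below is a minimal PyTorch sketch of the general idea only: a frozen linear projection wrapped with a LoRA update, and a contrastive-style alignment loss that pulls adapted image embeddings toward text embeddings. The names (LoRALinear, alignment_loss), the rank, the temperature, and the exact loss form are illustrative assumptions, not the AdaAlign implementation described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as a zero perturbation of the base layer
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def alignment_loss(img_emb, txt_emb, tau: float = 0.07):
    """Pull each image embedding toward its paired text embedding
    (a generic contrastive form, assumed for illustration)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                 # similarity of every image to every text
    targets = torch.arange(img.size(0))          # assume the i-th text matches the i-th image
    return F.cross_entropy(logits, targets)

# Toy usage: adapt one projection head and align its output with text features.
backbone_head = nn.Linear(768, 512)              # stand-in for a frozen encoder's projection
adapted_head = LoRALinear(backbone_head, rank=8)
img_feats = torch.randn(16, 768)                 # features from a frozen vision backbone
txt_feats = torch.randn(16, 512)                 # e.g., class-name embeddings from a text encoder
loss = alignment_loss(adapted_head(img_feats), txt_feats)
loss.backward()                                  # only the LoRA parameters receive gradients
```

In this sketch only the low-rank adapter parameters are updated, which mirrors the paper's goal of adapting a pretrained encoder to sketches cheaply while the alignment term injects linguistic semantic cues.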