In medical image analysis, accurate lesion segmentation benefits subsequent clinical diagnosis and treatment planning. Various deep learning-based methods have been proposed for this segmentation task. Although they achieve promising performance, fully-supervised approaches require pixel-level annotations for model training, which are tedious and time-consuming for experienced radiologists to produce. In this paper, we propose a weakly semi-supervised segmentation framework called Point Segmentation Transformer (Point SEGTR). Specifically, the framework trains the network on a small amount of fully-supervised data with pixel-level segmentation masks and a large amount of weakly-supervised data with point-level annotations (i.e., one point annotated inside each object), which significantly reduces the demand for pixel-level annotations. To fully exploit the pixel-level and point-level annotations, we propose two regularization terms, multi-point consistency and symmetric consistency, that improve the quality of the pseudo labels, which are then used to train a student model for inference. Extensive experiments are conducted on three endoscopy datasets covering different lesion structures and several body sites (e.g., colorectal and nasopharynx). The experimental results substantiate the effectiveness and generality of the proposed method, as well as its potential to relax the requirement for pixel-level annotations, which is valuable for clinical applications.