\begin{abstract}
Accurate segmentation is essential for echocardiography-based assessment of cardiovascular diseases (CVDs).
However, variability among sonographers and the inherent challenges of ultrasound images hinder precise segmentation.
By leveraging the joint representation of image and text modalities, Vision-Language Segmentation Models (VLSMs) can incorporate rich contextual information, potentially aiding accurate and explainable segmentation.
Yet the lack of readily available echocardiography data hampers the training of VLSMs.
In this study, we explore using synthetic datasets generated by Semantic Diffusion Models (SDMs) to enhance VLSMs for echocardiography segmentation.
We evaluate two popular VLSMs (CLIPSeg and CRIS) using seven different kinds of language prompts derived from several attributes automatically extracted from echocardiography images, segmentation masks, and their metadata.
Our results show improved metrics and faster convergence when VLSMs are pretrained on SDM-generated synthetic images before being finetuned on real images.
The code, configs, and prompts are available at \url{https://github.com/naamiinepal/synthetic-boost}.

\keywords{Vision-Language Models \and Vision-Language Segmentation Models \and Echocardiography \and Synthetic Data}
\end{abstract}
