\begin{abstract}
Accurate segmentation is essential for echocardiography-based assessment of cardiovascular diseases (CVDs).
However, variability among sonographers and the inherent challenges of ultrasound images hinder precise segmentation.
By leveraging the joint representation of image and text modalities, Vision-Language Segmentation Models (VLSMs) can incorporate rich contextual information, potentially aiding accurate and explainable segmentation.
Yet the scarcity of readily available echocardiography data hampers the training of VLSMs.
In this study, we explore using synthetic datasets generated by Semantic Diffusion Models (SDMs) to enhance VLSMs for echocardiography segmentation.
We evaluate two popular VLSMs (CLIPSeg and CRIS) using seven kinds of language prompts derived from several attributes that are automatically extracted from echocardiography images, segmentation masks, and their metadata.
Our results show improved metrics and faster convergence when VLSMs are pretrained on SDM-generated synthetic images before being finetuned on real images.
The code, configs, and prompts are available at \url{https://github.com/naamiinepal/synthetic-boost}.

\keywords{Vision-Language Models \and Vision-Language Segmentation Models \and Echocardiography \and Synthetic Data}
\end{abstract}