
\begin{abstract}

% Motivation

Acquiring surgical data for research and development is hindered by high annotation costs and by practical and ethical constraints. Synthetically generated images could offer a valuable alternative.

% Our task

In this work, we explore adapting text-to-image generative models to the surgical domain using the CholecT50 dataset, which provides surgical images annotated with action triplets (instrument, verb, target).

% First & second contributions

We investigate several language models and find that T5 offers more distinct features for differentiating surgical actions from triplet-based textual inputs and shows stronger alignment between long and triplet-based captions.

% Third contribution

Training text-to-image models solely on triplet-based captions, without additional inputs or supervisory signals, is challenging. We find that triplet text embeddings are instrument-centric in the latent space and, leveraging this insight, design an instrument-based class balancing technique that counteracts data imbalance and skewness and improves training convergence.

% Experiment

Extending Imagen, a diffusion-based generative model, we develop {Surgical Imagen} to generate photorealistic and activity-aligned surgical images from triplet-based textual prompts.

% Results and conclusion

We assess the model on quality, alignment, reasoning, and knowledge, achieving FID and CLIP scores of 3.7 and 26.8\%, respectively. A human expert survey shows that participants were highly challenged by the realism of the generated samples, demonstrating Surgical Imagen's effectiveness as a practical alternative to real data collection.


{\def\thefootnote{}\footnotetext{\it Under consideration at Pattern Recognition Letters}}


\end{abstract}
