511c08e6d71bec5a.tex
1: \begin{abstract}
2: % Motivation
3: Acquiring surgical data for research and development is significantly hindered by high annotation costs and by practical and ethical constraints. Synthetically generated images could offer a valuable alternative.
4: % Our task
5: In this work, we explore adapting text-to-image generative models for the surgical domain using the CholecT50 dataset, which provides surgical images annotated with action triplets (instrument, verb, target). 
6: % First & second contributions
7: We investigate several language models and find that T5 offers more distinct features for differentiating surgical actions on triplet-based textual inputs and shows stronger alignment between long and triplet-based captions.
8: % Third contribution
9: Training text-to-image models solely on triplet-based captions, without additional inputs or supervisory signals, is challenging. We discover that triplet text embeddings are instrument-centric in the latent space and, leveraging this insight, design an instrument-based class balancing technique that counteracts data imbalance and skewness and improves training convergence.
10: % Experiment
11: Extending Imagen, a diffusion-based generative model, we develop {Surgical Imagen} to generate photorealistic and activity-aligned surgical images from triplet-based textual prompts. 
12: % Results and conclusion
13: We assess the model on quality, alignment, reasoning, and knowledge, achieving FID and CLIP scores of 3.7 and 26.8\%, respectively. A human expert survey shows that participants were highly challenged by the realistic characteristics of the generated samples, demonstrating Surgical Imagen's effectiveness as a practical alternative to real data collection.
14: 
15: {\def\thefootnote{}\footnotetext{\it Under consideration at Pattern
16: Recognition Letters}}
17: 
18: \end{abstract}
19: