
\begin{abstract}

Conventional methods for speaker diarization window an audio file into short segments, extract a speaker embedding from each segment, and then apply unsupervised clustering to the embeddings.

%to generate speaker labels for each segment which is a multi-step approach.

This multi-step approach generates speaker assignments for each segment.

%The alternate approach of training end-to-end diarization system is cumbersome and data-intensive while also having difficulties in generalizing to a large number of speakers.

In this paper, we propose a novel Supervised HierArchical gRaph Clustering algorithm (SHARC) for speaker diarization, where we introduce a hierarchical structure using a Graph Neural Network (GNN) to perform supervised clustering. The supervision allows the model to update the representations and directly improve the clustering performance, thus enabling a single-step approach for diarization. In the proposed work, the input segment embeddings are treated as nodes of a graph, with the edge weights corresponding to the similarity scores between the nodes.

We also propose an approach to jointly update the embedding extractor and the GNN model to perform end-to-end speaker diarization (E2E-SHARC).

%The model is trained using conversational audio datasets with the ground truth speaker labels.

%The proposed approach has shown improvements over the baseline on AMI and Voxconverse datasets.

%We perform speaker diarization experiments on benchmark datasets like the AMI setup and Voxconverse 2021 setup.

During inference, hierarchical clustering is performed using node densities and edge existence probabilities to merge the segments until convergence.

In the diarization experiments, we show that the proposed E2E-SHARC approach achieves $53\%$ and $44\%$ relative improvements over the baseline systems on the AMI and Voxconverse benchmark datasets, respectively.

\end{abstract}
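
The graph construction and the reported metric summarized in the abstract can be sketched as follows. The notation is assumed for illustration and is not taken from the abstract: $x_i$ denotes the embedding of segment $i$, $N$ the number of segments, and $s(\cdot,\cdot)$ a similarity function.
\[
G = (V, E), \qquad V = \{x_1, \dots, x_N\}, \qquad w_{ij} = s(x_i, x_j).
\]
Cosine similarity is shown here only as one common choice for $s$; the abstract does not specify the scoring function:
\[
s(x_i, x_j) = \frac{x_i^{\top} x_j}{\|x_i\| \, \|x_j\|}.
\]
Likewise, assuming the improvements are measured on an error rate such as the diarization error rate (DER), a relative improvement is conventionally computed as
\[
\Delta_{\mathrm{rel}} = \frac{\mathrm{DER}_{\mathrm{baseline}} - \mathrm{DER}_{\mathrm{proposed}}}{\mathrm{DER}_{\mathrm{baseline}}} \times 100\%,
\]
so that, for a hypothetical baseline DER of $10\%$, a $50\%$ relative improvement would correspond to a DER of $5\%$.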
