1: \begin{abstract}
2: Conventional methods for speaker diarization involve windowing an audio file into short segments to extract speaker embeddings, followed by unsupervised clustering of the embeddings.
3: %to generate speaker labels for each segment which is a multi-step approach.
4: This multi-step approach generates speaker assignments for each segment.
5: %The alternate approach of training end-to-end diarization system is cumbersome and data-intensive while also having difficulties in generalizing to a large number of speakers.
6: In this paper, we propose a novel Supervised HierArchical gRaph Clustering algorithm (SHARC) for speaker diarization, which introduces a hierarchical structure using a Graph Neural Network (GNN) to perform supervised clustering. The supervision allows the model to update the representations and directly improve the clustering performance, thus enabling a single-step approach for diarization. In the proposed work, the input segment embeddings are treated as nodes of a graph, with the edge weights corresponding to the similarity scores between the nodes.
7: We also propose an approach to jointly update the embedding extractor and the GNN model to perform end-to-end speaker diarization (E2E-SHARC).
8: %The model is trained using conversational audio datasets with the ground truth speaker labels.
9: %The proposed approach has shown improvements over the baseline on AMI and Voxconverse datasets.
10: %We perform speaker diarization experiments on benchmark datasets like the AMI setup and Voxconverse 2021 setup.
11: During inference, hierarchical clustering uses node densities and edge existence probabilities to merge the segments until convergence.
12: In the diarization experiments, we show that the proposed E2E-SHARC approach achieves $53\%$ and $44\%$ relative improvements over the baseline systems on the AMI and Voxconverse benchmark datasets, respectively.
13: \end{abstract}
14: