1: \begin{abstract}
2: Conventional methods for speaker diarization involve windowing an audio file into short segments to extract speaker embeddings, followed by an unsupervised clustering of the embeddings. 
3: %to generate speaker labels for each segment which is a multi-step approach. 
4: This multi-step approach generates speaker assignments for each segment. 
5: %The alternate approach of training end-to-end diarization system is cumbersome and data-intensive while also having difficulties in generalizing to a large number of speakers.
6: In this paper, we propose a novel Supervised HierArchical gRaph Clustering algorithm (SHARC) for speaker diarization, which introduces a hierarchical structure using a Graph Neural Network (GNN) to perform supervised clustering. The supervision allows the model to update the representations and directly improve the clustering performance, thus enabling a single-step approach for diarization. In the proposed approach, the input segment embeddings are treated as nodes of a graph, with the edge weights corresponding to the similarity scores between the nodes.
7: We also propose an approach to jointly update the embedding extractor and the GNN model to perform end-to-end speaker diarization (E2E-SHARC). 
8: %The model is trained using conversational audio datasets with the ground truth speaker labels. 
9: %The proposed approach has shown improvements over the baseline on AMI and Voxconverse datsets. 
10: %We perform speaker diarization experiments on benchmark datasets like the AMI setup and Voxconverse 2021 setup. 
11: During inference, hierarchical clustering is performed using node densities and edge existence probabilities to merge the segments until convergence. 
12: In diarization experiments, we show that the proposed E2E-SHARC approach achieves $53\%$ and $44\%$ relative improvements over the baseline systems on the AMI and Voxconverse benchmark datasets, respectively.
13: \end{abstract}
14: