In this paper, we introduce VoxBlink-CN, a large-scale Chinese audio-visual speaker verification dataset comprising 41,739 speakers. The dataset addresses critical limitations of existing Chinese audio datasets, including the small number of speakers, labeling inaccuracies, and poor generalization across diverse audio-visual conditions. By leveraging an automated pipeline that efficiently extracts data from Bilibili, a popular Chinese video-sharing platform, we substantially expand the diversity and volume of Chinese audio data available for research. Our methodology centers on a novel video processing pipeline that extracts speaker-specific clips from large volumes of video material while ensuring high labeling precision and diversity in the dataset's composition; the pipeline is also designed for efficiency and stability. Given its scale and scope relative to existing datasets, VoxBlink-CN has strong potential to improve model performance on Chinese speaker verification tasks. Its release marks a significant advance in the resources available for Chinese audio-visual research, offering a robust foundation for future work in speech recognition, speaker verification, and multimodal machine learning.
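As a rough illustration of the final clip-extraction step only (not the authors' actual pipeline), the sketch below assumes an upstream stage has already produced per-speaker timestamps, for example via face tracking and active speaker detection, and simply cuts those segments out of downloaded videos with ffmpeg. The file names, paths, and the `cut_clip` helper are hypothetical.

```python
# Illustrative sketch: cut speaker-attributed segments out of source videos.
# Assumes ffmpeg is installed and candidate (video, start, end, speaker) tuples
# come from an upstream detection stage; all concrete values are placeholders.
import subprocess
from pathlib import Path


def cut_clip(src_video: Path, start: float, end: float, out_path: Path) -> None:
    """Extract the [start, end] second range of src_video into out_path."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src_video),
         "-ss", str(start), "-to", str(end),
         "-c", "copy", str(out_path)],
        check=True,
    )


# Hypothetical candidate segments produced by earlier pipeline stages.
candidates = [
    (Path("video_001.mp4"), 12.4, 18.9, "spk_0001"),
    (Path("video_001.mp4"), 45.0, 52.3, "spk_0001"),
]

for video, start, end, spk in candidates:
    out_dir = Path("clips") / spk
    out_dir.mkdir(parents=True, exist_ok=True)
    out_name = f"{video.stem}_{start:.1f}_{end:.1f}.mp4"
    cut_clip(video, start, end, out_dir / out_name)
```

Stream copying (`-c copy`) avoids re-encoding, which keeps large-scale extraction fast at the cost of cuts landing on keyframe boundaries.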