We organized the SVDD Challenge at SLT 2024 to advance research in this field.

CtrSVDD

A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

TL;DR: We present CtrSVDD, a novel dataset for controlled singing voice deepfake detection, together with corresponding baselines.

Presented at Interspeech 2024

Abstract

Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from the training distribution. The CtrSVDD dataset and baselines are publicly accessible.

Dataset Design

The CtrSVDD dataset consists of 220,798 mono vocal clips totaling 307.98 hours, sampled at 16 kHz. Our bonafide singing vocals are sourced from existing open singing datasets, including the Mandarin singing datasets Opencpop, M4Singer, Kising, and the official ACE-Studio release, and the Japanese singing datasets Ofuton-P, Oniku Kurumi, Kiritan, and JVS-MuSiC. We incorporate 14 deepfake systems across both singing voice synthesis (SVS) and singing voice conversion (SVC) to cover many existing architectures, offering a comprehensive evaluation landscape.
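For readers who want to work with the audio directly, the sketch below shows one way to load a clip as a 16 kHz mono waveform with librosa. The file path is a placeholder for illustration only; it does not reflect the official CtrSVDD directory layout or protocol format.

```python
# Minimal sketch: loading a CtrSVDD-style vocal clip as 16 kHz mono audio.
# The path below is a hypothetical placeholder, not an official file name.
import librosa


def load_vocal_clip(path, target_sr=16000):
    """Load an audio file as a mono waveform resampled to target_sr."""
    # librosa.load downmixes to mono and resamples to the requested rate.
    waveform, sr = librosa.load(path, sr=target_sr, mono=True)
    return waveform, sr


if __name__ == "__main__":
    wav, sr = load_vocal_clip("path/to/clip.flac")  # placeholder path
    print(f"Loaded {len(wav) / sr:.2f} s of audio at {sr} Hz")
```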

Partition  # Speakers  # Bonafide Utts  # Deepfake Utts  Attack Types
Train      59          12,169           72,235           A01∼A08
Dev        55          6,547            37,078           A01∼A08
Eval       48          13,596           79,173           A09∼A14

Summary of our CtrSVDD dataset.

An overview of the source datasets and the distribution of deepfake methods is illustrated in the figure below:

Overview of the source datasets and deepfake method distribution across the train/dev/eval splits of our CtrSVDD data.

Samples from the CtrSVDD Dataset

We select several samples from CtrSVDD for demonstration.

Bonafide sample from M4Singer, Speaker #110

Deepfake sample from M4Singer, Speaker #110, Attack Type A07

Bonafide sample from Opencpop, Speaker #58

Deepfake sample from Opencpop, Speaker #58, Attack Type A01

Bonafide sample from Ofuton-P, Speaker #000

Deepfake sample from Ofuton-P, Speaker #000, Attack Type A05

Baseline Systems

We design a versatile baseline framework to facilitate a fair evaluation of diverse front-end representations. The back-end is based on AASIST, and we use spectrogram, mel-spectrogram, MFCC, LFCC, and raw-waveform front-ends.
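As a rough illustration of how such interchangeable front-ends can be instantiated, the sketch below uses torchaudio transforms. The FFT size, hop length, and coefficient counts are illustrative assumptions rather than the exact hyperparameters of our baselines, and the AASIST back-end is omitted.

```python
# A minimal sketch of swappable front-end feature extractors (assumed settings).
import torch
import torchaudio.transforms as T

SAMPLE_RATE = 16000

frontends = {
    "spectrogram": T.Spectrogram(n_fft=512, hop_length=160),
    "mel_spectrogram": T.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=512, hop_length=160, n_mels=80
    ),
    "mfcc": T.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=20),
    "lfcc": T.LFCC(sample_rate=SAMPLE_RATE, n_lfcc=20),
    # Raw-waveform front-end: the samples are passed to the back-end directly.
    "raw_waveform": torch.nn.Identity(),
}

waveform = torch.randn(1, 4 * SAMPLE_RATE)  # dummy 4-second mono clip
for name, frontend in frontends.items():
    features = frontend(waveform)
    print(f"{name:16s} -> {tuple(features.shape)}")
```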

Frontend          Pooled EER (%)  A09           A10           A11           A12           A13           A14
Spectrogram       25.50 ± 0.09    32.02 ± 0.09  14.03 ± 0.10  14.67 ± 0.08  35.18 ± 0.31  18.10 ± 0.10  28.55 ± 0.17
Mel-Spectrogram   25.19 ± 0.10    25.29 ± 0.14  15.95 ± 0.15  37.31 ± 0.09  29.28 ± 0.25  12.86 ± 0.08  27.54 ± 0.11
MFCC              26.67 ± 0.07     6.87 ± 0.10   2.50 ± 0.07   4.18 ± 0.06  45.57 ± 0.11   3.28 ± 0.04  42.98 ± 0.08
LFCC              16.15 ± 0.06     5.35 ± 0.07   2.92 ± 0.04   5.84 ± 0.07  29.47 ± 0.06   3.65 ± 0.05  24.00 ± 0.10
Raw Waveform      13.75 ± 0.11     6.72 ± 0.06   0.96 ± 0.05   3.59 ± 0.06  26.83 ± 0.10   0.95 ± 0.04  19.03 ± 0.12

Evaluation results (EER, %) of the baseline systems on the evaluation set: pooled EER and per-method EER for attack types A09∼A14. The best-performing result in each column is shown in bold.
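For reference, the sketch below shows one common way to compute the equal error rate (EER) reported above from detection scores, using scikit-learn's ROC utilities. The toy labels and scores are placeholders, not outputs of the baseline systems.

```python
# Minimal sketch: equal error rate (EER) from detection scores.
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels, scores):
    """EER: the operating point where false acceptance equals false rejection.

    labels: 1 for bonafide, 0 for deepfake; scores: higher means more bonafide.
    """
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    # Pick the threshold where |FPR - FNR| is smallest and average the two rates.
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2


labels = np.array([1, 1, 1, 0, 0, 0, 0])        # toy ground truth
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1])  # toy detection scores
print(f"EER = {compute_eer(labels, scores) * 100:.2f}%")
```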