We organized the SVDD Challenge at SLT 2024 to advance research in this field.

CtrSVDD

A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda, Zhiyao Duan

TL;DR: We present CtrSVDD, a novel dataset for controlled singing voice deepfake detection, together with corresponding baselines.

Presented at Interspeech 2024

Abstract

Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from the training distribution. The CtrSVDD dataset and baselines are publicly accessible.

Dataset Design

The CtrSVDD dataset consists of 220,798 mono vocal clips totaling 307.98 hours, sampled at 16 kHz. Our bonafide singing vocals are sourced from existing open singing datasets, including the Mandarin singing datasets Opencpop, M4Singer, Kising, and the official ACE-Studio release, and the Japanese singing datasets Ofuton-P, Oniku Kurumi, Kiritan, and JVS-MuSiC. We incorporate 14 deepfake systems across both singing voice synthesis (SVS) and singing voice conversion (SVC) to cover many existing architectures, offering a comprehensive evaluation landscape.
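For readers who want to work with the audio directly, the sketch below shows one way to load a clip as a 16 kHz mono waveform with librosa. The file path is a placeholder for illustration only; it does not reflect the official CtrSVDD directory layout or protocol format.

```python
# Minimal sketch: loading a CtrSVDD-style vocal clip as 16 kHz mono audio.
# The path below is a hypothetical placeholder, not an official file name.
import librosa


def load_vocal_clip(path, target_sr=16000):
    """Load an audio file as a mono waveform resampled to target_sr."""
    # librosa.load downmixes to mono and resamples to the requested rate.
    waveform, sr = librosa.load(path, sr=target_sr, mono=True)
    return waveform, sr


if __name__ == "__main__":
    wav, sr = load_vocal_clip("path/to/clip.flac")  # placeholder path
    print(f"Loaded {len(wav) / sr:.2f} s of audio at {sr} Hz")
```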

Partition  # Speakers  # Bonafide Utts  # Deepfake Utts  Attack Types
Train      59          12,169           72,235           A01∼A08
Dev        55          6,547            37,078           A01∼A08
Eval       48          13,596           79,173           A09∼A14

Summary of our CtrSVDD dataset.

An overview of the source datasets and the distribution of deepfake methods is illustrated in the figure below:

Overview of the source datasets and deepfake method distribution across the train/dev/eval splits of our CtrSVDD data.

Samples from the CtrSVDD Dataset

We select several samples from CtrSVDD for demonstration.

Bonafide sample from M4Singer, Speaker #110

Deepfake sample from M4Singer, Speaker #110, Attack Type A07

Bonafide sample from Opencpop, Speaker #58

Deepfake sample from Opencpop, Speaker #58, Attack Type A01

Bonafide sample from Ofuton-P, Speaker #000

Deepfake sample from Ofuton-P, Speaker #000, Attack Type A05

Baseline Systems

We design a versatile baseline framework to facilitate a fair evaluation of diverse front-end representations. The back-end is based on AASIST, and we use spectrogram, mel-spectrogram, MFCC, LFCC, and raw-waveform front-ends.
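As a rough illustration of how such interchangeable front-ends can be instantiated, the sketch below uses torchaudio transforms. The FFT size, hop length, and coefficient counts are illustrative assumptions rather than the exact hyperparameters of our baselines, and the AASIST back-end is omitted.

```python
# A minimal sketch of swappable front-end feature extractors (assumed settings).
import torch
import torchaudio.transforms as T

SAMPLE_RATE = 16000

frontends = {
    "spectrogram": T.Spectrogram(n_fft=512, hop_length=160),
    "mel_spectrogram": T.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=512, hop_length=160, n_mels=80
    ),
    "mfcc": T.MFCC(sample_rate=SAMPLE_RATE, n_mfcc=20),
    "lfcc": T.LFCC(sample_rate=SAMPLE_RATE, n_lfcc=20),
    # Raw-waveform front-end: the samples are passed to the back-end directly.
    "raw_waveform": torch.nn.Identity(),
}

waveform = torch.randn(1, 4 * SAMPLE_RATE)  # dummy 4-second mono clip
for name, frontend in frontends.items():
    features = frontend(waveform)
    print(f"{name:16s} -> {tuple(features.shape)}")
```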

Frontend          Pooled EER (%)  A09           A10           A11           A12           A13           A14
Spectrogram       25.50 ± 0.09    32.02 ± 0.09  14.03 ± 0.10  14.67 ± 0.08  35.18 ± 0.31  18.10 ± 0.10  28.55 ± 0.17
Mel-Spectrogram   25.19 ± 0.10    25.29 ± 0.14  15.95 ± 0.15  37.31 ± 0.09  29.28 ± 0.25  12.86 ± 0.08  27.54 ± 0.11
MFCC              26.67 ± 0.07     6.87 ± 0.10   2.50 ± 0.07   4.18 ± 0.06  45.57 ± 0.11   3.28 ± 0.04  42.98 ± 0.08
LFCC              16.15 ± 0.06     5.35 ± 0.07   2.92 ± 0.04   5.84 ± 0.07  29.47 ± 0.06   3.65 ± 0.05  24.00 ± 0.10
Raw Waveform      13.75 ± 0.11     6.72 ± 0.06   0.96 ± 0.05   3.59 ± 0.06  26.83 ± 0.10   0.95 ± 0.04  19.03 ± 0.12

Evaluation results (EER, %) of the baseline systems on the evaluation set: pooled EER and per-method EER for attack types A09∼A14. The best-performing result in each column is shown in bold.
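For reference, the sketch below shows one common way to compute the equal error rate (EER) reported above from detection scores, using scikit-learn's ROC utilities. The toy labels and scores are placeholders, not outputs of the baseline systems.

```python
# Minimal sketch: equal error rate (EER) from detection scores.
import numpy as np
from sklearn.metrics import roc_curve


def compute_eer(labels, scores):
    """EER: the operating point where false acceptance equals false rejection.

    labels: 1 for bonafide, 0 for deepfake; scores: higher means more bonafide.
    """
    fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
    fnr = 1 - tpr
    # Pick the threshold where |FPR - FNR| is smallest and average the two rates.
    idx = np.nanargmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2


labels = np.array([1, 1, 1, 0, 0, 0, 0])        # toy ground truth
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1])  # toy detection scores
print(f"EER = {compute_eer(labels, scores) * 100:.2f}%")
```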