DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
Abstract
Universal audio codecs learn entangled representations across audio types, whereas certain specialized codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds, and downstream tasks require selective access to these components. We therefore rethink the audio codec as a universal disentangled representation learner, enabling controllable feature selection across different audio tasks.
To this end, we introduce DeCodec, a novel neural codec that learns to decouple audio representations into orthogonal subspaces dedicated to speech and background sound, and within speech, representations are further decomposed into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications.
Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition.
Experimental results show that DeCodec maintains competitive signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness through clean semantic features, and controllable background sound preservation or suppression in TTS.
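The subspace orthogonal projection described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes the speech subspace is spanned by a given orthonormal basis, whereas DeCodec learns the projection end-to-end; the function name and shapes are illustrative only.

```python
import numpy as np

def orthogonal_subspace_projection(z, basis):
    """Split frame-wise features z (T, D) into a component inside the
    subspace spanned by `basis` (D, K, orthonormal columns) and its
    orthogonal residual. In DeCodec's terms, the in-subspace part would
    correspond to speech and the residual to background sound."""
    speech = z @ basis @ basis.T   # projection onto the speech subspace
    background = z - speech        # residual in the orthogonal complement
    return speech, background
```

By construction the two outputs sum back to the input and are mutually orthogonal, which is what lets two parallel RVQs quantize them independently without interference.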
Figure 1. The overview of the proposed DeCodec.
Results: DeCodec-only
a. Overview of functions
Table 1. Guidelines for using DeCodec to perform speech reconstruction, SE, background sound extraction, one-shot VC, and one-shot VC+SE functions.
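The feature-selection logic behind these functions can be sketched as token-stream recombination. The class and function names below are hypothetical, as is the exact layer-to-stream mapping; the sketch only illustrates the idea that SE drops the background stream, one-shot VC swaps in the reference's paralinguistic stream, and VC+SE does both.

```python
from dataclasses import dataclass, field

@dataclass
class DeCodecTokens:
    # Assumed split: speech-RVQ first layer carries semantics, the
    # remaining speech-RVQ layers carry paralinguistic information,
    # and a parallel RVQ carries background sound.
    semantic: list = field(default_factory=list)
    paralinguistic: list = field(default_factory=list)
    background: list = field(default_factory=list)

def select_features(source, reference=None, keep_background=False):
    """Hypothetical recombination: semantic tokens always come from the
    source; paralinguistic tokens come from the reference if one is given
    (one-shot VC), otherwise from the source; the background stream is
    dropped unless explicitly kept (SE / VC+SE)."""
    return DeCodecTokens(
        semantic=source.semantic,
        paralinguistic=(reference or source).paralinguistic,
        background=source.background if keep_background else [],
    )
```

Under this sketch, reconstruction is `select_features(src, keep_background=True)`, SE is `select_features(src)`, and one-shot VC+SE is `select_features(src, reference=ref)`.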
Figure 3. Demos of audio tasks processed by DeCodec.
Panels (a-1)/(a-2) through (f-1)/(f-2): paired input/output audio demos.
b. Codec Reconstruction
Table 2. Reconstruction quality evaluation of codec models. Best results are highlighted in BOLD.
Demos of reconstruction performance of codec models
Init
Encodec
DAC
SpeechTokenizer
DeCodec
c. Speech Enhancement
Table 3. DNSMOS scores of different SE models on the DNS Challenge test set. Best results are highlighted in BOLD.
Demos of SE, BGS extraction and semantic reconstruction
Init
StoRM
DeCodec: SE
DeCodec: BGS extraction
DeCodec: semantic reconstruction
d. One-shot VC
Table 4. Results of one-shot VC on different codec models on the noisy speech test set. Best results are highlighted in BOLD.
Demos of one-shot VC with different methods
Source
Reference
SpeechTokenizer
StoRM+SpeechTokenizer
DeCodec
e. Ablation Study
Table 5. Results of ablation studies on DeCodec on the noisy speech test set. Best results are highlighted in BOLD.
Figure 4. Analysis of the proposed SOP method.
Figure 5. Visualization of quantized outputs of different SRVQ layers of DeCodec.
Results: Downstream Tasks
Table 6. The WER* results of ASR based on different codec models. Best results are highlighted in BOLD.
Table 7. The subjective results of zero-shot TTS based on different codecs on the noisy speech test set. Best results are highlighted in BOLD.
Demos of zero-shot TTS
Audio Prompt
Text Prompt
SpeechTokenizer
StoRM+SpeechTokenizer
DeCodec: remove BGS
DeCodec: preserve BGS
On friday confession will be heard all the afternoon after beads.