DeCodec: Towards Controllable Audio Processing via Speech and Background Sound Decoupling in Neural Codecs
Abstract
General audio codecs learn entangled representations across audio types, whereas certain specialized codecs offer decoupled representations but are limited to speech. Real-world audio, however, is often a mixture of speech and background sounds, and downstream tasks require selective access to these components.
Therefore, we propose DeCodec, which hierarchically decouples audio representations into speech and background sound, and further decomposes the speech representation into semantic and paralinguistic components, enabling flexible feature selection across different audio tasks.
Technically, built upon a neural codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection (SOP) module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. Together, these allow parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition.
Experimental results show that DeCodec maintains strong signal reconstruction while enabling new capabilities:
superior speech enhancement via representation recombination, improved robustness of speech recognition through decoupled semantic features, and controllable background sound preservation or suppression in speech synthesis.
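To make the projection idea concrete, below is a minimal, illustrative sketch of a subspace orthogonal projection: an encoder feature is split into a component inside a learned subspace and the orthogonal residual, which parallel quantizers could then process independently. This is not the authors' implementation; the module name, the rank hyperparameter, and the QR-based projector are assumptions used only to convey the decomposition.

```python
import torch
import torch.nn as nn

class SubspaceOrthogonalProjection(nn.Module):
    """Illustrative sketch (not the DeCodec implementation): split an encoder
    feature into a learned subspace and its orthogonal complement."""

    def __init__(self, dim: int, rank: int):
        super().__init__()
        # Learned basis whose span defines one subspace (assumed rank < dim).
        self.basis = nn.Parameter(torch.randn(dim, rank))

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim) encoder features.
        q, _ = torch.linalg.qr(self.basis)   # orthonormalize the learned basis
        projector = q @ q.T                  # projection matrix onto the subspace
        speech_like = x @ projector          # component inside the subspace
        background_like = x - speech_like    # orthogonal residual
        return speech_like, background_like
```

Note that the projection alone only guarantees orthogonality of the two parts; in DeCodec it is the representation swap training procedure that ties the two subspaces to speech and background sound.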
Figure 1. Overview of the proposed DeCodec.
Results: DeCodec-only
a. Overview of functions
Table 1. Guidelines for using DeCodec to perform speech reconstruction, speech enhancement, and background sound extraction.
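As a companion to Table 1, the sketch below shows how these three tasks can be viewed as choosing which token streams reach the decoder, following the decoupling described in the abstract. The `encode`/`decode` interface and argument order are hypothetical placeholders, not DeCodec's actual API.

```python
import torch

def run_task(decodec, audio: torch.Tensor, task: str) -> torch.Tensor:
    """Hypothetical usage sketch: `decodec` is assumed to expose encode()/decode()
    over separate speech and background sound (BGS) token streams."""
    speech_codes, bgs_codes = decodec.encode(audio)        # two parallel RVQ streams
    if task == "speech_reconstruction":
        return decodec.decode(speech_codes, bgs_codes)     # keep both streams
    if task == "speech_enhancement":
        return decodec.decode(speech_codes, None)          # drop the BGS tokens
    if task == "bgs_extraction":
        return decodec.decode(None, bgs_codes)             # keep only the BGS tokens
    raise ValueError(f"unknown task: {task}")
```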
Figure 2. Demos of audio tasks processed by DeCodec.
(Audio demos corresponding to panels a-1 through d-2 of Figure 2.)
b. Codec Reconstruction
Table 2. Reconstruction performance of codec models. Best results are highlighted in BOLD.
Demos of reconstruction performance of codec models
Init
Encodec
DAC
SpeechTokenizer
DeCodec
c. Speech Enhancement
Table 3. DNSMOS scores of different speech enhancement (SE) models on the DNS Challenge test set. Best results are highlighted in BOLD.
Demos of SE, background sound (BGS) extraction, and semantic reconstruction
Init
StoRM
DeCodec: SE
DeCodec: BGS extraction
DeCodec: semantic reconstruction
d. Ablation Study
Table 4. Results of ablation studies on DeCodec on the noisy speech test set. Best results are highlighted in BOLD.
Figure 3. Analysis of the proposed SOP method.
Figure 4. Visualization of the quantized outputs of different SRVQ layers of DeCodec.
Results: Downstream Tasks
Table 5. The WER* results of ASR based on different codec models. Best results are highlighted in BOLD.
Table 6. The subjective results of zero-shot TTS based on different codecs on the noisy speech test set. Best results are highlighted in BOLD.
Demos of zero-shot TTS
Audio Prompt
Text Prompt
SpeechTokenizer
StoRM+SpeechTokenizer
DeCodec: remove BGS
DeCodec: preserve BGS
On friday confession will be heard all the afternoon after beads.
I must know about you.
Then the leader parted from the line.
Some things is crystal clear I can feel it.
Welcome back, my old friend.
Extra experiment: one-shot voice conversion (VC) with different methods
Table 7. Guidelines for using DeCodec to perform one-shot VC.