DeCodec: Towards Controllable Audio Processing via Speech and Background Sound Decoupling in Neural Codecs

Abstract

General audio codecs learn entangled representations across audio types, whereas some specialized codecs offer decoupled representations but are limited to speech. Real-world audio, however, often mixes speech with background sounds, and downstream tasks require selective access to these components. We therefore propose DeCodec, which hierarchically decouples audio representations into speech and background sound, and further decomposes the speech representation into semantic and paralinguistic components, enabling flexible feature selection across different audio tasks. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement via representation recombination, improved robustness of speech recognition through decoupled semantic features, and controllable background sound preservation or suppression in speech synthesis.
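The subspace orthogonal projection described above can be illustrated with plain linear algebra. The following is a minimal sketch, assuming a fixed orthonormal basis purely for illustration; in DeCodec the projection is learned end-to-end, and the dimensions and variable names here are hypothetical.

```python
import numpy as np

# Hypothetical sketch: decompose a frame-level encoder feature x into two
# orthogonal components by projecting onto a subspace and its complement.
# A random orthonormal basis stands in for the learned projection.
rng = np.random.default_rng(0)
dim, sub = 8, 3

# Orthonormal basis W (dim x sub) for the "speech" subspace.
W, _ = np.linalg.qr(rng.standard_normal((dim, sub)))

x = rng.standard_normal(dim)      # encoder feature for one frame
speech = W @ (W.T @ x)            # projection onto span(W)
background = x - speech           # orthogonal complement ("background")

# The two components are orthogonal and sum back to the input,
# so each can be quantized by its own RVQ without losing information.
print(np.allclose(speech + background, x))   # True
print(abs(speech @ background) < 1e-9)       # True
```

Orthogonality is what makes the two token streams independently usable: zeroing one component does not corrupt the other.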

Figure 1. The overview of the proposed DeCodec.

Results: DeCodec-only

a. Overview of functions

Table 1. Guidelines for using DeCodec to perform speech reconstruction, speech enhancement, and background sound extraction.
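The task guidelines in Table 1 amount to choosing which decoupled token streams reach the decoder. Below is a hedged sketch of that selection logic; the stream names, the zero-token "silence" convention, and the function itself are assumptions for illustration, not DeCodec's actual API.

```python
# Hypothetical stream-selection sketch: DeCodec produces separate speech
# and background-sound (BGS) token streams, and each task keeps or drops
# one of them before decoding.
def select_streams(speech_tokens, bg_tokens, task):
    """Return the (speech, background) token streams fed to the decoder."""
    empty = lambda stream: [0] * len(stream)  # assumed "empty" codebook entry
    if task == "reconstruction":          # keep both components
        return speech_tokens, bg_tokens
    if task == "speech_enhancement":      # drop background sound
        return speech_tokens, empty(bg_tokens)
    if task == "bgs_extraction":          # drop speech
        return empty(speech_tokens), bg_tokens
    raise ValueError(f"unknown task: {task}")

print(select_streams([5, 7], [2, 9], "speech_enhancement"))  # ([5, 7], [0, 0])
```

Because the streams are decoupled, enhancement reduces to recombination rather than a separate denoising model.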

Figure 2. Demos of audio tasks processed by DeCodec.


b. Codec Reconstruction

Table 2. Reconstruction performance of codec models. Best results are highlighted in BOLD.

Demos of reconstruction performance of codec models

Init Encodec DAC SpeechTokenizer DeCodec

c. Speech Enhancement

Table 3. DNSMOS scores of different SE models on the DNS Challenge test set. Best results are highlighted in BOLD.

Demos of SE, BGS extraction and semantic reconstruction

Init StoRM DeCodec: SE DeCodec: BGS extraction DeCodec: semantic reconstruction

d. Ablation study

Table 4. Results of ablation studies of DeCodec on the noisy speech test set. Best results are highlighted in BOLD.

Figure 3. Analysis of the proposed SOP method.

Figure 4. Visualization of the quantized outputs of different SRVQ layers of DeCodec.

Results: Downstream Tasks

Table 5. The WER* results of ASR based on different codec models. Best results are highlighted in BOLD.

Table 6. The subjective results of zero-shot TTS based on different codecs on the noisy speech test set. Best results are highlighted in BOLD.

Demos of zero-shot TTS

Audio Prompt Text Prompt SpeechTokenizer StoRM+SpeechTokenizer DeCodec: remove BGS DeCodec: preserve BGS
On friday confession will be heard all the afternoon after beads.
I must know about you.
Then the leader parted from the line.
Some things is crystal clear I can feel it.
Welcome back, my old friend.

Extra experiment: one-shot VC with different methods

Table 7. Guidelines for using DeCodec to perform one-shot VC.
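Since DeCodec splits the speech representation into semantic and paralinguistic streams, one-shot VC can be framed as stream recombination: keep the source's semantic tokens and borrow the reference's paralinguistic tokens. The sketch below is a hypothetical illustration of that idea; the dict layout and function name are assumptions, not the paper's interface.

```python
# Hypothetical one-shot VC sketch with decoupled speech tokens:
# take "what is said" from the source and "how it sounds" from the reference.
def one_shot_vc(source, reference):
    """source/reference: dicts holding 'semantic' and 'paralinguistic' streams."""
    return {
        "semantic": source["semantic"],                  # linguistic content
        "paralinguistic": reference["paralinguistic"],   # speaker/style cues
    }

src = {"semantic": [1, 2, 3], "paralinguistic": [4, 4, 4]}
ref = {"semantic": [9, 9, 9], "paralinguistic": [7, 8, 7]}
print(one_shot_vc(src, ref))  # {'semantic': [1, 2, 3], 'paralinguistic': [7, 8, 7]}
```

Combining this with the background-sound stream selection also yields the VC+SE variant listed in the demo table.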

Source Reference SpeechTokenizer StoRM+SpeechTokenizer DeCodec(One-shot VC+SE)

Extra application: film dubbing with VC

Audio prompt Init VC using the NAR module of the downstream TTS model.