DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

Abstract

Universal audio codecs learn entangled representations across audio types, whereas some specialized codecs offer decoupled representations but are limited to speech. Real-world audio, however, often mixes speech with background sounds, and downstream tasks require selective access to these components. We therefore rethink the audio codec as a universal disentangled representation learner that enables controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that decouples audio representations into orthogonal subspaces dedicated to speech and background sound and, within speech, further decomposes representations into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection (SOP) module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to separate semantic and paralinguistic information. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness from clean semantic features, and controllable background sound preservation/suppression in TTS.
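To make the two-subspace factorization concrete, here is a minimal PyTorch sketch of what a subspace orthogonal projection module could look like. The class name, the linear-projection parameterization, and the Frobenius-norm orthogonality penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SubspaceOrthogonalProjection(nn.Module):
    """Sketch: split encoder features into two (approximately) orthogonal subspaces."""

    def __init__(self, dim: int):
        super().__init__()
        # One learned projection per subspace; the penalty below keeps them apart.
        self.speech_proj = nn.Linear(dim, dim, bias=False)
        self.background_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) encoder output.
        # Returns the speech-subspace and background-subspace features,
        # which would feed the two parallel RVQ stacks.
        return self.speech_proj(z), self.background_proj(z)

    def orthogonality_loss(self) -> torch.Tensor:
        # ||W_s W_b^T||_F^2 -> 0 pushes the two projection bases apart,
        # so the subspaces carry non-overlapping information.
        w_s, w_b = self.speech_proj.weight, self.background_proj.weight
        return (w_s @ w_b.transpose(0, 1)).pow(2).sum()

# Toy usage:
sop = SubspaceOrthogonalProjection(dim=256)
z_speech, z_background = sop(torch.randn(2, 100, 256))
loss_orth = sop.orthogonality_loss()  # added to the codec training loss
```

Note that orthogonality alone does not decide which subspace holds speech; in the paper, the representation swap training procedure is what ties the two subspaces to speech and background sound, respectively.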

Figure 1. Overview of the proposed DeCodec.

Results: DeCodec-only

a. Overview of Functions

Table 1. Guidelines for using DeCodec to perform speech reconstruction, SE, background sound extraction, one-shot VC, and combined one-shot VC+SE.
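For concreteness, the recipes in Table 1 amount to choosing which quantized streams are fed to the decoder. The sketch below illustrates this; the `encode`/`decode` interface and the stream names (`semantic`, `paralinguistic`, `background`) are assumptions made for illustration, not DeCodec's actual API.

```python
from typing import Optional
import torch

def run_task(decodec, task: str, source: torch.Tensor,
             reference: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Recombine DeCodec streams per task (hypothetical interface).

    Assumes decodec.encode() returns a dict of quantized streams
    {'semantic', 'paralinguistic', 'background'} and decodec.decode()
    resynthesizes audio from any subset of them.
    """
    src = decodec.encode(source)
    if task == "reconstruction":           # keep every stream
        return decodec.decode(**src)
    if task == "se":                       # drop the background stream
        return decodec.decode(semantic=src["semantic"],
                              paralinguistic=src["paralinguistic"])
    if task == "bgs_extraction":           # keep only the background stream
        return decodec.decode(background=src["background"])
    if task in ("vc", "vc_se"):            # source semantics + reference timbre
        if reference is None:
            raise ValueError("VC needs a reference utterance")
        ref = decodec.encode(reference)
        streams = dict(semantic=src["semantic"],
                       paralinguistic=ref["paralinguistic"])
        if task == "vc":                   # VC+SE additionally drops the background
            streams["background"] = src["background"]
        return decodec.decode(**streams)
    raise ValueError(f"unknown task: {task}")
```

In this view, SE is simply decoding without the background stream, and one-shot VC+SE follows by also swapping in the reference's paralinguistic stream.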

Figure 3. Demos of audio tasks processed by DeCodec.

[Audio samples (a-1)–(f-2): input/output pairs for each task in Figure 3]

b. Codec Reconstruction

Table 2. Reconstruction quality evaluation of codec models. Best results are highlighted in BOLD.

Demos of reconstruction performance of codec models

Init | Encodec | DAC | SpeechTokenizer | DeCodec

c. Speech Enhancement

Table 3. DNSMOS scores of different SE models on the DNS Challenge test set. Best results are highlighted in BOLD.

Demos of SE, BGS extraction and semantic reconstruction

Init | StoRM | DeCodec: SE | DeCodec: BGS extraction | DeCodec: semantic reconstruction

d. One-shot VC

Table 4. Results of one-shot VC with different codec models on the noisy speech test set. Best results are highlighted in BOLD.

Demos of one-shot VC with different methods

Source | Reference | SpeechTokenizer | StoRM+SpeechTokenizer | DeCodec

e. Ablation Study

Table 5. Results of ablation studies on DeCodec on the noisy speech test set. Best results are highlighted in BOLD.

Figure 4. Analysis of the proposed SOP method.
Figure 5. Visualization of the quantized outputs of different SRVQ layers of DeCodec.

Results: Downstream Tasks

Table 6. WER* results of ASR based on different codec models. Best results are highlighted in BOLD.
Table 7. Subjective results of zero-shot TTS based on different codecs on the noisy speech test set. Best results are highlighted in BOLD.

Demos of zero-shot TTS

Audio Prompt | Text Prompt | SpeechTokenizer | StoRM+SpeechTokenizer | DeCodec: remove BGS | DeCodec: preserve BGS
On friday confession will be heard all the afternoon after beads.
I must know about you.
Then the leader parted from the line.
Some things is crystal clear I can feel it.
Welcome back, my old friend.

A fun extra application: film dubbing with VC

Audio prompt | Init | VC using the NAR of the downstream TTS model