DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

Abstract

Universal audio codecs learn entangled representations across audio types, whereas some specialized codecs offer decoupled representations but are limited to speech. Real-world audio, however, often mixes speech with background sounds, and downstream tasks require selective access to these components. We therefore rethink the audio codec as a universal disentangled representation learner that enables controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that decouples audio representations into orthogonal subspaces dedicated to speech and background sound and, within speech, further decomposes representations into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection (SOP) module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to separate semantic and paralinguistic information. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness from clean semantic features, and controllable background sound preservation/suppression in TTS.
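To make the two-subspace factorization concrete, here is a minimal PyTorch sketch of what a subspace orthogonal projection module could look like. The class name, the linear-projection parameterization, and the Frobenius-norm orthogonality penalty are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class SubspaceOrthogonalProjection(nn.Module):
    """Sketch: split encoder features into two (approximately) orthogonal subspaces."""

    def __init__(self, dim: int):
        super().__init__()
        # One learned projection per subspace; the penalty below keeps them apart.
        self.speech_proj = nn.Linear(dim, dim, bias=False)
        self.background_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) encoder output.
        # Returns the speech-subspace and background-subspace features,
        # which would feed the two parallel RVQ stacks.
        return self.speech_proj(z), self.background_proj(z)

    def orthogonality_loss(self) -> torch.Tensor:
        # ||W_s W_b^T||_F^2 -> 0 pushes the two projection bases apart,
        # so the subspaces carry non-overlapping information.
        w_s, w_b = self.speech_proj.weight, self.background_proj.weight
        return (w_s @ w_b.transpose(0, 1)).pow(2).sum()

# Toy usage:
sop = SubspaceOrthogonalProjection(dim=256)
z_speech, z_background = sop(torch.randn(2, 100, 256))
loss_orth = sop.orthogonality_loss()  # added to the codec training loss
```

Note that orthogonality alone does not decide which subspace holds speech; in the paper, the representation swap training procedure is what ties the two subspaces to speech and background sound, respectively.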

Figure 1. Overview of the proposed DeCodec.

Results: DeCodec-only

a. Overview of Functions

Table 1. Guidelines for using DeCodec to perform speech reconstruction, SE, background sound extraction, one-shot VC, and combined one-shot VC+SE.
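For concreteness, the recipes in Table 1 amount to choosing which quantized streams are fed to the decoder. The sketch below illustrates this; the `encode`/`decode` interface and the stream names (`semantic`, `paralinguistic`, `background`) are assumptions made for illustration, not DeCodec's actual API.

```python
from typing import Optional
import torch

def run_task(decodec, task: str, source: torch.Tensor,
             reference: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Recombine DeCodec streams per task (hypothetical interface).

    Assumes decodec.encode() returns a dict of quantized streams
    {'semantic', 'paralinguistic', 'background'} and decodec.decode()
    resynthesizes audio from any subset of them.
    """
    src = decodec.encode(source)
    if task == "reconstruction":           # keep every stream
        return decodec.decode(**src)
    if task == "se":                       # drop the background stream
        return decodec.decode(semantic=src["semantic"],
                              paralinguistic=src["paralinguistic"])
    if task == "bgs_extraction":           # keep only the background stream
        return decodec.decode(background=src["background"])
    if task in ("vc", "vc_se"):            # source semantics + reference timbre
        if reference is None:
            raise ValueError("VC needs a reference utterance")
        ref = decodec.encode(reference)
        streams = dict(semantic=src["semantic"],
                       paralinguistic=ref["paralinguistic"])
        if task == "vc":                   # VC+SE additionally drops the background
            streams["background"] = src["background"]
        return decodec.decode(**streams)
    raise ValueError(f"unknown task: {task}")
```

In this view, SE is simply decoding without the background stream, and one-shot VC+SE follows by also swapping in the reference's paralinguistic stream.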

Figure 3. Demos of audio tasks processed by DeCodec.

[Audio samples (a-1)–(f-2): input/output pairs for each task in Figure 3]

b. Codec Reconstruction

Table 2. Reconstruction quality evaluation of codec models. Best results are highlighted in BOLD.

Demos of reconstruction performance of codec models

Init | Encodec | DAC | SpeechTokenizer | DeCodec

c. Speech Enhancement

Table 3. DNSMOS scores of different SE models on the DNS Challenge test set. Best results are highlighted in BOLD.

Demos of SE, BGS extraction and semantic reconstruction

Init | StoRM | DeCodec: SE | DeCodec: BGS extraction | DeCodec: semantic reconstruction

d. One-shot VC

Table 4. Results of one-shot VC with different codec models on the noisy speech test set. Best results are highlighted in BOLD.

Demos of one-shot VC with different methods

Source | Reference | SpeechTokenizer | StoRM+SpeechTokenizer | DeCodec

e. Ablation Study

Table 5. Results of ablation studies on DeCodec on the noisy speech test set. Best results are highlighted in BOLD.

Figure 4. Analysis of the proposed SOP method.
Figure 5. Visualization of the quantized outputs of different SRVQ layers of DeCodec.

Results: Downstream Tasks

Table 6. WER* results of ASR based on different codec models. Best results are highlighted in BOLD.
Table 7. Subjective results of zero-shot TTS based on different codecs on the noisy speech test set. Best results are highlighted in BOLD.

Demos of zero-shot TTS

Audio Prompt | Text Prompt | SpeechTokenizer | StoRM+SpeechTokenizer | DeCodec: remove BGS | DeCodec: preserve BGS
On friday confession will be heard all the afternoon after beads.
I must know about you.
Then the leader parted from the line.
Some things is crystal clear I can feel it.
Welcome back, my old friend.

A fun extra application: film dubbing with VC

Audio prompt | Init | VC using the NAR of the downstream TTS model