1JIUTIAN Research, China
2State Key Laboratory of Multimedia Information Processing, Peking University, China
General audio codecs learn entangled representations across audio types, whereas some specialized codecs offer decoupled representations but are limited to speech. Real-world audio, however, often mixes speech with background sounds, and downstream tasks require selective access to these components. We therefore propose DeCodec, which hierarchically decouples audio representations into speech and background sound, and further decomposes the speech representation into semantic and paralinguistic components, enabling flexible feature selection across different audio tasks. Technically, built on a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection (SOP) module that factorizes the input into two decoupled orthogonal subspaces, and a representation-swap training procedure that ensures these two subspaces correspond to speech and background sound, respectively. This allows parallel RVQs to quantize the speech and background sound components independently. Furthermore, we apply semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement via representation recombination, improved robustness of speech recognition through decoupled semantic features, and controllable background sound preservation or suppression in speech synthesis.
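The core idea of the subspace orthogonal projection can be illustrated with a minimal sketch. This is an assumed, simplified implementation, not the paper's actual module: the learned basis `W`, the function name, and all shapes are hypothetical, and the real SOP is trained end-to-end with the representation-swap procedure rather than applied to fixed features.

```python
import numpy as np

def orthogonal_subspace_split(h, W):
    """Sketch of an SOP-style factorization (hypothetical, not DeCodec's code).

    h: (T, D) encoder frame features.
    W: (D, K) learned basis whose columns span one subspace (e.g. speech).
    Returns the projection of h onto that subspace and its orthogonal residual
    (e.g. background sound), which together sum back to h.
    """
    Q, _ = np.linalg.qr(W)        # orthonormalize the learned basis
    speech = h @ Q @ Q.T          # component inside the subspace
    background = h - speech       # component in the orthogonal complement
    return speech, background
```

The design point is that projecting onto an orthonormalized basis makes the two streams exactly orthogonal frame by frame while their sum reconstructs the original feature, so two parallel RVQ branches can quantize them independently without sharing information.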
Figure 1. The overview of the proposed DeCodec.
Table 1. Guidelines for using DeCodec to perform speech reconstruction, speech enhancement, and background sound extraction.
Figure 2. Demos of audio tasks processed by DeCodec.
| (a-1) | (b-1) | (c-1) | (d-1) |
|---|---|---|---|
| (a-2) | (b-2) | (c-2) | (d-2) |
Table 2. Reconstruction performance of codec models. Best results are highlighted in BOLD.
| Init | Encodec | DAC | SpeechTokenizer | DeCodec |
|---|---|---|---|---|
Table 3. The DNSMOS scores of SE based on different SE models on the DNS Challenge test set. Best results are highlighted in BOLD.
| Init | StoRM | DeCodec: SE | DeCodec: BGS extraction | DeCodec: semantic reconstruction |
|---|---|---|---|---|
Table 4. Results of ablation studies on DeCodec on the noisy speech test set. Best results are highlighted in BOLD.
| Figure 3. Analysis of the proposed SOP method. | Figure 4. Visualization of the quantized outputs of different SRVQ layers of DeCodec. |
|---|---|
| Table 5. The WER* results of ASR based on different codec models. Best results are highlighted in BOLD. | Table 6. The subjective results of zero-shot TTS based on different codecs on the noisy speech test set. Best results are highlighted in BOLD. |
|---|---|
| Audio Prompt | Text Prompt | SpeechTokenizer | StoRM+SpeechTokenizer | DeCodec: remove BGS | DeCodec: preserve BGS |
|---|---|---|---|---|---|
| On friday confession will be heard all the afternoon after beads. | | | | | |
| I must know about you. | | | | | |
| Then the leader parted from the line. | | | | | |
| Some things is crystal clear I can feel it. | | | | | |
| Welcome back, my old friend. | | | | | |
Table 7. Guidelines for using DeCodec to perform one-shot VC.
| Source | Reference | SpeechTokenizer | StoRM+SpeechTokenizer | DeCodec (one-shot VC + SE) |
|---|---|---|---|---|
| Audio prompt | Init | VC using the NAR module of the downstream TTS model. |
|---|---|---|