
Resolution Consistency Training on Time-Frequency Domain for Semi-Supervised Sound Event Detection
Fine-tuning Audio Spectrogram Transformer with Task-aware Adapters for Sound Event Detection
TFECN: Time-Frequency Enhanced ConvNet for Audio Classification
Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D.
Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang
Adapting Language-Audio Models as Few-Shot Audio Learners
Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms
Saksham Singh Kushwaha, Magdalena Fuentes
Xiao-Min Zeng, Yan Song, Ian McLoughlin, Lin Liu, Li-Rong Dai
A multimodal prototypical approach for unsupervised sound classification
Robust Prototype Learning for Anomalous Sound Detection
Selective Biasing with Trie-based Contextual Adapters for Personalised Speech Recognition using Neural Transducers
Kinan Martin, Jon Gauthier, Canaan Breiss, Roger Levy
Probing Self-supervised Speech Models for Phonetic and Phonemic Information: A Case Study in Aspiration
Yingying Gao, Shilei Zhang, Zihao Cui, Chao Deng, Junlan Feng
Cascaded Multi-task Adaptive Learning Based on Neural Architecture Search
Yassir Fathullah, Chunyang Wu, Yuan Shangguan, Junteng Jia, Wenhan Xiong, Jay Mahadeokar, Chunxi Liu, Yangyang Shi, Ozlem Kalinli, Mike Seltzer, Mark J. Gales
Multi-Head State Space Model for Speech Recognition
Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Suo Hongbin, Yulong Wan
Factual Consistency Oriented Speech Recognition
Task-Agnostic Structured Pruning of Speech Representation Models
Hang Zhou, Xiaoxu Zheng, Yunhe Wang, Michael Bi Mi, Deyi Xiong, Kai Han
GhostRNN: Reducing State Redundancy in RNN with Cheap Operations
Zelin Wu, Tsendsuren Munkhdalai, Pat Rondon, Golan Pundak, Khe Chai Sim, Christopher Li
Hongfei Xue, Qijie Shao, Peikun Chen, Pengcheng Guo, Lei Xie, Jie Liu
Dual-Mode NAM: Effective Top-K Context Injection for End-to-End ASR
TranUSR: Phoneme-to-word Transcoder Based Unified Speech Representation Learning for Cross-lingual Speech Recognition
LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems
Mayank Kumar Singh, Naoya Takahashi, Naoyuki Onoe
Iteratively Improving Speech Recognition and Voice Conversion
Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech
Tamas Grosz, Yaroslav Getman, Ragheb Al-Ghezi, Aku Rouhe, Mikko Kurimo
Investigating wav2vec2 context representations and the effects of fine-tuning, a case-study of a Finnish model
Yochai Blau, Rohan Agrawal, Lior Madmony, Gary Wang, Andrew Rosenberg, Zhehuai Chen, Zorik Gekhman, Genady Beryozkin, Parisa Haghani, Bhuvana Ramabhadran
Using Text Injection to Improve Recognition of Personal Identifiers in Speech
Speech Recognition: Signal Processing, Acoustic Modeling, Robustness, Adaptation 1
Léa-Marie Lam-Yee-Mui, Lucas Ondel Yang, Ondřej Klejch
Ziyang Ma, Zhisheng Zheng, Changli Tang, Yujin Wang, Xie Chen
Comparing Self-Supervised Pre-Training and Semi-Supervised Training for Speech Recognition in Languages with Weak Language Models
MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi
O-1: Self-training with Oracle and 1-best Hypothesis
Zhao Yang, Dianwen Ng, Chong Zhang, Xiao Fu, Rui Jiang, Wei Xi, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma, Jizhong Zhao
Salah Zaiem, Titouan Parcollet, Slim Essid
Dual Acoustic Linguistic Self-supervised Representation Learning for Cross-Domain Speech Recognition
Yifan Peng, Yui Sudo, Shakeel Muhammad, Shinji Watanabe
Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations
Guangyan Zhang, Thomas Merritt, Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
Rui Liu, Haolin Zuo, De Hu, Guanglai Gao, Haizhou Li
Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech
Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus
Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari
Explicit Intensity Control for Accented Text-to-speech
Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Zhao-Ci Liu, Zhen-Hua Ling, Ya-Jun Hu, Jia Pan, Jin-Wei Wang, Yun-Di Wu
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Speech Synthesis with Self-Supervisedly Learnt Prosodic Representations
Jianrong Wang, Yaxin Zhao, Li Liu, Tianyi Xu, Qi Li, Sen Li

Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks
