中国科学技术大学 Yang Ai--Home--Home

aiyang

Special Associate Researcher
Supervisor of Master's Candidates
Name (English):Yang Ai
Name (Pinyin):aiyang
E-Mail:
Education Level:Postgraduate (Doctoral)
Degree:Dr
Professional Title:Special Associate Researcher
Alma Mater:中国科学技术大学
Teacher College:School of information Science and Technology
Discipline:Information and Communication Engineering

Contact Information

ZipCode：
PostalAddress：
Telephone：
Email：

Profile
Research Focus
Honors & Awards
Social Affiliations

艾杨，现任中国科学技术大学信息科学技术学院电子工程与信息科学系特任副研究员，主要研究方向包括语音合成、语音增强、语音分离、音频编码和音频质量评价等，在语音领域顶刊IEEE TASLP及语音领域顶会ICASSP/Interspeech等上共发表论文50余篇。入选2024年度“小米青年学者”。

教育经历

2012年9月—2016年6月厦门大学通信工程专业学士
2016年9月—2021年6月中国科学技术大学信息与通信工程专业博士（导师：凌震华教授）

科研与学术工作经历

2020年2月—2020年8月日本国立情报学研究所联合培养博士生
2021年7月—2022年3月国防科技大学讲师
2022年4月—2023年12月中国科学技术大学博士后研究员
2024年1月至今中国科学技术大学特任副研究员

科研项目

主持项目

国家自然科学基金委员会，国家自然科学基金青年项目，面向语音生成的抗卷绕相位谱预测，2024-01 至 2026-12，30万元
安徽省科学技术厅，安徽省自然科学基金青年项目，结合相位预测的高质量高效率辅助式语音增强方法研究，2023-09 至 2025-08，8万元
中国科学技术大学, 青年创新基金, 高效率高鲁棒的神经网络声码器研究, 2023-01 至 2024-12，9万元

参与项目

科技部，科技部攻关项目子课题，智能语音移植模型和算法工具包研发，2022-01至2024-12，500万元（排名2/34）
国家自然科学基金委员会，国家自然科学基金联合项目，感知驱动的细粒度语音表征解耦与跨模态可控语音语音合成，2024-01 至 2027-12，260万元（排名7/21）
中科院, 战略性先导科技专项（C类）课题，多语种语音合成关键技术，2020-01 至 2022-12，1632万元（排名2/35）
科技部, 国家重点研发计划项目课题，面向冬奥场景的多语种语音处理关键技术，2019-10 至 2022-06，338万元 (3/31)
国家自然科学基金委员会, 国家自然科学基金面上项目，面向语音合成的神经网络声码器研究，2019-01-01 至 2022-12-31， 63万元（排名7/8）

论文发表

2022年及以后

第一作者+通讯作者论文列表

Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, and Zhen-Hua Ling*, “APCodec: A neural audio codec with parallel amplitude and phase spectrum encoding and decoding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3256–3269, 2024.
Yang Ai, and Zhen-Hua Ling*, “Low-latency neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses for speech generation tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2283–2296, 2024.
Yang Ai, and Zhen-Hua Ling*, “APNet: An all-frame-level neural vocoder incorporating direct prediction of amplitude and phase spectra,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2145–2157, 2023.
Yang Ai*, Zhen-Hua Ling, Wei-Lu Wu and Ang Li, “Denoising-and-dereverberation hierarchical neural vocoder for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2036–2048, 2022.
Yang Ai, Ye-Xin Lu and Zhen-Hua Ling*, “Long-frame-shift neural speech phase prediction with spectral continuity enhancement and interpolation error compensation,” IEEE Signal Processing Letters, vol. 30, pp. 1097-1101, 2023.
Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, and Zhen-Hua Ling*, “A low-bitrate neural audio codec framework with bandwidth reduction and recovery for high-sampling-rate waveforms,” in Proc. Interspeech, 2024, pp. 1765-1769.
Yang Ai, and Zhen-Hua Ling*, “Neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses,” in Proc. ICASSP, 2023, pp. 1-5.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “Explicit estimation of magnitude and phase spectra in parallel for high-quality speech enhancement,” Neural Networks, vol. 189, pp. 107562, 2025.
Ye-Xin Lu, Yang Ai*, Hui-Peng Du, and Zhen-Hua Ling, “Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 236–250, 2025.
Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Yang Ai*, and Zhen-Hua Ling, “ERVQ: Enhanced residual vector quantization with intra-and-inter-codebook optimization for neural audio codecs,” accepted by IEEE Transactions on Audio, Speech, and Language Processing, 2025.
Xiao-Hang Jiang, Yang Ai*, Rui-Chen Zheng, and Zhen-Hua Ling, “A streamable neural audio codec with residual scalar-vector quantization for real-time communication,” accepted by IEEE Signal Processing Letters, 2025.
Rui-Chen Zheng, Yang Ai*, Zhen-Hua Ling, “Speech reconstruction from silent lip and tongue articulation by diffusion models and text-guided pseudo target generation,” in Proc. ACM MM, 2024, pp. 6559-6568.
Ye-Xin Lu, Yang Ai*, Zheng-Yan Sheng, and Zhen-Hua Ling, “Multi-stage speech bandwidth extension with flexible sampling rates control,” in Proc. Interspeech, 2024, pp. 2270-2274.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “BiVocoder: A bidirectional neural vocoder integrating feature extraction and waveform generation,” in Proc. Interspeech, 2024, pp. 3894-3898.
Fei Liu, Yang Ai*, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, and Zhen-Hua Ling, “Stage-wise and prior-aware neural speech phase prediction,” in Proc. SLT, 2024, pp. 648-654.
Xiao-Hang Jiang, Yang Ai*, Rui-Chen Zheng, Hui-Peng Du, Ye-Xin Lu, and Zhen-Hua Ling, “MDCTCodec: A lightweight MDCT-based neural audio codec towards high sampling rate and low bitrate scenarios,” in Proc. SLT, 2024, pp. 550-557.
Yu-Fei Shi, Yang Ai*, Ye-Xin Lu, Hui-Peng Du, and Zhen-Hua Ling, “Pitch-and-spectrum-aware singing quality assessment with bias correction and model fusion,” in Proc. SLT, 2024, pp. 821-827.
Hui-Peng Du, Yang Ai*, Rui-Chen Zheng, and Zhen-Hua Ling, “APCodec+: A spectrum-coding-based high-fidelity and high-compression-rate neural audio codec with staged training paradigm,” in Proc. ISCSLP, 2024, pp. 676-680.
Yu-Fei Shi, Ye-Xin Lu, Yang Ai*, Hui-Peng Du, and Zhen-Hua Ling, “SAMOS: A neural MOS prediction model leveraging semantic representations and acoustic features,” in Proc. ISCSLP, 2024, pp. 199-203.
Xiao-Hang Jiang, Hui-Peng Du, Yang Ai*, Ye-Xin Lu, and Zhen-Hua Ling, “ESTVocoder: An excitation-spectral-transformed neural vocoder conditioned on mel spectrogram,” in Proc. NCMMSC, 2024, pp. 114-128.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “A neural denoising vocoder for clean waveform generation from noisy mel-spectrogram based on amplitude and phase predictions,” in Proc. NCMMSC, 2024, pp. 144-152.
Rui-Chen Zheng, Yang Ai*, and Zhen-Hua Ling, “Speech reconstruction from silent tongue and lip articulation by pseudo target generation and domain adversarial training,” in Proc. ICASSP, 2023, pp. 1-5.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra,” in Proc. Interspeech, 2023, pp. 3834-3838.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “APNet2: High-quality and high-efficiency neural vocoder with direct prediction of amplitude and phase spectra,” in Proc. NCMMSC, 2023, pp. 66-80.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “Source-filter-based generative adversarial neural vocoder for high fidelity speech synthesis,” in Proc. NCMMSC, 2022, pp. 68-80.
Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai*, Zhen-Hua Ling, “Incremental disentanglement for environment-aware zero-shot text-to-speech synthesis,” in Proc. ICASSP, 2025, pp. 1-5.
Yu Guan, Yang Ai*, Zuoliang Li, Shengyu Peng, Wu Guo, “Recursive feature learning from pre-trained models for spoofing speech detection,” in Proc. ICASSP, 2025, pp. 1-5.
Shengyu Peng, Wu Guo, Jie Zhang, Zuoliang Li, Yu Guan, Bin Gu, Yang Ai*, “A study of multi-scale feature learning from pre-trained models on speaker verification,” in Proc. ICASSP, 2025, pp. 1-5.
Zuoliang Li, Yang Ai*, Jie Zhang, Shengyu Peng, Yu Guan, Bin Gu, Wu Guo, “Aligning noisy-clean speech pairs at feature and embedding levels for learning noise-invariant speaker representations,” in Proc. ICASSP, 2025, pp. 1-5.
Yao Guo, Yang Ai*, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling, “Vision-integrated high-quality neural speech coding,” accepted by Interspeech, 2025.
Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai*, Zhen-Hua Ling, “Improving noise robustness of LLM-based zero-shot TTS via discrete acoustic token denoising,” accepted by Interspeech, 2025.
Yu-Fei Shi, Yang Ai*, Zhen-Hua Ling, “Universal preference-score-based pairwise speech quality assessment,” accepted by Interspeech, 2025.

其他论文列表

Rui-Chen Zheng, Yang Ai, and Zhen-Hua Ling, “Incorporating ultrasound tongue images for audio-visual speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1430–1444, 2024.
Shi-Ming Wang, Yang Ai, Li-Ping Chen, Ya-Jun Hu, and Zhen-Hua Ling, “TEAR: A Cross-modal Pre-trained Text Encoder Enhanced by Acoustic Representations for Speech Synthesis,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 24, no. 3, pp. 1–15, 2025.
Shi-Ming Wang, Li-Ping Chen, Yang Ai, Ya-Jun Hu, and Zhen-Hua Ling, “PhonemeVec: A Phoneme-level contextual prosody representation for speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 1117–1128, 2025.
Zheng-Yan Sheng, Li-Juan Liu, Yang Ai, Jia Pan, and Zhen-Hua Ling, “Voice attribute editing with text prompt,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 1641–1652, 2025.
Kang-Di Mei, Zhao-Ci Liu, Hui-Peng Du, Heng-Yu Li, Yang Ai, Li-Ping Chen, Zhen-Hua Ling, “Considering temporal connection between turns for conversational speech synthesis,” in Proc. ICASSP, 2024, pp. 11426-11430.
Heng-Yu Li, Kang-Di Mei, Zhao-Ci Liu, Yang Ai, Li-Ping Chen, Zhen-Hua Ling, “Refining self-supervised learnt speech representation using brain activations,” in Proc. Interspeech, 2024, pp. 1480-1484.
Yuan Jiang, Shun Bao, Ya-Jun Hu, Li-Juan Liu, Guo-Ping Hu, Yang Ai, and Zhen-Hua Ling, “Online speaker adaptation for WaveNet-based neural vocoders,” in Proc. ICDSP, 2024, pp. 112-117.
Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, and Zhen-Hua Ling, “Face-driven zero-shot voice conversion with memory-based face-voice alignment,” in Proc. ACM MM, 2023, pp. 8443-8452.
Rui-Chen Zheng, Yang Ai, and Zhen-Hua Ling, “Incorporating ultrasound tongue images for audio-visual speech enhancement through knowledge distillation,” in Proc. Interspeech, 2023, pp. 844-848.
Zheng-Yan Sheng, Yang Ai, and Zhen-Hua Ling, “Zero-shot personalized lip-to-speech synthesis with face image based voice control,” in Proc. ICASSP, 2023, pp. 1-5.
Hao-Chen Wu, Zhu-Hai Li, Lu-Zhen Xu, Zhen-Tao Zhang, Wen-Ting Zhao, Bin Gu, Yang Ai, Ye-Xin Lu, Jie-Zhang, Zhen-Hua Ling and Wu Guo, “The USTC-NERCSLIP system for the track 1.2 of audio deepfake detection (ADD 2023) challenge,” in Proc. IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023, pp. 119-124.
Hao-Jian Lin, Yang Ai, and Zhen-Hua Ling, “A light CNN with split batch normalization for spoofed speech detection using data augmentation,” in Proc. APSIPA, 1684 – 1689, 2022.
Han-Jie Guo, Hui-Peng Du, Zheng-Yan Sheng, Li-Ping Chen, Yang Ai, Zhen-Hua Ling, “CASC-XVC: Zero-shot cross-lingual voice conversion with content accordant and speaker contrastive losses,” accepted by ICASSP, 2025.
Yin-Long Liu, Rui Feng, Ye-Xin Lu, Jia-Xin Chen, Yang Ai, Jia-Hong Yuan, Zhen-Hua Ling, “Can automated speech recognition errors provide valuable clues for alzheimer’s disease detection?,” accepted by ICASSP, 2025.

2022年以前

论文列表

Yang Ai and Zhen-Hua Ling, “A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 839–851, 2020.
Yang Ai, Hong-Chuan Wu, and Zhen-Hua Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in Proc. ICASSP, 2018, pp. 5659-5663.
Yang Ai, Jing-Xuan Zhang, Liang Chen, and Zhen-Hua Ling, “DNN-based spectral enhancement for neural waveform generators with low-bit quantization,” in Proc. ICASSP, 2019, pp. 7025-7029.
Yang Ai and Zhen-Hua Ling, “Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders,” in Proc. Interspeech, 2020, pp. 190-194.
Yang Ai, Xin Wang, Junichi Yamagishi and Zhen-Hua Ling, “Reverberation modeling for source-filter-based neural vocoder,” in Proc. Interspeech, 2020, pp.3560-3564.
Yang Ai, Hao-Yu Li, Xin Wang, Junichi Yamagishi and Zhen-Hua Ling, “Denoising-and-dereverberation hierarchical neural vocoder for robust waveform generation,” in Proc. SLT, 2021, pp. 477-484.
Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai, “Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 5, pp. 883–894, 2018.
Yuan Jiang, Ya-Jun Hu, Li-Juan Liu, Hong-Chuan Wu, Zhi-Kun Wang, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “The USTC system for blizzard challenge 2019,”in Blizzard Challenge Workshop, 2019.
Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “Singing voice synthesis using deep autoregressive neural networks for acoustic modeling,” in Proc. Interspeech, 2019, pp. 2593–2597.
Qiu-Chen Huang, Yang Ai, and Zhen-Hua Ling, “Online speaker adaptation for WaveNet-based neural vocoders,” in Proc. APSIPA, 2020, pp. 815-820.
Hao-Yu Li, Yang Ai, and Junichi Yamagishi, “Enhancing low-quality voice recordings using disentangled channel factor and neural waveform model,” in Proc. SLT, 2021, pp. 2452-2456.
Chang Liu, Yang Ai, and Zhen-Hua Ling, “Phase spectrum recovery for enhancing low-quality speech captured by laser microphones,” in Proc. ISCSLP, 2021, pp. 1-5.
Kun Shao, Jun-An Yang, Yang Ai, Hui Liu and Yu Zhang, “BDDR: An effective defense against textual backdoor attacks,” Computers & Security, vol. 110, pp. 102433, 2021.