Yang Ai, University of Science and Technology of China

Yang Ai (艾杨)

Specially Appointed Associate Research Fellow, Master's Supervisor
English name: Yang Ai
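Education level: Postgraduate (doctoral)
Degree: Ph.D.
Alma mater: University of Science and Technology of China
Discipline: Information and Communication Engineering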

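Biography

Yang Ai is a Specially Appointed Associate Research Fellow in the Department of Electronic Engineering and Information Science, School of Information Science and Technology, University of Science and Technology of China (USTC). Main research interests include speech synthesis, speech enhancement, speech separation, audio coding, and audio quality assessment. Yang Ai has published more than 60 papers in IEEE TASLP, a top journal in the speech field, and at top speech conferences such as ICASSP and Interspeech, and was selected as a 2024 “Xiaomi Young Scholar.”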

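Education

Sep. 2012 to Jun. 2016: Bachelor's degree in Communication Engineering, Xiamen University
Sep. 2016 to Jun. 2021: Ph.D. in Information and Communication Engineering, University of Science and Technology of China (advisor: Prof. Zhen-Hua Ling)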

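Research and Academic Work Experience

Feb. 2020 to Aug. 2020: Jointly supervised Ph.D. student, National Institute of Informatics, Japan
Jul. 2021 to Mar. 2022: Lecturer, National University of Defense Technology
Apr. 2022 to Dec. 2023: Postdoctoral Researcher, University of Science and Technology of China
Jan. 2024 to present: Specially Appointed Associate Research Fellow, University of Science and Technology of China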

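Research Projects

Projects as Principal Investigator

National Natural Science Foundation of China, Young Scientists Fund, “Anti-wrapping phase spectrum prediction for speech generation,” 2024-01 to 2026-12, CNY 300,000
Anhui Provincial Department of Science and Technology, Anhui Provincial Natural Science Foundation Youth Project, “Research on high-quality and high-efficiency assisted speech enhancement combined with phase prediction,” 2023-09 to 2025-08, CNY 80,000
University of Science and Technology of China, Youth Innovation Fund, “Research on high-efficiency and high-robustness neural vocoders,” 2023-01 to 2024-12, CNY 90,000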

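Projects as Participant

Ministry of Science and Technology, sub-task of a Ministry of Science and Technology key research project, “Development of intelligent speech porting models and algorithm toolkits,” 2022-01 to 2024-12, CNY 5 million (ranked 2/34)
National Natural Science Foundation of China, Joint Fund project, “Perception-driven fine-grained speech representation disentanglement and cross-modal controllable speech synthesis,” 2024-01 to 2027-12, CNY 2.6 million (ranked 7/21)
Chinese Academy of Sciences, Strategic Priority Research Program (Category C) task, “Key technologies for multilingual speech synthesis,” 2020-01 to 2022-12, CNY 16.32 million (ranked 2/35)
Ministry of Science and Technology, National Key R&D Program project task, “Key technologies for multilingual speech processing for Winter Olympics scenarios,” 2019-10 to 2022-06, CNY 3.38 million (ranked 3/31)
National Natural Science Foundation of China, General Program, “Research on neural vocoders for speech synthesis,” 2019-01-01 to 2022-12-31, CNY 630,000 (ranked 7/8)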

Publications

2022 and Later

First-Author and Corresponding-Author Papers

Yang Ai, Xiao-Hang Jiang, Ye-Xin Lu, Hui-Peng Du, and Zhen-Hua Ling*, “APCodec: A neural audio codec with parallel amplitude and phase spectrum encoding and decoding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3256–3269, 2024.
Yang Ai and Zhen-Hua Ling*, “Low-latency neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses for speech generation tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2283–2296, 2024.
Yang Ai and Zhen-Hua Ling*, “APNet: An all-frame-level neural vocoder incorporating direct prediction of amplitude and phase spectra,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2145–2157, 2023.
Yang Ai*, Zhen-Hua Ling, Wei-Lu Wu and Ang Li, “Denoising-and-dereverberation hierarchical neural vocoder for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 2036–2048, 2022.
Yang Ai, Ye-Xin Lu and Zhen-Hua Ling*, “Long-frame-shift neural speech phase prediction with spectral continuity enhancement and interpolation error compensation,” IEEE Signal Processing Letters, vol. 30, pp. 1097-1101, 2023.
Yang Ai, Ye-Xin Lu, Xiao-Hang Jiang, Zheng-Yan Sheng, Rui-Chen Zheng, and Zhen-Hua Ling*, “A low-bitrate neural audio codec framework with bandwidth reduction and recovery for high-sampling-rate waveforms,” in Proc. Interspeech, 2024, pp. 1765-1769.
Yang Ai and Zhen-Hua Ling*, “Neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses,” in Proc. ICASSP, 2023, pp. 1-5.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “Explicit estimation of magnitude and phase spectra in parallel for high-quality speech enhancement,” Neural Networks, vol. 189, pp. 107562, 2025.
Ye-Xin Lu, Yang Ai*, Hui-Peng Du, and Zhen-Hua Ling, “Towards high-quality and efficient speech bandwidth extension with parallel amplitude and phase prediction,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 236–250, 2025.
Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Yang Ai*, and Zhen-Hua Ling, “ERVQ: Enhanced residual vector quantization with intra-and-inter-codebook optimization for neural audio codecs,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 2539–2550, 2025.
Xiao-Hang Jiang, Yang Ai*, Rui-Chen Zheng, and Zhen-Hua Ling, “A streamable neural audio codec with residual scalar-vector quantization for real-time communication,” IEEE Signal Processing Letters, vol. 32, pp. 1645–1649, 2025.
Fei Liu, Yang Ai*, and Zhen-Hua Ling, “Token-prediction-based post-processing for low-bitrate speech coding,” IEEE Signal Processing Letters, vol. 32, pp. 3235–3239, 2025.
Hui-Peng Du, Yang Ai*, Rui-Chen Zheng, Ye-Xin Lu, and Zhen-Hua Ling, “Is GAN necessary for mel-spectrogram-based neural vocoder,” IEEE Signal Processing Letters, vol. 32, pp. 3485–3489, 2025.
Rui-Chen Zheng, Yang Ai*, Zhen-Hua Ling, “Speech reconstruction from silent lip and tongue articulation by diffusion models and text-guided pseudo target generation,” in Proc. ACM MM, 2024, pp. 6559-6568.
Ye-Xin Lu, Yang Ai*, Zheng-Yan Sheng, and Zhen-Hua Ling, “Multi-stage speech bandwidth extension with flexible sampling rates control,” in Proc. Interspeech, 2024, pp. 2270-2274.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “BiVocoder: A bidirectional neural vocoder integrating feature extraction and waveform generation,” in Proc. Interspeech, 2024, pp. 3894-3898.
Fei Liu, Yang Ai*, Hui-Peng Du, Ye-Xin Lu, Rui-Chen Zheng, and Zhen-Hua Ling, “Stage-wise and prior-aware neural speech phase prediction,” in Proc. SLT, 2024, pp. 648-654.
Xiao-Hang Jiang, Yang Ai*, Rui-Chen Zheng, Hui-Peng Du, Ye-Xin Lu, and Zhen-Hua Ling, “MDCTCodec: A lightweight MDCT-based neural audio codec towards high sampling rate and low bitrate scenarios,” in Proc. SLT, 2024, pp. 550-557.
Yu-Fei Shi, Yang Ai*, Ye-Xin Lu, Hui-Peng Du, and Zhen-Hua Ling, “Pitch-and-spectrum-aware singing quality assessment with bias correction and model fusion,” in Proc. SLT, 2024, pp. 821-827.
Hui-Peng Du, Yang Ai*, Rui-Chen Zheng, and Zhen-Hua Ling, “APCodec+: A spectrum-coding-based high-fidelity and high-compression-rate neural audio codec with staged training paradigm,” in Proc. ISCSLP, 2024, pp. 676-680.
Yu-Fei Shi, Ye-Xin Lu, Yang Ai*, Hui-Peng Du, and Zhen-Hua Ling, “SAMOS: A neural MOS prediction model leveraging semantic representations and acoustic features,” in Proc. ISCSLP, 2024, pp. 199-203.
Xiao-Hang Jiang, Hui-Peng Du, Yang Ai*, Ye-Xin Lu, and Zhen-Hua Ling, “ESTVocoder: An excitation-spectral-transformed neural vocoder conditioned on mel spectrogram,” in Proc. NCMMSC, 2024, pp. 114-128.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “A neural denoising vocoder for clean waveform generation from noisy mel-spectrogram based on amplitude and phase predictions,” in Proc. NCMMSC, 2024, pp. 144-152.
Rui-Chen Zheng, Yang Ai*, and Zhen-Hua Ling, “Speech reconstruction from silent tongue and lip articulation by pseudo target generation and domain adversarial training,” in Proc. ICASSP, 2023, pp. 1-5.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “MP-SENet: A speech enhancement model with parallel denoising of magnitude and phase spectra,” in Proc. Interspeech, 2023, pp. 3834-3838.
Hui-Peng Du, Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “APNet2: High-quality and high-efficiency neural vocoder with direct prediction of amplitude and phase spectra,” in Proc. NCMMSC, 2023, pp. 66-80.
Ye-Xin Lu, Yang Ai*, and Zhen-Hua Ling, “Source-filter-based generative adversarial neural vocoder for high fidelity speech synthesis,” in Proc. NCMMSC, 2022, pp. 68-80.
Ye-Xin Lu, Hui-Peng Du, Zheng-Yan Sheng, Yang Ai*, Zhen-Hua Ling, “Incremental disentanglement for environment-aware zero-shot text-to-speech synthesis,” in Proc. ICASSP, 2025, pp. 1-5.
Yu Guan, Yang Ai*, Zuoliang Li, Shengyu Peng, Wu Guo, “Recursive feature learning from pre-trained models for spoofing speech detection,” in Proc. ICASSP, 2025, pp. 1-5.
Shengyu Peng, Wu Guo, Jie Zhang, Zuoliang Li, Yu Guan, Bin Gu, Yang Ai*, “A study of multi-scale feature learning from pre-trained models on speaker verification,” in Proc. ICASSP, 2025, pp. 1-5.
Zuoliang Li, Yang Ai*, Jie Zhang, Shengyu Peng, Yu Guan, Bin Gu, Wu Guo, “Aligning noisy-clean speech pairs at feature and embedding levels for learning noise-invariant speaker representations,” in Proc. ICASSP, 2025, pp. 1-5.
Yao Guo, Yang Ai*, Rui-Chen Zheng, Hui-Peng Du, Xiao-Hang Jiang, Zhen-Hua Ling, “Vision-integrated high-quality neural speech coding,” in Proc. Interspeech, 2025, pp. 619-623.
Ye-Xin Lu, Hui-Peng Du, Fei Liu, Yang Ai*, Zhen-Hua Ling, “Improving noise robustness of LLM-based zero-shot TTS via discrete acoustic token denoising,” in Proc. Interspeech, 2025, pp. 2465-2469.
Yu-Fei Shi, Yang Ai*, Zhen-Hua Ling, “Universal preference-score-based pairwise speech quality assessment,” in Proc. Interspeech, 2025, pp. 2345-2349.
Fei Liu, Yang Ai*, Zhen-Hua Ling, “Neural speech separation with parallel amplitude and phase spectrum estimation,” in Proc. APSIPA ASC, 2025, pp. 642-647.
Hui-Peng Du, Yang Ai*, Zhen-Hua Ling, “A distilled low-latency neural vocoder with explicit amplitude and phase prediction,” in Proc. APSIPA ASC, 2025, pp. 483-488.
En-Wei Zhang, Hui-Peng Du, Xiao-Hang Jiang, Yang Ai*, Zhen-Hua Ling, “A high-quality and low-complexity streamable neural speech codec with knowledge distillation,” in Proc. APSIPA ASC, 2025, pp. 676-681.

Other Papers

Rui-Chen Zheng, Yang Ai, and Zhen-Hua Ling, “Incorporating ultrasound tongue images for audio-visual speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1430–1444, 2024.
Shi-Ming Wang, Yang Ai, Li-Ping Chen, Ya-Jun Hu, and Zhen-Hua Ling, “TEAR: A Cross-modal Pre-trained Text Encoder Enhanced by Acoustic Representations for Speech Synthesis,” ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 24, no. 3, pp. 1–15, 2025.
Shi-Ming Wang, Li-Ping Chen, Yang Ai, Ya-Jun Hu, and Zhen-Hua Ling, “PhonemeVec: A Phoneme-level contextual prosody representation for speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 1117–1128, 2025.
Zheng-Yan Sheng, Li-Juan Liu, Yang Ai, Jia Pan, and Zhen-Hua Ling, “Voice attribute editing with text prompt,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 33, pp. 1641–1652, 2025.
Kang-Di Mei, Zhao-Ci Liu, Hui-Peng Du, Heng-Yu Li, Yang Ai, Li-Ping Chen, Zhen-Hua Ling, “Considering temporal connection between turns for conversational speech synthesis,” in Proc. ICASSP, 2024, pp. 11426-11430.
Heng-Yu Li, Kang-Di Mei, Zhao-Ci Liu, Yang Ai, Li-Ping Chen, Zhen-Hua Ling, “Refining self-supervised learnt speech representation using brain activations,” in Proc. Interspeech, 2024, pp. 1480-1484.
Yuan Jiang, Shun Bao, Ya-Jun Hu, Li-Juan Liu, Guo-Ping Hu, Yang Ai, and Zhen-Hua Ling, “Online speaker adaptation for WaveNet-based neural vocoders,” in Proc. ICDSP, 2024, pp. 112-117.
Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, and Zhen-Hua Ling, “Face-driven zero-shot voice conversion with memory-based face-voice alignment,” in Proc. ACM MM, 2023, pp. 8443-8452.
Rui-Chen Zheng, Yang Ai, and Zhen-Hua Ling, “Incorporating ultrasound tongue images for audio-visual speech enhancement through knowledge distillation,” in Proc. Interspeech, 2023, pp. 844-848.
Zheng-Yan Sheng, Yang Ai, and Zhen-Hua Ling, “Zero-shot personalized lip-to-speech synthesis with face image based voice control,” in Proc. ICASSP, 2023, pp. 1-5.
Hao-Chen Wu, Zhu-Hai Li, Lu-Zhen Xu, Zhen-Tao Zhang, Wen-Ting Zhao, Bin Gu, Yang Ai, Ye-Xin Lu, Jie Zhang, Zhen-Hua Ling and Wu Guo, “The USTC-NERCSLIP system for the track 1.2 of audio deepfake detection (ADD 2023) challenge,” in Proc. IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis, 2023, pp. 119-124.
Hao-Jian Lin, Yang Ai, and Zhen-Hua Ling, “A light CNN with split batch normalization for spoofed speech detection using data augmentation,” in Proc. APSIPA ASC, 2022, pp. 1684-1689.
Han-Jie Guo, Hui-Peng Du, Zheng-Yan Sheng, Li-Ping Chen, Yang Ai, Zhen-Hua Ling, “CASC-XVC: Zero-shot cross-lingual voice conversion with content accordant and speaker contrastive losses,” accepted by ICASSP, 2025.
Yin-Long Liu, Rui Feng, Ye-Xin Lu, Jia-Xin Chen, Yang Ai, Jia-Hong Yuan, Zhen-Hua Ling, “Can automated speech recognition errors provide valuable clues for Alzheimer’s disease detection?,” accepted by ICASSP, 2025.

Before 2022

Paper List

Yang Ai and Zhen-Hua Ling, “A neural vocoder with hierarchical generation of amplitude and phase spectra for statistical parametric speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 839–851, 2020.
Yang Ai, Hong-Chuan Wu, and Zhen-Hua Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in Proc. ICASSP, 2018, pp. 5659-5663.
Yang Ai, Jing-Xuan Zhang, Liang Chen, and Zhen-Hua Ling, “DNN-based spectral enhancement for neural waveform generators with low-bit quantization,” in Proc. ICASSP, 2019, pp. 7025-7029.
Yang Ai and Zhen-Hua Ling, “Knowledge-and-data-driven amplitude spectrum prediction for hierarchical neural vocoders,” in Proc. Interspeech, 2020, pp. 190-194.
Yang Ai, Xin Wang, Junichi Yamagishi and Zhen-Hua Ling, “Reverberation modeling for source-filter-based neural vocoder,” in Proc. Interspeech, 2020, pp. 3560-3564.
Yang Ai, Hao-Yu Li, Xin Wang, Junichi Yamagishi and Zhen-Hua Ling, “Denoising-and-dereverberation hierarchical neural vocoder for robust waveform generation,” in Proc. SLT, 2021, pp. 477-484.
Zhen-Hua Ling, Yang Ai, Yu Gu, and Li-Rong Dai, “Waveform modeling and generation using hierarchical recurrent neural networks for speech bandwidth extension,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 5, pp. 883–894, 2018.
Yuan Jiang, Ya-Jun Hu, Li-Juan Liu, Hong-Chuan Wu, Zhi-Kun Wang, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “The USTC system for blizzard challenge 2019,” in Blizzard Challenge Workshop, 2019.
Yuan-Hao Yi, Yang Ai, Zhen-Hua Ling, and Li-Rong Dai, “Singing voice synthesis using deep autoregressive neural networks for acoustic modeling,” in Proc. Interspeech, 2019, pp. 2593–2597.
Qiu-Chen Huang, Yang Ai, and Zhen-Hua Ling, “Online speaker adaptation for WaveNet-based neural vocoders,” in Proc. APSIPA, 2020, pp. 815-820.
Hao-Yu Li, Yang Ai, and Junichi Yamagishi, “Enhancing low-quality voice recordings using disentangled channel factor and neural waveform model,” in Proc. SLT, 2021, pp. 2452-2456.
Chang Liu, Yang Ai, and Zhen-Hua Ling, “Phase spectrum recovery for enhancing low-quality speech captured by laser microphones,” in Proc. ISCSLP, 2021, pp. 1-5.
Kun Shao, Jun-An Yang, Yang Ai, Hui Liu and Yu Zhang, “BDDR: An effective defense against textual backdoor attacks,” Computers & Security, vol. 110, pp. 102433, 2021.

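Patents

Granted Patents

Yang Ai; Zhen-Hua Ling; Neural vocoder training method based on short-time spectral consistency, 2024-03-29, China, ZL 2020 1 1482467.6

Patent Applications Accepted

Yang Ai; Zhen-Hua Ling; Method for predicting phase with a parallel estimation architecture network trained using anti-wrapping losses, 2022-11-25, China, 202211489291.6
Yang Ai; Zhen-Hua Ling; A vocoder construction method, speech synthesis method, and related apparatus, 2023-1-16, China, 202310081092.X
Yang Ai; Ye-Xin Lu; Zhen-Hua Ling; A long-frame-shift speech phase spectrum prediction method and apparatus, 2023-6-19, China, 202310737506.X
Yang Ai; Zheng-Yan Sheng; Rui-Chen Zheng; Ye-Xin Lu; Xiao-Hang Jiang; Zhen-Hua Ling; A speech communication system and method, 2023-11-13, China, 202311498981.2
Yang Ai; Xiao-Hang Jiang; Rui-Chen Zheng; Ye-Xin Lu; Zhen-Hua Ling; Audio processing method, apparatus, storage medium, and electronic device, 2024-04-11, China, 202410438079X
Ye-Xin Lu; Yang Ai; Zhen-Hua Ling; Speech enhancement method and apparatus, 2023-5-17, China, 2023105730480
Ye-Xin Lu; Yang Ai; Hui-Peng Du; Zhen-Hua Ling; A speech waveform extension method, apparatus, device, and storage medium, 2024-1-10, 2024100399941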

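Awards and Honors

2024: Selected as a “Xiaomi Young Scholar”
2024: Champion, high-sampling-rate vocoder track, Interspeech 2024 Discrete Speech Challenge (first contributor)
2024: Champion, Track 2, VoiceMOS Challenge (advisor and corresponding author of the paper)
2024: Best paper candidate, IEEE Spoken Language Technology Workshop 2024 (corresponding author of the paper)
2023: Best Paper Award, 18th National Conference on Man-Machine Speech Communication (advisor and corresponding author of the paper)
2023: Champion, Track 1.2, Audio Deepfake Detection Challenge
2022: Second Prize, Industry-University-Research Collaboration Innovation Achievement Award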

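Education

[1] 2016.9 -- 2021.6
University of Science and Technology of China, Information and Communication Engineering, postgraduate, Ph.D.

[2] 2012.9 -- 2016.6
Xiamen University, Communication Engineering, undergraduate, Bachelor's degree

Work Experience

[1] 2024.1 -- present
University of Science and Technology of China, School of Information Science and Technology, Specially Appointed Associate Research Fellow

[2] 2022.4 -- 2023.12
University of Science and Technology of China, School of Information Science and Technology, Postdoctoral Researcher

[3] 2021.7 -- 2022.3
National University of Defense Technology, College of Electronic Countermeasures, Lecturer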