
My personal homepage: http://faculty.ustc.edu.cn/xiaojunbin/zh_CN/index.htm
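Junbin Xiao (肖俊斌) is a Specially Appointed Professor (特任教授) and Ph.D. supervisor at the School of Information Science and Technology, University of Science and Technology of China (USTC), and a recipient of the national Excellent Young Scientists Fund (Overseas). He received his Ph.D. in 2023 from the Department of Computer Science, National University of Singapore (NExT++ Lab), and stayed on there as a postdoctoral researcher until February 2026. He was advised by Prof. Chua Tat-Seng (蔡达成), a renowned scholar in artificial intelligence and multimedia, and worked closely with Prof. Angela Yao. During his Ph.D. he interned at Sea AI Lab in Singapore under the supervision of Prof. Shuicheng Yan (颜水成). Before that, he received his master's degree from the Institute of Computing Technology, Chinese Academy of Sciences (admitted by recommendation, without entrance exam) and his bachelor's degree from Sichuan University. His research focuses on cross-modal video understanding and question-answering assistant systems. In recent years he has published more than 40 papers at top international conferences and journals in the field (such as CVPR, ICCV, ECCV, NeurIPS, ICML, TPAMI, and IJCV), more than 20 of them as first or corresponding author. His papers have been selected for Oral (AAAI'22), Spotlight (ECCV'22), and Highlight (CVPR'24) presentations at top conferences, and his honors include a CVPR'22 Best Paper candidate. His work has been cited and adopted by leading academic institutions (such as Stanford, UC Berkeley, MIT, CMU, and Oxford) and industry labs (Google, DeepMind, Microsoft, Meta AI). He currently has more than 3,000 citations on Google Scholar and an h-index of 21; those citing his work include more than 60 academicians or IEEE/ACM/AAAI Fellows, among them Fei-Fei Li of Stanford. He has been invited to serve as an Area Chair for CVPR'26 and is a long-standing reviewer for international conferences and journals in artificial intelligence and related fields.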
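Research: I work on AI technology that understands the physical world and interacts, communicates, and collaborates with humans in order to provide personalized assistance. Toward this goal, my work centers on multimodal video intelligence and covers both fundamental research on multimodal large models (reliability, feasibility, efficiency) and applied research (egocentric embodied interaction, long and streaming video understanding, multimodal intelligence for drones, and so on). Specifically, the research focuses on "cross-modal video understanding and question answering", a core task of multimodal intelligence. On the one hand, video question answering serves as a task format for studying cross-modal video understanding (including temporal and spatial intelligence) and multimodal intelligence techniques. On the other hand, video question answering is studied as a task in its own right, aimed at practical embodied assistance: egocentric visual perception, user intention reasoning, personalization, and the timeliness, efficiency, trustworthiness, and feasibility of assistance in streaming visual environments. The aim is to move video question answering, within the multimodal large model framework, from "offline snapshot-style" visual understanding toward "online, always-accompanying embodied assistance". I am currently working to apply this research to wearable and portable intelligent assistive devices (such as smart glasses), drone inspection, and the understanding of live-streamed video such as gaming and sports.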
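Admissions: Students are admitted through Department 6 (6系) of the School of Information Science and Technology at USTC. There is usually 1 Ph.D. slot per year, plus 3 master's slots for recommended (exam-exempt) students and 1 master's slot for students taking the national entrance exam. Students interested in multimodal artificial intelligence, embodied interactive intelligence, video understanding and analysis, and related directions are welcome to get in touch. I look for students who have ideas, ability, and the right attitude, who are ambitious and have good taste in research, and who keep "Thinking beyond papers". Basic requirements: a major in computer science, information science, artificial intelligence, automation, or electronic information; a high academic ranking (eligible for recommended admission); CET-6 or equivalent English proficiency; and well-rounded development in character, intellect, and physical fitness. Applicants with research training or experience in information-related competitions are given priority. Students must pass the unified assessment of the Future Media Computing Lab (未来媒体计算实验室) before final admission.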
Publications (see the Google Scholar profile for details):
[1] Wang, Y., Xiao, J., Lyu, H., Wang, Y., Zuo, J., Zhang, Z., Huang, H., Wu, D., and Yao, A. Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams. International Conference on Machine Learning (ICML), 2026.
[2] Xiao, J., Zhang, S., Zhu, P., and Yao, A. Ego-grounding for personalized question-answering in egocentric videos. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[3] Xiao, J., Chen, J., Sun, T., Yang, X., and Yao, A. MuKV: Multi-grained KV cache compression for long streaming video question-answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[4] Li, L.-L., Fang, J., Xiao, J., Yu, H., Lv, C., Xue, J., Li, Z., and Chua, T.-S. ADVersa: Abductive driving accident video understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026.
[5] Jung, M., Xiao, J., Zhang, B.-T., and Yao, A. On the consistency of video large language models in temporal comprehension. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[6] Zhou, S., Xiao, J., Li, Q., Li, Y., Yang, X., Guo, D., Wang, M., Chua, T.-S., and Yao, A. EgoTextVQA: Towards egocentric scene-text aware video question answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[7] Sun, P., Xiao, J., Tse, T. H. E., Li, Y., Akula, A., and Yao, A. Visual intention grounding for egocentric assistants. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[8] Li, L., Fang, J., Xiao, J., Pang, S., Yu, H., Lv, C., Xue, J., and Chua, T.-S. Causal-entity reflected egocentric traffic accident video synthesis. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[9] Xiao, J., Li, Q., Yang, Y., Qiu, L., and Yao, A. Unleashing the power of large language models for medical video answer localization. Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025.
[10] Zhao, Y., Zhang, R., Xiao, J., Ke, C., Hou, R., Hao, Y., and Li, L. Towards analyzing and mitigating sycophancy in large vision-language models. Neurocomputing, 2025.
[11] Li, Y., Chen, Y., Ma, Z., Xiao, J., Wang, X., and Yao, A. Intermediate connectors and geometric priors for language-guided affordance segmentation. IEEE/CVF International Conference on Computer Vision (ICCV), 2025.
[12] Xiao, J., Huang, N., Qin, H., Li, D., Li, Y., Zhu, F., Tao, Z., Yu, J., Lin, L., and Chua, T.-S. VideoQA in the era of LLMs: An empirical study. International Journal of Computer Vision (IJCV), 2025.
[13] Qin, H., Xiao, J., and Yao, A. Question-answering dense video events. International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 884–894, 2025.
[14] Zhou, S., Xiao, J., Yang, X., Song, P., Guo, D., Yao, A., Wang, M., and Chua, T.-S. Scene-text grounding for text-based video question answering. IEEE Transactions on Multimedia (TMM), 2025.
[15] Wang, J., Ma, Z., Cao, D., Le, Y., Xiao, J., and Chua, T.-S. Deconfounded multimodal learning for spatio-temporal video grounding. ACM International Conference on Multimedia (ACM MM), pp. 7521–7529, 2023.
[16] Xiao, J., Yao, A., Li, Y., and Chua, T.-S. Can I trust your answer? Visually grounded video question answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13204–13214, 2024.
[17] Fang, J., Li, L., Zhou, J., Xiao, J., Yu, H., Lv, C., Xue, J., and Chua, T.-S. Abductive ego-view accident video understanding for safe driving perception. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22030–22040, 2024.
[18] Li, Y., Zhao, N., Xiao, J., Feng, C., Wang, X., and Chua, T.-S. LASO: Language-guided affordance segmentation on 3D object. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14251–14260, 2024.
[19] Nguyen, T., Bin, Y., Xiao, J., Qu, L., Li, Y., Wu, J. Z., Nguyen, C.-D., Ng, S.-K., and Tuan, L. Video-language understanding: A survey from model architecture, training, and data perspectives. Annual Meeting of the Association for Computational Linguistics (ACL Findings), pp. 3636–3657, 2024.
[20] Wang, J., Cao, D., Lu, S., Ma, Z., Xiao, J., and Chua, T.-S. Causal-driven large language models with faithful reasoning for knowledge question answering. ACM International Conference on Multimedia (ACM MM), pp. 4331–4340, 2024.
[21] Qi, P., Bu, Y., Cao, J., Ji, W., Shui, R., Xiao, J., Wang, D., and Chua, T.-S. FakeSV: A multimodal benchmark for fake news detection on short video platforms. AAAI Conference on Artificial Intelligence (AAAI), pp. 14444–14452, 2023.
[22] Zhu, F., Li, M., Xiao, J., Feng, F., Wang, C., and Chua, T.-S. Soargraph: Numerical reasoning over financial table-text data via semantic-oriented hierarchical graphs. Companion Proceedings of the ACM Web Conference (WWW), pp. 1236–1244, 2023.
[23] Li, Y., Feng, C., Wang, X., Xiao, J., and Chua, T.-S. Discovering spatio-temporal rationales for video question answering. IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13869–13878, 2023.
[24] Li, Y., Wang, X., Ji, W., Xiao, J., and Chua, T.-S. Transformer-empowered invariant grounding for video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 9510–9522, 2023.
[25] Xiao, J., Zhou, P., Yao, A., Li, Y., Hong, R., Yan, S., and Chua, T.-S. Contrastive video question answering via video graph transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), pp. 13265–13280, 2023.
[26] Xiao, J., Yao, A., Liu, Z., Li, Y., Ji, W., and Chua, T.-S. Video as conditional graph hierarchy for multi-granular question answering. AAAI Conference on Artificial Intelligence (AAAI), pp. 2804–2812, 2022.
[27] Zhong, Y., Xiao, J., Ji, W., Li, Y., Deng, W., and Chua, T.-S. Video question answering: Datasets, algorithms and challenges. Empirical Methods in Natural Language Processing (EMNLP), pp. 6439–6455, 2022.
[28] Li, Y., Wang, X., Xiao, J., Ji, W., and Chua, T.-S. Invariant grounding for video question answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2928–2937, 2022.
[29] Li, Y., Wang, X., Xiao, J., and Chua, T.-S. Equivariant and invariant grounding for video question answering. ACM International Conference on Multimedia (ACM MM), pp. 4714–4722, 2022.
[30] Xiao, J., Zhou, P., Chua, T.-S., and Yan, S. Video graph transformer for video question answering. European Conference on Computer Vision (ECCV), 2022.
[31] Xiao, J., Shang, X., Yao, A., and Chua, T.-S. NExT-QA: Next phase of question-answering to explaining temporal actions. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9777–9786, 2021.
[32] Shang, X., Li, Y., Xiao, J., Ji, W., and Chua, T.-S. Video visual relation detection via iterative inference. ACM International Conference on Multimedia (ACM MM), pp. 3654–3663, 2021.
[33] Xiao, J., Shang, X., Yang, X., Tang, S., and Chua, T.-S. Visual relation grounding in videos. European Conference on Computer Vision (ECCV), pp. 447–464, 2020.
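Sichuan University  Computer Science and Technology  Undergraduate  Bachelor's Degree in Engineering
Institute of Computing Technology, Chinese Academy of Sciences  Computer Applied Technology  With Certificate of Graduation for Study as Master's Candidates  Academic Master's Degree
National University of Singapore  Computer Science  With Certificate of Graduation for Doctorate Study  Doctoral Degree in Philosophy
National University of Singapore  Department of Computer Science  Postdoctoral Researcher
University of Science and Technology of China  School of Information Science and Technology  Specially Appointed Professor (Ph.D. supervisor)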