Yunxin Li

😎 About Me (李云鑫)

I am an Associate Professor at Harbin Institute of Technology, Shenzhen. I earned my PhD from Harbin Institute of Technology, Shenzhen, advised by Prof. Baotian Hu, Prof. Yuxin Ding, and Prof. Min Zhang. I obtained a Master of Engineering degree from Harbin Institute of Technology, Shenzhen and a Bachelor of Science degree from Harbin Institute of Technology. I maintain long-term collaborations with Dr. Lin Ma (Meituan), Prof. Wenhan Luo (HKUST), Dr. Longyue Wang (Alibaba International Group), and Dr. Yuxiang Wu (Weco AI).

The long-term goal of my research is to help humans with more capable artificial intelligence. I dream of building an intelligent metaverse, and my research interests include:

Omnimodal Reasoning and Planning
Omnimodal Large Model
Omnimodal Tool Use
Omnimodal Agent

I am actively seeking collaborators (undergraduate, master's, and PhD students) who share my interest in developing large multimodal reasoning models to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.

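Prof. Min Zhang's natural-language-centric Lizhi (立知) large model team at HIT Shenzhen recruits master's and PhD students in large models, as well as undergraduate, master's, and PhD interns, on a long-term basis. The team offers joint training through HIT Shenzhen, the Shenzhen Hetao Institute of Artificial Intelligence, Pengcheng National Laboratory, Soochow University, and other universities and laboratories, with a focus on doing valuable research. Applications are warmly welcome!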
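Tips: The Shenzhen Hetao Institute is recruiting PhD students and RAs, co-advised by Prof. Min Zhang and Prof. Baotian Hu (HIT Shenzhen) and Prof. Wenhan Luo (HKUST). 2026 admission schedule: spring camp in late March/early April, summer camp in June, autumn camp in mid-October.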

📧Email: liyx@hit.edu.cn

🔥 News

2026.07: ✨ Our multimodal and speech team’s full-duplex spoken language model wins the ACL 2026 Outstanding Paper Award 🏆.
2026.05: ✨ Our team has seven papers accepted by ACL 2026, covering multimodal spoken large language models, tool use, and AIGC/GUI agents.
2026.01: ✨ Our work on vision-enhancing LLMs, which lets LLMs "open their eyes" to learn language, is accepted by IEEE TIP
2025.11: ✨ The omnimodal large model Uni-MoE-2.0-Omni (five checkpoints) is open-sourced
2025.10: ✨ Our unified speech and music generative model Uni-MoE-Audio is open-sourced
2025.10: ✨ One long paper about the Temporal RAG Benchmark is accepted by Nature Scientific Data
2025.08: ✨ One long paper about Temporal RAG is accepted by CIKM 2025
2025.08: ✨ The long video generation work Animaker is accepted by ACM SIGGRAPH Asia 2025
2025.05: ✨ The unified multimodal LLM Uni-MoE is accepted by IEEE TPAMI 2025
2025.05: ✨ VideoVista-CulturalLingo is accepted by ACL 2025 Main Conference
2025.01: ✨ Pioneering GUI model UI-TARS is open-sourced
2024.11: ✨ Anim-Director is accepted by ACM SIGGRAPH Asia 2024
2024.05: ✨ Cognitive Visual-Knowledge Alignment work is accepted by ACL 2024 Main Conference
2024.04: ✨ VisionGraph is accepted by ICML 2024 Main Conference
2024.04: ✨ The multimodal LLM LMEye is accepted by IEEE TMM 2024
2024.02: ✨ Our multimodal e-commerce model is accepted by LREC-COLING 2024
2023.08: ✨ Multimodal Event Extraction work is accepted by ACM MM 2023
2023.05: ✨ Two multimodal reasoning papers are accepted by ACL 2023 Main Conference
2022.08: ✨ Chunk-aware reasoning work is accepted by ACM MM 2022
2022.05: ✨ Pivotal information recalling-based medical dialogue generation is accepted by SIGKDD
2022.03: ✨ Deep Spatial & Contextual Information network is accepted by IEEE TMM 2023

📕 Selected Publications

[IEEE TIP 2026] Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs
Yunxin Li, Zhenyu Liu, Baotian Hu, Wei Wang, Yuxin Ding, Xiaochun Cao, Min Zhang
[pdf] [code]
[Technical Report, ArXiv 2025] Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data
Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang
[pdf] [web] [code]
[Survey, ArXiv 2025] Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
[pdf] [web] [huggingface]
[IEEE TPAMI 2025] Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts
Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang
[pdf] [web] [code]
[ACL 2025] VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Yunxin Li, Xinyu Chen, Baotian Hu, Longyue Wang, Haoyuan Shi, Min Zhang
[pdf] [web] [code]
[SIGGRAPH Asia 2024] Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
[pdf] [code]
[ACL 2024] Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang
[pdf] [code]
[ICML 2024] VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context
Yunxin Li, Baotian Hu, Haoyuan Shi, Wei Wang, Longyue Wang, Min Zhang
[pdf] [code]
[IEEE TMM 2024] LMEye: An Interactive Perception Network for Large Language Models
Yunxin Li, Baotian Hu, Xinyu Chen, Lin Ma, Yong Xu, Min Zhang
[pdf] [code]
[LREC-COLING 2024] A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
Yunxin Li, Baotian Hu, Wenhan Luo, Lin Ma, Yuxin Ding, Min Zhang
[pdf] [code]
[ACM MM 2023] Training Multimedia Event Extraction With Generated Images and Captions
Zilin Du, Yunxin Li, Xu Guo, Yidan Sun, Boyang Li
[pdf] [code]
[ACL 2023] A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
Yunxin Li, Baotian Hu, Yuxin Ding, Lin Ma, Min Zhang
[pdf] [code]
[ACL 2023] A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Yunxin Li, Baotian Hu, Xinyu Chen, Yuxin Ding, Lin Ma, Min Zhang
[pdf] [code]
[ACM MM 2022] Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations
Qian Yang#, Yunxin Li#, Baotian Hu, Lin Ma, Yuxin Ding, Min Zhang
[pdf] [code]
[SIGKDD 2022] Medical Dialogue Response Generation with Pivotal Information Recalling
Yu Zhao#, Yunxin Li#, Yuxiang Wu, Baotian Hu, Qingcai Chen, Xiaolong Wang, Yuxin Ding, Min Zhang
[pdf] [code]
[IEEE TMM 2022] Fast and Robust Online Handwritten Chinese Character Recognition with Deep Spatial & Contextual Information Fusion Network
Yunxin Li, Qian Yang, Qingcai Chen, Baotian Hu, Xiaolong Wang, Yuxin Ding, Lin Ma
[pdf]

🏅 Awards

Provincial Outstanding Graduates, 2019
National Scholarship (2018, 2021, 2024)
Baidu Scholarship (Global Top 40), 2024
Huawei TopMinds (华为天才少年), 2025
JD TGT (京东顶尖技术人才计划), 2025
Tencent Qingyun (腾讯青云人才计划), 2025
Young Talent Support Project-Doctor (首届中国科协青托博士生计划), CAST, 2024
Outstanding Doctoral Dissertation of HIT (哈工大优博), 2025

💼 Research Experience

HKUST Research Assistant (2025.03 - 2025.08)
ByteDance Doubao (Seed) Team (2024.10 - 2025.02)
Tencent AILab (2024.04 - 2024.08)
Tencent PCG (2021.10 - 2022.06)

💁 Service

Conference Reviewer: ACL ARR (2023-), ICLR (2023-), NeurIPS (2024-), ICML (2024-), AAAI (2024-), ACM SIGGRAPH (2025-), CVPR (2025-), ACM MM (2023-), and IJCAI (2023-).
Journal Reviewer: ACM Computing Surveys, IEEE TPAMI, IEEE TIP, IEEE TMM, IEEE TNNLS, IEEE TCSVT, IEEE TAI, and Neural Networks.