Multimodal Foundation Models
Developing efficient and robust multimodal foundation models that integrate vision, language, speech, and action. Research topics include cross-modal representation learning, alignment, reasoning, and generation, with an emphasis on open-world generalization and knowledge-enhanced modeling for diverse downstream tasks.