Enhancing Visual Reasoning in Vision-Language Models (VLMs) through Dynamic Visual Feature Selection (M/F)

CEA | 27 Oct 2024


Missions of the position

Generative Vision-Language Models (VLMs) combine the understanding and generation of text in visual contexts. These models have demonstrated impressive performance on real-world visual question answering (VQA) benchmarks, suggesting genuine visual reasoning abilities. However, these benchmarks often mix pure visual reasoning tasks with tests of world knowledge, and typically involve questions requiring only a limited number of reasoning steps [2]. As a result, it is unclear whether a VLM's apparent success on visual reasoning tasks truly reflects its reasoning capabilities or simply its ability to leverage extensive world knowledge. Moreover, VLMs often struggle with fine-grained scene understanding and spatial reasoning, largely due to inefficient use of visual features [5].

This internship aims to tackle these limitations by developing a novel approach for VLMs, particularly those trained through instruction-tuning methods such as LLaVA [1]. In this architecture, visual features produced by a Vision Transformer model [3] are projected into the text embedding space before being fed to a large language model (LLM) for text generation.
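
For illustration only, the following PyTorch-style sketch shows this kind of architecture in its simplest form; the module names (vision_encoder, language_model, projector) and the dimensions are hypothetical placeholders, not the internship's actual codebase.

import torch
import torch.nn as nn

class LlavaStyleVLM(nn.Module):
    """Simplified sketch: patch features from a vision encoder are projected
    into the LLM's token-embedding space and prepended to the text embeddings."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. a ViT returning patch features
        self.language_model = language_model              # a causal LLM operating on embeddings
        self.projector = nn.Linear(vision_dim, text_dim)  # visual -> text embedding space

    def forward(self, image, text_embeddings):
        visual_features = self.vision_encoder(image)      # (batch, num_patches, vision_dim)
        visual_tokens = self.projector(visual_features)   # (batch, num_patches, text_dim)
        # Prepend visual tokens to the text sequence; the LLM then generates text
        # conditioned on both modalities (assuming it accepts embeddings directly).
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs)
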
We propose leveraging the Chain-of-Thought (CoT) technique [4] to iteratively select the most relevant visual features during text generation. CoT generates step-by-step reasoning that breaks complex tasks down into simpler logical steps, which improves model performance on tasks requiring complex reasoning. In our approach, we will first link each reasoning step in the textual chain to specific visual features within the image, providing a visual justification for that step. The model will then learn to select and process the relevant visual features directly, without relying on explicit textual reasoning steps, allowing for a more intuitive and efficient understanding of the visual context.
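
As a purely illustrative assumption of what per-step selection could look like, one simple heuristic is to score each visual token against the embedding of the current reasoning step and keep only the top-k most relevant ones. The sketch below implements that dot-product/top-k heuristic to make the idea concrete; it is not the method the intern will be asked to implement.

import torch
import torch.nn.functional as F

def select_visual_features(step_embedding, visual_tokens, k=16):
    """Keep the k visual tokens most relevant to the current reasoning step.

    step_embedding: (batch, dim)     embedding of the current CoT step
    visual_tokens:  (batch, n, dim)  projected patch features
    """
    # Relevance of each patch to the current step: (batch, n)
    scores = torch.einsum("bd,bnd->bn", step_embedding, visual_tokens)
    weights = F.softmax(scores, dim=-1)
    # Indices of the k most relevant patches per example: (batch, k)
    topk = weights.topk(k, dim=-1).indices
    index = topk.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return torch.gather(visual_tokens, 1, index)          # (batch, k, dim)

# Illustrative shapes only: 576 patches projected into a 4096-d embedding space
step = torch.randn(2, 4096)
patches = torch.randn(2, 576, 4096)
selected = select_visual_features(step, patches, k=16)    # -> (2, 16, 4096)
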
[1] Liu, H., Li, C., Wu, Q., & Lee, Y.J. (2023). Visual Instruction Tuning. ArXiv, abs/2304.08485
[2] Zhang, Y., Bai, H., Zhang, R., Gu, J., Zhai, S., Susskind, J., & Jaitly, N. (2024). How far are we from intelligent visual deductive reasoning? In COLM
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv, abs/2010.11929
[4] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E.H., Xia, F., Le, Q., & Zhou, D. (2022). Chain of Thought Prompting Elicits Reasoning in Large Language Models. ArXiv, abs/2201.11903
[5] Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F., & Sun, M. (2024). Exploring Perceptual Limitation of Multimodal Large Language Models. ArXiv, abs/2402.07384

Desired profile

- Students in their 5th year of studies (M2 or gap year)
- Computer vision skills
- Machine learning skills (deep learning, perception models, generative AI...)
- Proficiency in Python and a deep learning framework (especially TensorFlow or PyTorch)
- Scientific research experience will be appreciated

In line with CEA's commitment to integrating people with disabilities, this job is open to all.

Welcome to CEA

The CEA is a major player in research, serving citizens, the economy and the State.

It provides concrete solutions to their needs in four main areas: the energy transition, the digital transition, technologies for the medicine of the future, and defence and security, all built on a foundation of fundamental research. For more than 75 years, the CEA has been committed to the scientific, technological and industrial sovereignty of France and Europe, for a better-controlled and safer present and future.

Located at the heart of regions equipped with very large research infrastructures, the CEA has a broad range of academic and industrial partners in France, in Europe and internationally.

The CEA's 20,000 employees share three core values:

- A sense of responsibility
- Cooperation
- Curiosity
