The Emergence of Multimodal AI: GPT-4V, LLaVa, DALL-E and What's Next?

By Albert Mao

Jan 25, 2024

This article offers insight into the evolution of multimodal AI, analyzing practical applications of multisensory input, summarizing the development of GPT-4V, LLaVA, and DALL-E capabilities, and outlining research efforts to further enhance MLLM performance.

As large language models demonstrate human-like reasoning on a wide range of tasks, their perception has so far been limited to textual prompts. Meanwhile, enabling neural models to receive multimodal information, including images and audio, not only makes their interface more user-friendly but also significantly expands the scope of potential AI applications.

Multimodal large language models (MLLMs) combine the reasoning abilities of LLMs with the capacity to receive and process visual, audio, and other types of data, making neural networks even more human-like and intelligent. By adding the ability to perceive multisensory information, MLLMs transform AI from task-specific models into intelligent general-purpose assistants.

Use Cases of Multimodal Models

By adding the ability to process and interpret images and other types of data, multimodal language models mimic human cognitive processes, expanding the application of AI in practical settings. For example, MLLMs can significantly expand the functionality of AI-powered support chatbots, allowing users to attach photos and images as prompts when seeking help, requesting information, or asking for guidance.

By adding the ability to receive multimodal information, MLLMs become more well-rounded task solvers, supporting a much wider spectrum of tasks, for example:

  • creating captions based on images,

  • making medical diagnoses based on visual data,

  • developing assistive technologies for impaired individuals,

  • optimizing website UI and writing code,

  • solving math problems based on diagrams, graphs and charts.

Overview of Multimodal Large Language Models

The lineup of available multimodal large language models is constantly developing and expanding. Below, we review three of the most popular models that accept multisensory input: GPT-4V, LLaVA, and DALL-E.


GPT-4V

At the time of publishing this article, GPT-4 with vision, also known as GPT-4V, is still the only MLLM available for commercial use. The model expands beyond textual prompts and is capable of generating output based on images and visual data.

The tests demonstrated in the GPT-4V(ision) System Card revealed that the model can capture complex information in images, including specialized imagery such as diagrams and figures from scientific publications. These capabilities enable multiple applications, for example, creating code for website wireframes based on sketches.

Meanwhile, GPT-4 with vision has its limitations, which include inconsistent interpretation of certain visuals, such as medical imaging, ungrounded inferences about places or people, and other vulnerabilities in model performance. Future development efforts for GPT-4V focus on achieving higher precision in handling image uploads, enhancing image recognition capabilities, refining the approach to handling sensitive information, and mitigating existing vulnerabilities.
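To make the mixed text-and-image interface concrete, the sketch below assembles the request body a GPT-4V call would use through OpenAI's Chat Completions API. The model name and message structure follow the API at the time of writing; the image URL and question are placeholders, not real assets.

```python
# Sketch of a GPT-4V request body: a single user message whose content
# mixes a text part with an image reference. The image URL is a placeholder.
def build_vision_request(question: str, image_url: str) -> dict:
    """Assemble a Chat Completions payload combining text and an image."""
    return {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 300,
    }

payload = build_vision_request(
    "What elements does this website sketch contain?",
    "https://example.com/wireframe.png",  # placeholder image
)
print(payload["model"])
```

With the `openai` client installed and an API key configured, this payload would be passed to `client.chat.completions.create(**payload)`; the dictionary itself is shown here so the structure is visible without a network call.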


LLaVA

Combining a vision encoder with the Vicuna large language model, LLaVA (Large Language-and-Vision Assistant) is an open-source model, publicly available on Hugging Face, which supports collaboration among developers and researchers. The model achieves an 85.1% relative score compared with GPT-4 on a multimodal instruction-following dataset. Meanwhile, combining LLaVA with GPT-4 yields 92.53% accuracy on the ScienceQA benchmark.


Figure 1. LLaVA and LLaVA+GPT-4 performance on the ScienceQA benchmark. Source:
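Because LLaVA checkpoints are hosted on Hugging Face, they can be queried with standard tooling. The sketch below builds the single-turn chat template that LLaVA 1.5 checkpoints expect, with the `<image>` placeholder marking where the vision encoder's tokens are spliced in; the model name in the comment is one of the public `llava-hf` checkpoints and the question is illustrative.

```python
# Sketch of preparing a prompt for a LLaVA 1.5 checkpoint from Hugging Face.
# The <image> token marks where image features enter the token sequence.
def build_llava_prompt(question: str) -> str:
    """Format a single-turn LLaVA prompt in the USER/ASSISTANT template."""
    return f"USER: <image>\n{question} ASSISTANT:"

prompt = build_llava_prompt("What is shown in this diagram?")
print(prompt)

# With the transformers library installed, the prompt and an image would be
# passed to the model roughly like this (not run here):
#   from transformers import pipeline
#   pipe = pipeline("image-to-text", model="llava-hf/llava-1.5-7b-hf")
#   pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 100})
```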


DALL-E

In January 2021, OpenAI introduced its DALL-E model, designed to create images from text descriptions. Focused on image generation, the model could create realistic images combining concepts, attributes, and styles. Although DALL-E, as well as DALL-E 2 and DALL-E 3, are not multimodal language models in the traditional sense, the newer versions allow editing images and generating variations, capabilities that rely on interpreting image-based prompts.
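Text-to-image generation with DALL-E follows a simple request shape. The sketch below assembles the parameters an Images API call would take; the model name and size mirror OpenAI's API at the time of writing, and the prompt is the avocado-armchair example from the original DALL-E announcement.

```python
# Sketch of a DALL-E image-generation request. Size and model values
# are examples of those accepted by the Images API.
def build_image_request(prompt: str, size: str = "1024x1024") -> dict:
    """Assemble the parameters for a text-to-image generation call."""
    return {
        "model": "dall-e-3",
        "prompt": prompt,
        "n": 1,          # number of images to generate
        "size": size,    # output resolution
    }

request = build_image_request("An armchair in the shape of an avocado")
print(request["size"])
```

With the `openai` client configured, these parameters would be passed to `client.images.generate(**request)`, which returns URLs for the generated images.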

Research on Multimodal Models

Yin et al. [2023]

In a frequently cited work by Yin et al. [2023], titled A Survey on Multimodal Large Language Models, a group of researchers from the University of Science and Technology of China (USTC) provide a comprehensive summary of the progress in developing multimodal large language models. The team discusses key approaches, including Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT) and LLM-Aided Visual Reasoning (LAVR). 

Figure 2: The techniques and applications for developing multimodal large language models, described in the summary by Yin et al. [2023]. Source: BradyFu

As one of the first published surveys on MLLMs, the work is an important milestone in mapping the evolution of multimodal AI. The team continues to update the survey, publishing the latest updates on the associated GitHub page.

Li et al. [2023]

Another successful effort to summarize the development of MLLMs was made by Li et al. [2023] in the report titled Multimodal Foundation Models: From Specialists to General-Purpose Assistants. The Microsoft team provides an overview of the taxonomy and evolution of multimodal foundation models, taking a step beyond MLLMs into the realm of developing general-purpose assistants that can follow human intent to complete a wide range of computer vision tasks.

Figure 3: Illustration of the evolution of multimodal foundation models from task-specific pre-trained models based on text. Source: Li et al. [2023].

Cai et al. [2023]

In Cai et al. [2023], a group of researchers addresses the challenges multimodal large language models face in understanding the spatial positioning of objects during visual interpretation. The team introduces a model titled ViP-LLaVA, which provides region-specific image understanding. The model leverages natural-language interactions combined with visual markers to simplify image understanding and enhance visual referencing.

Figure 4. Illustration of the capabilities of ViP-LLaVA, which can overlay various visual prompts, such as arrows and circles, on the original image to create a visual prompt for a large language model. Source: Cai et al. [2023].
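The core idea of overlaying a visual marker on an image can be sketched in a few lines. The example below uses Pillow to draw a red ellipse around a region of interest, producing the kind of marked-up image that ViP-LLaVA-style prompting pairs with a question like "what is inside the red circle?"; the blank canvas stands in for a real photograph, and the function name is illustrative, not part of the paper's code.

```python
from PIL import Image, ImageDraw

# Sketch of the visual-prompting idea: mark a region of interest with
# a simple shape so it can be referred to in natural language.
def overlay_circle(image: Image.Image, box: tuple, width: int = 4) -> Image.Image:
    """Draw a red ellipse around the region (left, top, right, bottom),
    leaving the original image untouched."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    draw.ellipse(box, outline="red", width=width)
    return marked

# A blank white canvas stands in for a real photograph.
base = Image.new("RGB", (256, 256), "white")
prompted = overlay_circle(base, (60, 60, 180, 180))
print(prompted.size)
```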

Implement AI Technology Easier with VectorShift

As large language models develop and expand their abilities to process multisensory input, the scope of multimodal AI applications becomes truly endless. Meanwhile, introducing multimodal AI into your applications can be expedited through no-code functionality and SDK interfaces available with VectorShift.

At VectorShift, we can help you introduce multimodal language models into your processes with an intuitive end-to-end platform. For more information, please don't hesitate to get in touch with our team or request a free demo.

© 2023 VectorShift, Inc. All Rights Reserved.
