MiniGPT-5 unifies image and text generation: the model can continue a story and illustrate it automatically
Machine Heart Report
Machine Heart Editorial Department
OpenAI's GPT-5 still seems far off, but researchers have already pioneered MiniGPT-5, an innovative interleaved vision-and-language generation model. It is a significant step toward generating images accompanied by coherent textual descriptions.
Large models are making breakthroughs in both language and vision, with the potential to seamlessly understand and generate text and image content. In a recent series of studies, multimodal feature integration is not only an evolving trend but has also driven key advances, from multimodal dialogue to content-creation tools. Large language models have demonstrated unparalleled capabilities in text comprehension and generation; however, simultaneously generating images with coherent textual narratives remains an underdeveloped area.
Recently, a research team at the University of California, Santa Cruz proposed MiniGPT-5, an innovative interleaved vision-and-language generation technique based on the concept of the "generative voken".

Paper address:
https://browse.arxiv.org/pdf/2310.02239v1.pdf
Project address:
https://github.com/eric-ai-lab/MiniGPT-5
By bridging the Stable Diffusion mechanism and the LLM through a special visual token, the "generative voken", MiniGPT-5 points to a new paradigm for proficient multimodal generation. The two-stage training method proposed in the paper emphasizes the importance of a description-free foundational stage, allowing the model to thrive even when data is scarce. The generic stage of this method requires no domain-specific annotations, which sets the solution apart from existing methods. To ensure that the generated text and images are harmonious and consistent, the paper's dual-loss strategy comes into play, and the generative-voken approach and classifier-free guidance further reinforce this effect.
Building on these techniques, this work marks a transformative approach. Using ViT (Vision Transformer) and Q-Former together with a large language model, the research team converts multimodal inputs into generative vokens and pairs them seamlessly with the high-resolution Stable Diffusion 2.1 to achieve context-aware image generation. The paper combines images as auxiliary input with instruction-tuning methods, and pioneers the joint use of text- and image-generation losses, thereby expanding the synergy between text and vision.
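The dataflow described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: all function names, dimensions, query-token counts, and the number of vokens here are assumptions.

```python
# Illustrative sketch of the MiniGPT-5 dataflow: image -> ViT/Q-Former ->
# LLM (emits text + voken hidden states) -> mapping module -> Stable Diffusion.
# Every name and dimension below is an assumption for illustration only.

N_VOKENS = 8      # hypothetical number of generative vokens
LLM_DIM = 4096    # hypothetical LLM hidden size
SD_DIM = 1024     # hypothetical diffusion-conditioning size

def vit_qformer_encode(image):
    """Stand-in for ViT + Q-Former: compress an image into LLM-space features."""
    return [[0.0] * LLM_DIM for _ in range(32)]  # 32 query tokens (assumed)

def llm_generate(text_tokens, image_features):
    """Stand-in for the LLM: emits text plus N_VOKENS voken hidden states."""
    text_out = text_tokens + ["<continuation>"]
    voken_states = [[0.0] * LLM_DIM for _ in range(N_VOKENS)]
    return text_out, voken_states

def mapping_module(voken_states):
    """Compact mapper: LLM hidden size -> diffusion conditioning size."""
    return [state[:SD_DIM] for state in voken_states]

def stable_diffusion_decode(condition):
    """Stand-in for Stable Diffusion 2.1 conditioned on the mapped vokens."""
    return {"image": "generated", "condition_len": len(condition)}

# End-to-end pass on dummy inputs.
feats = vit_qformer_encode(image="dummy.jpg")
text, vokens = llm_generate(["Tell", "a", "story"], feats)
cond = mapping_module(vokens)
result = stable_diffusion_decode(cond)
```

The key structural point is that the LLM never emits pixels: it emits voken states that a small mapper projects into the diffusion model's conditioning space.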
Paired with constraints such as CLIP, MiniGPT-5 cleverly integrates the diffusion model with MiniGPT-4, achieving strong multimodal results without relying on domain-specific annotations. Most importantly, the proposed strategy can leverage advances in multimodal vision-language foundation models, offering a new blueprint for enhancing multimodal generation capabilities.
As shown in the figure below, in addition to its original multimodal understanding and text generation capabilities, MiniGPT-5 can also produce reasonable and coherent multimodal output:

The contributions of this paper are threefold:
- It proposes using a multimodal encoder, a novel and general technique shown to be more effective than using the LLM directly with inverted generative vokens, and combines it with Stable Diffusion to generate interleaved vision-and-language outputs (a multimodal language model capable of multimodal generation).
- It introduces a new two-stage training strategy for description-free multimodal generation. The unimodal alignment stage obtains high-quality, text-aligned visual features from a large number of text-image pairs. The multimodal learning stage includes a novel training task, prompt-context generation, ensuring that visual and textual prompts are well coordinated during generation. Adding classifier-free guidance during training further improves generation quality.
- Compared with other multimodal generation models, MiniGPT-5 achieves state-of-the-art performance on the CC3M dataset. MiniGPT-5 also sets new benchmarks on well-known datasets such as VIST and MMDialog.
Next, let's look at the details of the study.
Method Overview
To endow large language models with multimodal generation capabilities, the researchers introduce a structured framework that integrates a pretrained multimodal large language model with a text-to-image generation model. To bridge the gap between the two model domains, they introduce special visual tokens called "generative vokens", which can be trained directly on raw images. In addition, a two-stage training method is proposed, combined with a classifier-free guidance strategy, to further improve generation quality.

Multimodal input stage
Recent progress in multimodal large models (such as MiniGPT-4) has mainly focused on multimodal understanding, processing images as continuous inputs. To extend this capability to multimodal generation, the researchers introduce generative vokens designed specifically for outputting visual features. They also adopt parameter-efficient fine-tuning techniques within the large language model (LLM) framework for multimodal output learning.
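The mechanics of adding such vokens can be sketched with a toy vocabulary. The token strings (`[IMG0]`…) and the count of eight are assumptions for illustration; a real setup would use the LLM's tokenizer and resize its embedding table accordingly.

```python
# Toy vocabulary; a real model would extend the LLM tokenizer's vocabulary
# and resize the embedding matrix. Token names and counts are assumptions.
vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "hello": 3, "world": 4}

def add_generative_vokens(vocab, n_vokens):
    """Append n_vokens special voken tokens to the vocabulary and return
    their new ids. In parameter-efficient fine-tuning, only these new
    embeddings plus a small set of adapter weights would be trained."""
    next_id = max(vocab.values()) + 1
    voken_ids = []
    for i in range(n_vokens):
        vocab[f"[IMG{i}]"] = next_id + i
        voken_ids.append(next_id + i)
    return voken_ids

ids = add_generative_vokens(vocab, n_vokens=8)
```

At generation time, whenever the LLM emits one of these ids, its hidden state at that position is taken as a visual feature rather than decoded as text.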
Multimodal output generation
To accurately align the generated vokens with the generation model, the researchers built a compact mapping module for dimension matching and incorporated several supervised losses, including a text-space loss and a latent diffusion model loss. The text-space loss helps the model learn the correct placement of vokens, while the latent diffusion loss directly aligns the vokens with appropriate visual features. Because the features of the generative vokens are guided directly by images, this method requires no comprehensive image descriptions, achieving description-free learning.
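The dual-loss idea can be sketched numerically. This is a hedged stand-in: the cross-entropy term plays the role of the text-space loss (where to emit a voken), a simple MSE stands in for the latent diffusion noise-prediction loss, and the weighting `lam` is an illustrative assumption.

```python
# Sketch of the dual loss: text-space loss teaches the LLM *where* to emit
# vokens; a latent-diffusion-style loss aligns voken features with the image.
# The MSE stand-in and the weight lam are illustrative assumptions.
import math

def cross_entropy(probs, target_idx):
    """Text-space loss at one position: -log p(target token)."""
    return -math.log(probs[target_idx])

def mse(pred, target):
    """Stand-in for the latent diffusion loss (noise-prediction MSE)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(probs, target_idx, voken_feat, image_feat, lam=1.0):
    """Combined objective: token placement + visual-feature alignment."""
    return cross_entropy(probs, target_idx) + lam * mse(voken_feat, image_feat)

loss = total_loss(
    probs=[0.1, 0.7, 0.2],  # model's distribution over 3 tokens
    target_idx=1,           # ground truth: a voken belongs at this position
    voken_feat=[0.5, 0.5],  # mapped voken feature
    image_feat=[0.0, 1.0],  # target visual feature from the image
    lam=1.0,
)
```

Because the second term is supervised by the image itself, no caption is needed, which is what makes the learning description-free.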
Training strategy
Given the significant domain shift between the text and image domains, the researchers found that training directly on a limited interleaved text-and-image dataset can lead to misalignment and degraded image quality. They therefore train in two distinct stages: the unimodal alignment stage followed by the multimodal learning stage described above.
Experiments and Results
To evaluate the model's effectiveness, the researchers selected multiple benchmarks for a series of evaluations. The experiments aim to answer several key questions:
- Can MiniGPT-5 generate trustworthy images and reasonable text?
- How does MiniGPT-5 compare with other SOTA models on single-turn and multi-turn interleaved vision-and-language generation tasks?
- What impact does the design of each module have on overall performance?
To evaluate the model's performance at different training stages on different benchmarks, qualitative samples from MiniGPT-5 are shown in Figure 3:

The evaluation spans both the visual domain (image-related metrics) and the language domain (text metrics) to demonstrate the generality and robustness of the proposed model.
VIST Final-Step Evaluation
The first set of experiments is a single-step evaluation: the model is prompted to generate the corresponding image at the last step, with results shown in Table 1.
[Table 1: VIST final-step results. MiniGPT-5 is compared with SD 2 and a MiniGPT-5 (LoRA) variant on CLIP and FID metrics, along with a MiniGPT-5 w/o UAS ablation isolating the contribution of the training strategy.]

VIST Multi-Step Evaluation
In a more detailed and comprehensive evaluation, the researchers systematically provided the model with the preceding historical context and then evaluated the generated images and narratives at each step.
Tables 2 and 3 summarize these experiments, reporting image and language metrics respectively. The results show that MiniGPT-5 can generate coherent, high-quality images from long-horizon multimodal input prompts across all data, without impairing the original model's multimodal understanding ability. This highlights MiniGPT-5's efficacy across different settings.


VIST Human Evaluation
As shown in Table 4, compared with the two-stage baseline that uses text-to-image prompts for narration without vokens, MiniGPT-5 generated more appropriate text narration in 57.18% of cases, provided better image quality in 52.06% of cases, and produced more coherent multimodal output in 57.62% of scenarios. These data clearly demonstrate its stronger multimodal generation ability.

MMDialog Multi-Turn Evaluation
As shown in Table 5, MiniGPT-5 outperforms the baseline model Divter in generating more accurate text responses. Although the generated images are of similar quality, MiniGPT-5 surpasses the baseline in MM-relevance, indicating that it better learns when to place image generation and produces highly consistent multimodal responses.

So how well does it work? Let's look at MiniGPT-5's outputs. Figure 7 compares MiniGPT-5 with baseline models on the CC3M validation set.

Figure 8 compares MiniGPT-5 with baseline models on the VIST validation set.

Figure 9 compares MiniGPT-5 with baseline models on the MMDialog test set.

For more research details, please refer to the original paper.