
MiniGPT-5 unifies image and text generation: the model can continue a story and illustrate it automatically

Tech | 2023-10-09


Machine Heart Report

Machine Heart Editorial Department

OpenAI's GPT-5 still seems far away, but researchers have already pioneered MiniGPT-5, an innovative model for interleaved vision-and-language generation. This is significant for producing images accompanied by coherent textual descriptions.

Large models are making breakthroughs in both language and vision, with the potential to seamlessly understand and generate text and image content. In a recent series of studies, multimodal feature integration has been not only an evolving trend but also the driver of key advances ranging from multimodal dialogue to content-creation tools. Large language models have demonstrated unparalleled capability in text comprehension and generation. However, generating images together with coherent textual narratives remains an underdeveloped area.

Recently, a research team at the University of California, Santa Cruz proposed MiniGPT-5, an innovative interleaved vision-and-language generation technique built on the concept of the "generative voken".

Paper address:
https://browse.arxiv.org/pdf/2310.02239v1.pdf

Project address:
https://github.com/eric-ai-lab/MiniGPT-5

By coupling the Stable Diffusion mechanism with the LLM through a special visual token, the "generative voken", MiniGPT-5 heralds a new paradigm for proficient multimodal generation. Meanwhile, the two-stage training method proposed in the paper highlights the importance of a description-free foundational stage, enabling the model to "thrive" even when data is scarce. The generic stage of this method requires no domain-specific annotations, which sets this solution apart from existing approaches. To keep the generated text and images harmonious and consistent, the paper's dual-loss strategy comes into play, and the generative-voken method together with classifier-free guidance further reinforce this effect.

Building on these techniques, this work marks a transformative approach. Using ViT (Vision Transformer) and Q-Former together with a large language model, the research team transforms multimodal inputs into generative vokens and pairs them seamlessly with the high-resolution Stable Diffusion 2.1 to achieve context-aware image generation. The paper combines images as auxiliary input with instruction-tuning methods, and pioneers the joint use of text- and image-generation losses, thereby expanding the synergy between text and vision.

Paired with constraints such as CLIP, MiniGPT-5 cleverly integrates the diffusion model with MiniGPT-4, achieving good multimodal results without relying on domain-specific annotations. Most importantly, the proposed strategy can leverage progress in multimodal vision-language foundation models, offering a new blueprint for enhancing multimodal generation capabilities.

As shown in the figure below, in addition to its original multimodal understanding and text generation capabilities, MiniGPT-5 can also deliver reasonable and coherent multimodal output:

The contribution of this article is reflected in three aspects:

  • It proposes using a multimodal encoder, a novel and generic technique shown to be more effective than LLMs and inverted generative vokens, and combines it with Stable Diffusion to generate interleaved vision-and-language outputs (a multimodal language model capable of multimodal generation).
  • It highlights a new two-stage training strategy for description-free multimodal generation. The unimodal alignment stage extracts high-quality, text-aligned visual features from a large number of text-image pairs. The multimodal learning stage includes a novel training task, prompt-context generation, ensuring that visual and textual prompts coordinate well during generation. Adding classifier-free guidance during the training phase further improves generation quality.
  • Compared with other multimodal generation models, MiniGPT-5 achieves state-of-the-art performance on the CC3M dataset. MiniGPT-5 also sets new benchmarks on well-known datasets such as VIST and MMDialog.

Next, let's take a look at the details of the study.

Method Overview

To equip large language models with multimodal generation capabilities, the researchers introduce a structured framework that integrates a pretrained multimodal large language model with a text-to-image generation model. To bridge the gap between the two model domains, they introduce special visual tokens called "generative vokens", which can be trained directly on raw images. In addition, a two-stage training method is proposed and combined with a classifier-free guidance strategy to further improve generation quality.
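To make the architecture concrete, here is a minimal sketch of how such a framework could be wired together, assuming a MiniGPT-4-style multimodal LLM and a Stable Diffusion 2.1 U-Net. All class, method, and dimension names below are illustrative assumptions, not the authors' actual API:

```python
import torch
import torch.nn as nn

class VokenPipeline(nn.Module):
    """Hypothetical wiring: an LLM emits 'generative voken' positions whose
    hidden states, after mapping, condition a Stable Diffusion U-Net."""

    def __init__(self, mllm, unet, n_vokens=8, llm_dim=4096, cond_dim=1024):
        super().__init__()
        self.mllm = mllm        # pretrained multimodal LLM (MiniGPT-4 style)
        self.unet = unet        # Stable Diffusion 2.1 denoising U-Net
        self.n_vokens = n_vokens
        # Compact mapping module: LLM hidden size -> diffusion conditioning size.
        self.mapper = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, input_ids, pixel_values):
        # The LLM reads interleaved text/image input; where an image should
        # appear in the output, it emits n_vokens special voken positions.
        out = self.mllm(input_ids=input_ids, pixel_values=pixel_values,
                        output_hidden_states=True)
        voken_hidden = out.hidden_states[-1][:, -self.n_vokens:, :]
        # Mapped voken features stand in for CLIP text embeddings as the
        # cross-attention conditioning of the diffusion U-Net.
        cond = self.mapper(voken_hidden)
        return out.logits, cond
```

At sampling time, `cond` would be handed to the U-Net as its `encoder_hidden_states` in place of a CLIP text embedding.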

Multimodal input stage

The latest progress in multimodal large models (such as MiniGPT-4) has focused mainly on multimodal understanding, where images are processed as continuous inputs. To extend this to multimodal generation, the researchers introduce generative vokens designed specifically for outputting visual features. They also adopt parameter-efficient fine-tuning within the large language model (LLM) framework for multimodal output learning.
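As a rough illustration of both ideas, here is a hedged sketch using Hugging Face `transformers` and `peft`; the voken token strings, the base checkpoint, and the LoRA hyperparameters are our assumptions, not values from the paper:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Register generative voken tokens; their embeddings are trained from scratch
# and serve purely as output slots for visual features.
vokens = [f"[IMG{i}]" for i in range(8)]
tokenizer.add_special_tokens({"additional_special_tokens": vokens})
model.resize_token_embeddings(len(tokenizer))

# Parameter-efficient fine-tuning: LoRA adapters on the attention projections
# keep the bulk of the pretrained LLM frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```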

Multimodal output generation

To accurately align the generated tokens with the generative model, the researchers developed a compact mapping module for dimension matching and incorporated several supervised losses, including a text-space loss and a latent diffusion model loss. The text-space loss helps the model learn the correct placement of tokens, while the latent diffusion loss directly aligns the tokens with appropriate visual features. Because the features of the generative vokens are guided directly by images, this method requires no comprehensive image descriptions, thus achieving description-free learning.
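A hedged sketch of this dual-loss idea with `diffusers`-style components is shown below: a text-space cross-entropy term teaches the LLM where to emit vokens, while a latent-diffusion denoising term aligns the mapped voken features with the target image. The function signature and the loss weight `lam` are our assumptions:

```python
import torch
import torch.nn.functional as F

def dual_loss(logits, labels, unet, vae, scheduler, cond, images, lam=1.0):
    # 1) Text-space loss: standard next-token prediction, with the voken
    #    token ids included in the label sequence at image positions.
    text_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                labels.view(-1), ignore_index=-100)

    # 2) Latent diffusion loss: noise the VAE latents of the target image and
    #    ask the U-Net, conditioned on mapped voken features, to predict the noise.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.size(0),), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    diff_loss = F.mse_loss(pred, noise)

    return text_loss + lam * diff_loss
```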

Training strategy

Given the significant domain shift between the text and image domains, the researchers found that training directly on a limited interleaved text-and-image dataset can lead to misalignment and reduced image quality, which motivates the two-stage training strategy described above.
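The classifier-free guidance mentioned earlier is commonly implemented by randomly dropping the conditioning during training, then blending conditional and unconditional predictions at sampling time. A minimal sketch, with the drop probability and guidance scale as our assumptions:

```python
import torch

def maybe_drop_condition(cond, uncond, p_drop=0.1):
    """Training time: with probability p_drop per sample, swap the voken
    conditioning for an unconditional (e.g. learned null) embedding."""
    mask = (torch.rand(cond.size(0), device=cond.device) < p_drop).view(-1, 1, 1)
    return torch.where(mask, uncond.expand_as(cond), cond)

@torch.no_grad()
def guided_noise_pred(unet, noisy, t, cond, uncond, scale=7.5):
    """Sampling time: standard classifier-free guidance blend."""
    eps_c = unet(noisy, t, encoder_hidden_states=cond).sample
    eps_u = unet(noisy, t, encoder_hidden_states=uncond).sample
    # Push the prediction away from the unconditional estimate.
    return eps_u + scale * (eps_c - eps_u)
```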


Experiments and Results

To evaluate the effectiveness of the model, the researchers selected multiple benchmarks and ran a series of evaluations. The experiments aim to answer several key questions:

  • Can MiniGPT-5 generate trustworthy images and reasonable text?
  • How does MiniGPT-5 perform compared to other SOTA models in single and multi round interleaved visual language generation tasks?
  • What impact does the design of each module have on overall performance?

To evaluate the model's performance on different benchmarks at different training stages, quantitative analysis samples of MiniGPT-5 are shown in Figure 3:

The evaluation spans both the visual domain (image-related metrics) and the linguistic domain (text metrics), demonstrating the generality and robustness of the proposed model.
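For reference on the image-side metrics, here is a minimal sketch of computing a CLIP image-text similarity score with the standard `transformers` CLIP API; the checkpoint choice is our assumption, and this is not necessarily the exact evaluation script used in the paper:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a caption."""
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```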

VIST Final-Step Evaluation

The first set of experiments involves a single-step evaluation: the model is prompted to generate the corresponding image at the final step, with results shown in Table 1.

[Table 1: final-step results on VIST, reporting CLIP and FID scores for SD 2 and MiniGPT-5 variants, including MiniGPT-5 (LoRA) and MiniGPT-5 w/o UAS training strategy.]

VIST Multi-Step Evaluation

In a more detailed and comprehensive evaluation, the researchers systematically provided the model with the preceding historical context and then assessed the generated images and narratives at each subsequent step.

Tables 2 and 3 summarize the results of these experiments, covering image and language metrics respectively. The results show that MiniGPT-5 can generate coherent, high-quality images from long-horizon multimodal input prompts across all the data, without impairing the original model's multimodal understanding ability. This highlights MiniGPT-5's efficacy across different settings.

VIST Human Evaluation

As shown in Table 4, MiniGPT-5 generated more fitting text narratives in 57.18% of cases, provided better image quality in 52.06% of cases, and produced more coherent multimodal output in 57.62% of scenarios. Compared with the two-stage baseline that narrates via text-to-image prompts, these figures clearly demonstrate its stronger multimodal generation ability.

MMDialog Multi-Turn Evaluation

As shown in Table 5, MiniGPT-5 outperforms the baseline model Divter in generating more accurate text responses. Although the generated images are of similar quality, MiniGPT-5 surpasses the baseline on MM-Relevance, indicating that it better learns when to position image generation and how to produce highly consistent multimodal responses.

How does it look in practice? Let's examine MiniGPT-5's outputs. Figure 7 compares MiniGPT-5 with baseline models on the CC3M validation set.

Figure 8 compares MiniGPT-5 with baseline models on the VIST validation set.

Figure 9 compares MiniGPT-5 with baseline models on the MMDialog test set.

For more research details, please refer to the original paper.

