POINTS: Improving Your Vision-language Model with Affordable Strategies (2024)

Yuan Liu1, Zhongyin Zhao1,2,∗, Ziyuan Zhuang1,3,∗, Le Tian1, Xiao Zhou1, Jie Zhou1
1Pattern Recognition Center, WeChat AI, Tencent Inc, China
2Shanghai Jiao Tong University
3Nanjing University
{bensenliu}@tencent.com
∗Work done during an internship at WeChat AI

Abstract

In recent years, vision-language models have made significant advancements, excelling in tasks once deemed challenging, such as optical character recognition and geometric problem-solving. Despite these impressive achievements, several critical issues remain unaddressed: 1) Proprietary models rarely disclose detailed information about their architectures, and while open-source models document their training strategies, detailed ablations of those strategies are still lacking. 2) Pre-training data is under-explored in open-source work; most efforts empirically add datasets from diverse sources, making the entire process elusive and cumbersome. 3) During the fine-tuning stage, the focus is often on adding and ablating more datasets, which frequently leads to diminishing returns. Refining data schemes is therefore essential for further enhancing model performance. To address these issues, we propose the following contributions in this paper: 1) We train a robust baseline model that leverages the latest technological advancements in vision-language models; building upon these advancements, we introduce effective improvements and conduct comprehensive ablation and validation for each technique incorporated into this strong baseline. 2) Inspired by recent work on large language models, we propose filtering pre-training data using perplexity, selecting the data with the lowest perplexity as the training set. This approach allows us to train on a curated 1M dataset and achieve highly competitive performance. 3) During the visual instruction tuning stage, we experiment with model soup over different datasets once introducing more datasets into the training set brings only marginal improvements. Integrating these innovations, we obtain a model with 9B parameters that performs competitively with a series of existing state-of-the-art models. Moreover, the strategies we propose are efficient and relatively lightweight, allowing the community to adopt them easily for their own models.

1 Introduction

Advancements in large language models (LLMs; Chowdhery et al. 2023, Jiang et al. 2023, OpenAI 2022, Yang et al. 2024, Dubey et al. 2024) have significantly enhanced the capabilities of large vision-language models (Fu et al. 2023, Liu et al. 2023b, OpenAI 2023, Dong et al. 2024a, Zhu et al. 2023), enabling more sophisticated analysis of textual and visual information. Prominent closed-source models such as GPT-4 (OpenAI, 2023), Gemini Pro 1.5 (Fu et al., 2023), and Claude 3 (Anthropic, 2024) have achieved remarkable success in extending LLMs into the realm of vision-language models. Concurrently, open-source vision-language models are also advancing rapidly, with numerous notable contributions emerging in the field (Liu et al., 2024b; Chen et al., 2024c).

Historically, LLaVA (Liu et al., 2024b) has served as a common baseline. However, recent advancements have rendered its performance suboptimal, so a stronger baseline is needed for further exploration. In this work, we enhance the vanilla LLaVA architecture by refining the pre-training dataset. Inspired by CapFusion (Yu et al., 2024), we merge the original captions, which carry world knowledge, with generated captions that exhibit good grammatical structure. For visual instruction tuning datasets, we adopt Individual Select (Liu et al., 2024c) to curate effective instruction tuning datasets. Regarding model architecture, we first incorporate Dynamic High Resolution to help the model capture fine-grained details. To address the image distortion inherent in Dynamic High Resolution, we propose a novel image splitting strategy called Consistent Aspect Ratio Dynamic High Resolution, which maintains a consistent aspect ratio. Additionally, inspired by Vary (Wei et al., 2023), we merge features from a vision encoder trained separately on text-rich data with those from the original vision encoder, significantly boosting the model's OCR capabilities. Unlike most existing works (Li et al., 2024; Chen et al., 2024c), we extensively ablate each newly introduced component in the strong baseline to verify its individual benefit.

Recent works seldom explore the optimization of pre-training datasets. Most studies (Chen et al., 2024c; Yao et al., 2024; Bai et al., 2023b) empirically combine samples from various large-scale datasets (Schuhmann et al., 2022; Byeon et al., 2022), often leading to inefficient and computationally expensive pre-training. In the domain of large language models, some research leverages perplexity to filter pre-training datasets. Inspired by this approach, we filter our pre-training dataset by selecting the samples with the lowest perplexity values. This filtering yields a subset of 1 million samples, on which we subsequently pre-train our model. Experimental results demonstrate that the model trained on this filtered subset outperforms a model trained on a dataset five times larger.

In the visual instruction tuning stage, most existing works (Liu et al., 2024c; Li et al., 2024; Chen et al., 2024c) focus on collecting large quantities of datasets and performing ablation studies to select the most effective ones. However, this approach often reaches a plateau, where introducing additional datasets yields only marginal or even degraded performance. Previous research on model soup has demonstrated the benefits of merging weights from models fine-tuned with different hyper-parameters. In this work, we propose using model soup to merge weights from models fine-tuned with different datasets, further improving performance when dataset selection no longer brings significant improvement. Compared to applying model soup to models fine-tuned with different hyper-parameters, e.g., the learning rate, the improvement from model soup over models fine-tuned with different datasets is much more prominent. Following this line of work, we further experiment with different model soup strategies and find that greedy model soup is the most effective.

By integrating the aforementioned innovations, we have developed a model called POINTS. Our contributions are threefold:

• We propose a strong baseline that integrates the latest advancements in vision-language models and thoroughly verify the effectiveness of each component.

• We introduce the use of perplexity to filter the pre-training dataset and conduct a detailed investigation of the data distribution across different perplexity intervals.

• We employ model soup to merge models fine-tuned with different datasets, thereby enhancing model performance when further dataset selection yields only marginal improvements.

2 Related Works

Multimodal Large Language Models

The rapid advancement of large language models (LLMs; Dubey et al. 2024, Team et al. 2023, Achiam et al. 2023, Yang et al. 2024, Su et al. 2022) has laid the groundwork for the emergence of multimodal large language models (MLLMs; Li et al. 2024, Liu et al. 2024b, Liu et al. 2024a, Bai et al. 2023a, Qiao et al. 2024), which integrate visual perception and comprehension with language reasoning. Prominent models such as GPT-4V (Achiam et al., 2023) and Gemini-1.5-Pro (Team et al., 2023), developed by major corporations, have spearheaded the MLLM era using proprietary training data and undisclosed training methodologies. Meanwhile, open-source models have been striving to keep pace. For instance, LLaVA-Next (Liu et al., 2024a) and InternVL-1.5 (Chen et al., 2024c) introduce dynamic high-resolution techniques that divide a large image into multiple smaller tiles with ratio-inconsistent resizing. MiniCPM-V (Yao et al., 2024) employs a specialized vision encoder to generate non-square image patches. Models such as Vary (Wei et al., 2023), SPHINX (Lin et al., 2023), Cambrian-1 (Tong et al., 2024), and Mini-Gemini (Li et al., 2023) propose dual vision encoders to enhance visual capabilities. Furthermore, significant progress in multimodal model evaluation (Liu et al., 2023c; Chen et al., 2024b; Fang et al., 2024) has also contributed to the rapid improvement of large vision-language models. In this work, we introduce POINTS, a model trained exclusively on fully open-source datasets during both the pre-training and supervised fine-tuning (SFT) stages, demonstrating promising results on extensive benchmarks.

Visual Instruction Tuning

The selection of training data for multimodal models is of paramount importance (Laurençon et al., 2024; Tong et al., 2024), and most improvements in existing works stem from detailed ablation of instruction tuning datasets (Li et al., 2024; Liu et al., 2024b; Chen et al., 2024c). The common approach to selecting the most effective datasets is to iteratively add each dataset to the pool: if it brings improvement, it is kept; otherwise, it is dropped. However, this approach eventually plateaus, as further additions yield only marginal improvements. To push performance further, we propose employing model soup (Wortsman et al., 2022) on models fine-tuned with different visual instruction tuning datasets. This method merges the model weights after visual instruction tuning on diverse datasets, resulting in notable performance improvements.

3 Methods

This section is divided into three parts: i) In subsection 3.1, we integrate various techniques from previous methods (Liu et al., 2024a; Lin et al., 2023; Wei et al., 2023; Liu et al., 2024c; Chen et al., 2024c) to create a strong baseline for further experiments. Additionally, we propose a novel dynamic-resolution splitting method, termed Consistent Aspect Ratio Dynamic High Resolution (CATTY for short), to mitigate image distortion. ii) In subsection 3.2, we propose using perplexity to filter the pre-training dataset. iii) Finally, in subsection 3.3, we incorporate the concept of model soup (Wortsman et al., 2022) into the instruction tuning stage. We find that this straightforward approach can significantly improve the model's performance, especially when further data selection brings only marginal or even degraded performance.

3.1 A Strong Baseline

In this section, we integrate recent advancements from existing works to create a strong baseline: Dynamic High Resolution from LLaVA-Next (Liu et al., 2024a) and InternVL-1.5 (Chen et al., 2024c), CapFusion (Yu et al., 2024), the Dual Vision Encoder from Vary (Wei et al., 2023) and SPHINX (Lin et al., 2023), and Individual Select (Liu et al., 2024c). Following LLaVA (Liu et al., 2024b), POINTS consists of three main parts: a vision encoder, a projector, and a large language model. Integrating all these practices from previous works yields the model structure and pipeline shown in Figure 1.

[Figure 1: Model structure and pipeline of POINTS.]

Dynamic High Resolution

It has been verified that feeding high-resolution images to vision-language models helps capture fine-grained details and reduces hallucinations (Liu et al., 2023b). To enable a vision encoder with a fixed input resolution to accommodate dynamic image resolutions, Dynamic High Resolution in LLaVA-Next (Liu et al., 2024a) and InternVL-1.5 (Chen et al., 2024c) splits a high-resolution image into several tiles of the same resolution, which the original vision encoder can process. The concrete steps are as follows: i) The maximum number of tiles an image can be split into is predefined (set to 8 in our experiments). ii) Based on this maximum, a table is created containing information about the target image before splitting; the key is the aspect ratio, and the value is the width and height of the target image, both evenly divisible by the input resolution of the vision encoder. iii) For each image, the target resolution is fetched from the pre-computed table according to aspect-ratio similarity, and the image is resized to the target resolution and split into several tiles of the same resolution.
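
To make the tiling procedure concrete, the sketch below shows one way the pre-computed aspect-ratio table and the target-resolution lookup could be implemented. The maximum tile count of 8 follows our setting; the 336-pixel tile size matches the CLIP-ViT-Large-336 encoder used later, and the function names are illustrative rather than taken from our codebase.

```python
from itertools import product

def build_ratio_table(max_tiles: int, tile_size: int):
    """Map each candidate (cols, rows) tile grid to the target resolution
    (width, height) of the image before splitting."""
    table = {}
    for rows, cols in product(range(1, max_tiles + 1), repeat=2):
        if rows * cols <= max_tiles:
            table[(cols, rows)] = (cols * tile_size, rows * tile_size)
    return table

def pick_target_resolution(width: int, height: int, table):
    """Pick the grid whose aspect ratio (width / height) is closest to the image's."""
    img_ratio = width / height
    best = min(table, key=lambda grid: abs(grid[0] / grid[1] - img_ratio))
    return table[best]  # (target_width, target_height)

# Example: an 800x600 image, at most 8 tiles of 336x336 each.
table = build_ratio_table(max_tiles=8, tile_size=336)
print(pick_target_resolution(800, 600, table))  # (1008, 672) for a 3x2 grid
```

Real implementations may additionally break ties by preferring grids that cover more area; this sketch simply takes the closest aspect ratio.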

Consistent Aspect Ratio Dynamic High Resolution (CATTY)

Before splitting the image, Dynamic High Resolution in InternVL-1.5 (Chen et al., 2024c) resizes the image to the target resolution. However, this resizing is not proportional to the image's original aspect ratio, which can cause distortion; this issue has been discussed in previous work (Yao et al., 2024). We therefore propose a splitting method that maintains the image's aspect ratio, named Consistent Aspect Ratio Dynamic High Resolution (see Figure 2). The first two steps of CATTY are the same as in InternVL-1.5, and the last step works as follows. Given an image with height H and width W, we obtain the height and width of the referenced image from the pre-computed table, denoted as H^r and W^r, respectively. Then, we resize the image to the target size H^t × W^t by:

ratio = min(H, W) / min(H^r, W^r)        (1)
H^t = H / ratio
W^t = W / ratio

Given the input resolution of the vision encoder, H^v × W^v, the target image should be divided into (H^r / H^v) × (W^r / W^v) tiles. Next, we split the target image, of size H^t × W^t, using a sliding window with strides (S^h, S^w) along the height and width, respectively. The strides (S^h, S^w) are computed as follows:

S^h = (H^t − H^v) / (H^r / H^v − 1)        (2)
S^w = (W^t − W^v) / (W^r / W^v − 1)
[Figure 2: Illustration of Consistent Aspect Ratio Dynamic High Resolution (CATTY).]

In Equation 2, S^h is set to 0 if H^r / H^v = 1, and similarly for S^w. This approach allows us to divide a high-resolution image into several tiles without introducing any distortion. Alongside the tiles obtained using CATTY, we also include a thumbnail of the global view of the image to capture the overall context; this thumbnail is resized to match the input resolution of the vision encoder. Before feeding the features output by the vision encoder into the large language model, we employ the pixel shuffle technique with a down-sampling factor of 0.25, as described in InternLM-XComposer2-4KHD (Dong et al., 2024b), to reduce the sequence length of the image features for improved efficiency.
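
The following sketch illustrates CATTY's aspect-ratio-preserving resize (Equation 1) and the sliding-window split (Equation 2), using PIL for the image operations. The helper name, the default 336×336 encoder resolution, and the rounding details are illustrative assumptions rather than our exact implementation.

```python
from PIL import Image

def catty_split(image: Image.Image, ref_h: int, ref_w: int,
                enc_h: int = 336, enc_w: int = 336):
    """Resize `image` proportionally toward the reference resolution (Eq. 1),
    then split it into (ref_h/enc_h) x (ref_w/enc_w) tiles with a sliding
    window whose strides follow Eq. 2."""
    w, h = image.size
    ratio = min(h, w) / min(ref_h, ref_w)              # Eq. 1
    tgt_h, tgt_w = round(h / ratio), round(w / ratio)
    image = image.resize((tgt_w, tgt_h))

    n_rows, n_cols = ref_h // enc_h, ref_w // enc_w
    stride_h = (tgt_h - enc_h) // (n_rows - 1) if n_rows > 1 else 0   # Eq. 2
    stride_w = (tgt_w - enc_w) // (n_cols - 1) if n_cols > 1 else 0

    tiles = [
        image.crop((c * stride_w, r * stride_h,
                    c * stride_w + enc_w, r * stride_h + enc_h))
        for r in range(n_rows) for c in range(n_cols)
    ]
    thumbnail = image.resize((enc_w, enc_h))  # global view of the whole image
    return tiles, thumbnail
```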

CapFusion

The original captions in existing pre-training datasets are often noisy and structurally flawed, making them sub-optimal for model training. To address this, synthetic captions generated by image captioning models, such as those in LAION-COCO and BLIP-LAION (Li et al., 2022), have been proposed. However, the simplistic syntactic and semantic structure of synthetic captions may contribute to issues like scalability deficiency and world-knowledge loss (Yu et al., 2024). CapFusion strikes a balance between these two types of captions by using a large language model to organically integrate raw and synthetic captions: it extracts real-world knowledge from the structurally flawed raw captions and merges it with the structured but syntactically simple synthetic captions. Following the CapFusion methodology, we use InternLM-XComposer2 (Dong et al., 2024a) to generate captions for images and InternLM2 (Cai et al., 2024) to integrate the original raw and synthetic captions. The prompts used to generate image captions and merge captions are provided in the appendix.
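
A minimal sketch of this data-construction loop is given below. The two model calls are abstracted as hypothetical callables (`caption_image` standing in for InternLM-XComposer2 and `fuse_captions` for InternLM2 with the fusion prompt), since the exact inference APIs and the prompts listed in the appendix are not reproduced here.

```python
def capfusion(samples, caption_image, fuse_captions):
    """Build CapFusion-style pre-training pairs.

    samples        : iterable of dicts with keys 'image' and 'raw_caption'
    caption_image  : callable(image) -> synthetic caption
                     (hypothetical wrapper around InternLM-XComposer2)
    fuse_captions  : callable(raw_caption, synthetic_caption) -> fused caption
                     (hypothetical wrapper around InternLM2 with the fusion prompt)
    """
    fused_pairs = []
    for sample in samples:
        synthetic = caption_image(sample["image"])
        fused = fuse_captions(sample["raw_caption"], synthetic)
        fused_pairs.append({"image": sample["image"], "caption": fused})
    return fused_pairs
```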

Dual Vision Encoder

Several previous works, such as SPHINX (Lin et al., 2023) and Cambrian-1 (Tong et al., 2024), have demonstrated that different vision encoders exhibit distinct advantages across domains, and that combining features from multiple encoders can lead to better and more robust performance. Unlike the perception and reasoning required for natural images, text-intensive images demand different capabilities from vision-language models (Wei et al., 2023). To enhance optical character recognition (OCR) capabilities, we train a separate vision encoder, referred to as the OCR ViT, to extract textual features from images, following the methodology of Vary (Wei et al., 2023). Unlike Vary, we do not construct training samples such as charts ourselves; instead, we use OCR results (extracted with PaddleOCR in our case) for pre-training. We also include natural captions in the pre-training dataset for the OCR vision encoder; more details on its composition are given in the following section. We merge the features from the general vision encoder (called the General ViT) and the OCR vision encoder using a weighted average before feeding them into the large language model.
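
The sketch below shows one way this weighted-average fusion could look before the projector. The module names and the default fusion weight of 0.5 are placeholders for illustration; only the use of a weighted average is specified above.

```python
import torch
import torch.nn as nn

class DualVisionEncoder(nn.Module):
    """Fuse General ViT and OCR ViT features with a weighted average
    before projecting them into the language model's embedding space."""

    def __init__(self, general_vit: nn.Module, ocr_vit: nn.Module,
                 projector: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.general_vit = general_vit
        self.ocr_vit = ocr_vit        # kept frozen after its OCR pre-training
        self.projector = projector    # two-layer MLP
        self.alpha = alpha            # fusion weight (illustrative default)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        general_feat = self.general_vit(pixel_values)      # (B, N, D)
        with torch.no_grad():
            ocr_feat = self.ocr_vit(pixel_values)          # (B, N, D)
        fused = self.alpha * general_feat + (1.0 - self.alpha) * ocr_feat
        return self.projector(fused)                       # visual tokens for the LLM
```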

Individual Select

Individual Select, proposed by Liu et al. (2024c), aims to identify the most effective instruction tuning datasets. Building on this approach, we adopt the dataset composition from Liu et al. (2024c) as our candidate pool and incorporate additional datasets used in DeepSeek-VL (Lu et al., 2024a), Cambrian-1 (Tong et al., 2024), and Cauldron (Laurençon et al., 2024). Ultimately, we integrate 16 more datasets into those identified by Liu et al. (2024c) (further details are provided in subsection 4.2). To enhance prompt diversity, given the homogeneity of prompt styles in academic datasets, we employ GPT-4o to generate question-answer pairs, in line with previous works (Lu et al., 2024a; Chen et al., 2024c); the prompt used to generate question-answer pairs is provided in the appendix. The images for these pairs are randomly selected from LAION-5B (Schuhmann et al., 2022). We refer to the final composition of visual instruction tuning datasets as the Base Set.
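
The iterative selection described above can be summarized by the following sketch, where `train_and_evaluate` is a hypothetical helper that fine-tunes the model on a given dataset pool and returns the overall benchmark score; this is a simplified view of the actual, more expensive procedure.

```python
def individual_select(candidates, base_pool, train_and_evaluate):
    """Greedy dataset selection: keep a candidate dataset only if adding it
    to the current pool improves the overall benchmark score.

    train_and_evaluate(datasets) -> float is a hypothetical helper that
    fine-tunes the model on the given datasets and returns the overall score.
    """
    pool = list(base_pool)
    best_score = train_and_evaluate(pool)
    for dataset in candidates:
        score = train_and_evaluate(pool + [dataset])
        if score > best_score:
            pool.append(dataset)
            best_score = score
    return pool
```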

3.2 Pre-train Data Selection

In the context of large language models, perplexity has long been employed as a metric to assess the quality of pre-training datasets (Albalak et al., 2024; Marion et al., 2023). Inspired by this approach, we use an off-the-shelf vision-language model, P (either the model obtained through the steps outlined in subsection 3.1 or an open-source VLM), to further filter the pre-training data obtained via CapFusion, as described above. For each item s in the pre-training dataset mentioned in subsection 3.1, we compute the perplexity over all text tokens using the following formula:

Perplexity(s) = exp( -(1/N) Σ_{i=1}^{N} log P(w_i | w_1, w_2, …, w_{i-1}) )        (3)

Here, {w_1, …, w_N} denotes the text token sequence of s. We sort all items in ascending order of perplexity and select the first 20% for the pre-training stage. Upon closer examination of the first and last 20% of items, we observe that the distinguishing factor is not data quality, which contrasts with observations in large language models. The last 20% of items often contain obscure world knowledge, such as game version numbers and computer factory serial numbers. Such knowledge is extremely rare and carries very little information, making it less beneficial for the model's learning. In the appendix, we provide examples randomly sampled from the first and last 20% of items.
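
The filtering step of Equation 3 and the bottom-20% selection can be sketched as follows; `token_log_probs_fn` is a hypothetical wrapper around the scoring VLM that returns the conditional log-probability of every text token of a sample.

```python
import math

def perplexity(token_log_probs):
    """Eq. 3: exponential of the negative mean token log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def filter_by_perplexity(samples, token_log_probs_fn, keep_ratio=0.2):
    """Keep the fraction of samples with the lowest perplexity.

    token_log_probs_fn(sample) -> list[float] is a hypothetical wrapper around
    the scoring VLM that returns log P(w_i | w_1, ..., w_{i-1}) for every text
    token of the sample's caption.
    """
    scored = [(perplexity(token_log_probs_fn(s)), s) for s in samples]
    scored.sort(key=lambda pair: pair[0])          # ascending perplexity
    keep = int(len(scored) * keep_ratio)
    return [sample for _, sample in scored[:keep]]
```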

3.3 Instruction Data Selection with Model Soup

Visual instruction tuning data is crucial to the superior performance of existing vision-language models (Chen et al., 2024c; Dong et al., 2024a; Liu et al., 2024b). However, most existing works focus on selecting more effective datasets through iterative ablation, and in many cases this approach reaches a plateau, where further data selection brings only marginal improvements or even degrades performance. In this section, we describe how model soup can integrate the advantages of models fine-tuned with different instruction tuning datasets once data selection hits this bottleneck. The philosophy behind model soup is as follows: given a pre-trained model, fine-tuning it with different hyper-parameters h_1, …, h_k yields several fine-tuned models that converge to different local optima, denoted f(θ_1, h_1), …, f(θ_k, h_k). These hyper-parameters include the learning rate, data augmentation, initialization seed, etc. By interpolating the weights of these fine-tuned models, we can often obtain a stronger model, f(θ_s, h_s). Given the pre-trained model obtained through the methods discussed above, a base instruction tuning dataset D, and a series of candidate visual instruction tuning datasets d_1, …, d_k, we obtain a stronger model through the following steps:

• For each dataset d_i ∈ {d_1, …, d_k}, we add it to the base instruction tuning dataset D to obtain an augmented dataset D*_i.

• We train k models concurrently, one on each augmented dataset in {D*_1, …, D*_k}, obtaining {f(D*_1; θ_1), …, f(D*_k; θ_k)}.

• We select p models from {f(D*_1; θ_1), …, f(D*_k; θ_k)} and merge the weights of the selected models to obtain a stronger model.

For the third step above, we choose several methods to select the best composition of fine-tuned models to obtain a final model with superior performance, namely, Maximum Soup, Average Soup, and Greedy Soup.

Maximum Soup

Given an evaluation score Acc, we obtain a stronger model f(θ_s) using the following formula:

{θ_i}_{i=1}^{p} = Arg_θ( Top_p( {Acc(f(D*_1; θ_1)), …, Acc(f(D*_k; θ_k))} ) )        (4)
f(θ_s) = f( (1/p) Σ_{i=1}^{p} θ_i )

Average Soup

By taking the average of the weights of all fine-tuned models, we obtain a stronger model f(θ_s):

f(θ_s) = f( (1/k) Σ_{i=1}^{k} θ_i )        (5)

Greedy Soup

We start by sorting the fine-tuned models in descending order of their evaluation scores. We then iterate through the sorted models; for each model, we average its weights with those of all models currently in the pool, and if the evaluation score improves, the model is added to the pool. Finally, we average the weights of all models in the pool to obtain a stronger model, denoted f(θ_s). The algorithm below outlines the detailed pipeline of Greedy Soup.

1: INPUT: k fine-tuned models with different datasets, {f(D*_i; θ_i)}
2: INPUT: the evaluation score, Acc
3: INPUT: model pool, P ← {}
4: for i = 1 to k do
5:     if Acc(f(average(P ∪ {θ_i}))) ≥ Acc(f(average(P))) then
6:         P ← P ∪ {θ_i}
7:     end if
8: end for
9: return average(P)
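
The greedy procedure above can be sketched in PyTorch-style state-dict arithmetic as follows. `evaluate` is a hypothetical helper that loads a state dict into the model and returns the overall benchmark score, and the pool is seeded with the best single model, consistent with iterating in descending order of score. Maximum soup then corresponds to averaging the top-p state dicts directly, and average soup to averaging all of them.

```python
import copy

def average_weights(state_dicts):
    """Element-wise average of a list of model state dicts."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        for sd in state_dicts[1:]:
            avg[key] = avg[key] + sd[key]
        avg[key] = avg[key] / len(state_dicts)
    return avg

def greedy_soup(state_dicts, scores, evaluate):
    """Greedy soup over models fine-tuned with different datasets.

    state_dicts : weights of the k fine-tuned models
    scores      : their individual evaluation scores
    evaluate    : hypothetical helper, evaluate(state_dict) -> overall score
    """
    order = sorted(range(len(state_dicts)), key=lambda i: scores[i], reverse=True)
    pool = [state_dicts[order[0]]]                 # seed with the best model
    best = evaluate(average_weights(pool))
    for i in order[1:]:
        candidate = evaluate(average_weights(pool + [state_dicts[i]]))
        if candidate >= best:                      # keep it only if it does not hurt
            pool.append(state_dicts[i])
            best = candidate
    return average_weights(pool)
```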

4 Experiments

This section is divided into five subsections: (i) evaluation setup, (ii) pre-training and instruction-tuning datasets used to train the strong baseline, as well as the instruction-tuning datasets selected by model soup, (iii) details about the training setup for the OCR ViT pre-training, the vision-language pre-training, and the visual instruction tuning stages, (iv) ablation studies and analyses of each component used to build our final model, and (v) comparison with other works on extensive benchmarks.

4.1 Evaluation Setup

Before embarking on our exploration, we sought a robust evaluation protocol to comprehensively assess the various capabilities of our model. This is where OpenCompass (Contributors, 2023) proves helpful. OpenCompass proposes eight benchmarks to balance the evaluation of a model from different perspectives: MMBench (Liu et al., 2023c) and MMStar (Chen et al., 2024a) for diagnosing general abilities, MMMU (Yue et al., 2024) for STEM-related abilities, HallusionBench (Liu et al., 2023a) for model hallucination, MathVista (Lu et al., 2023) for math-related abilities, AI2D (Kembhavi et al., 2016) for chart-related abilities, OCRBench (Liu et al., 2023d) for OCR capabilities, and MMVet (Yu et al., 2023) for subjective evaluation. By averaging the metrics from these benchmarks, OpenCompass derives a score representing the comprehensive ability of a model. It also offers a useful tool, VLMEvalKit (Duan et al., 2024), for one-click evaluation. Unless otherwise specified, we use these eight benchmarks for our ablation studies, with the exception of MMBench, for which we use the dev-en split.

4.2 Data Setup

Pre-train Dataset

To train the OCR ViT, we randomly select 20 million samples from LAION-5B-en (Schuhmann et al., 2022), LAION-5B-cn (Schuhmann et al., 2022), WuKong (Gu et al., 2022), and Zero (Gu et al., 2022). We then use PaddleOCR to extract text from the images and replace the original captions with the extracted text, forming new image-caption pairs for pre-training. Following Vary (Wei et al., 2023), we also include 10 million original samples from LAION-5B, whose captions are the original ones crawled from the Internet. However, we do not adopt the cumbersome pipeline of constructing a new dataset for OCR enhancement, such as crawling PDF files and converting them to images for training (Bai et al., 2023b), as our existing pipeline already performs well on OCR-related tasks. For the vision-language pre-training used to build the strong baseline, we construct 20 million samples from LAION-5B with CapFusion (these do not overlap with the data used for OCR ViT pre-training). From this set, we select 5 million samples, as this setting works best, similar to the observation in Liu et al. (2024c). From these 5 million samples, we further select a 1-million-sample subset for the final vision-language alignment by choosing the top 20% of data with the lowest perplexity.
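
For the OCR ViT data, the caption-replacement step can be sketched as follows. The PaddleOCR call follows a common usage pattern, but the exact API and result structure vary across PaddleOCR versions, so this is illustrative rather than the exact script we used.

```python
from paddleocr import PaddleOCR  # pip install paddleocr; API varies across versions

ocr_engine = PaddleOCR(lang="en")

def ocr_caption(image_path: str) -> str:
    """Replace the original web caption with the concatenated OCR text."""
    result = ocr_engine.ocr(image_path)
    lines = result[0] or []                        # one entry per detected text line
    return " ".join(line[1][0] for line in lines)  # line = [bbox, (text, confidence)]

def build_ocr_pairs(samples):
    """samples: iterable of dicts with 'image_path' and 'caption'.
    Returns new image-caption pairs whose captions are the OCR results."""
    return [{"image_path": s["image_path"], "caption": ocr_caption(s["image_path"])}
            for s in samples]
```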

Visual Instruction Tuning Dataset

Based on the datasets identified by Liu et al. (2024c), we further employ Individual Select to choose additional datasets from those proposed in Lu et al. (2024a), Tong et al. (2024), and Laurençon et al. (2024). The final composition of datasets used to construct the robust baseline, referred to as the Base Set, is presented in the appendix, where the datasets later selected by model soup to derive the final model are highlighted in red.

4.3 Training Setup

Pre-training Setup for OCR ViT

The pre-training framework follows the standard LLaVA-style architecture (Liu et al., 2023b), comprising a vision encoder, a two-layer MLP, and a large language model. The vision encoder is initialized from OpenAI's CLIP-ViT-Large-336 (https://huggingface.co/openai/clip-vit-large-patch14-336), while the large language model is initialized from Yi-1.5-9B-Chat (Young et al., 2024). Throughout the pre-training stage, the large language model remains frozen, whereas the vision encoder and MLP are trainable. The learning rates for the vision encoder and MLP are set to 2×10^-4 and 2×10^-5, respectively, with a warm-up schedule during the first 3% of steps, followed by a cosine decay schedule for the remaining steps.

Setup for the Vision-language Pre-training Stage

The General ViT, depicted in Figure 1, is initialized from OpenAI's CLIP-ViT-Large-336, while the OCR ViT is taken from the preceding stage. For the General ViT, only the last three layers are trainable, as this configuration yielded the best results in our experiments. The OCR ViT remains frozen throughout this stage, consistent with the settings used in Vary (Wei et al., 2023). Features from the penultimate layer of both the General and OCR ViT are selected and fed into the projector. The projector is a two-layer MLP, which remains tunable during the pre-training stage. The learning rates for the General ViT and the MLP are set to 2×10^-4 and 2×10^-5, respectively. A warm-up schedule is applied during the first 3% of steps, followed by a cosine decay schedule for the remaining steps.


Setup for the Visual Instruction Tuning Stage

Both the General ViT and OCR ViT remain frozen throughout this stage. The learning rates for the projector and the large language model are both set to 2×10^-5. A warm-up schedule is applied during the first 3% of steps, followed by a cosine decay schedule for the remaining steps.
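
All three stages use the same schedule shape: a linear warm-up over the first 3% of steps followed by cosine decay. A minimal, framework-agnostic sketch of such a schedule is shown below; the step counts and the minimum learning rate are illustrative.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.03, min_lr=0.0):
    """Linear warm-up over the first `warmup_frac` of steps, then cosine decay."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: the projector's peak learning rate of 2e-5 over 10,000 steps.
print(lr_at_step(100, 10_000, 2e-5))    # still warming up
print(lr_at_step(5_000, 10_000, 2e-5))  # roughly halfway through the cosine decay
```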

4.4 Ablation Study and Analysis

Each Component to Build the Strong Baseline

As shown in Table 1, each component introduced in subsection 3.1 contributes steady improvements, and these gains are significant. For instance, after introducing Dynamic High Resolution to split the input image, we observe substantial improvements on OCR-related tasks such as OCRBench, with performance increasing from 56.9% to 60.3%. The use of high-resolution images with Dynamic High Resolution also helps reduce hallucination, primarily due to the increased detail in the high-resolution images. Furthermore, replacing the original Dynamic High Resolution with CATTY yields notable improvements across benchmarks, with OCR-related benchmarks showing larger gains than others, likely because image distortion has a more pronounced negative impact on text within images. Compared to general visual feature extraction, CLIP-ViT's ability to extract text features from images is limited (Radford et al., 2021), as it was trained on a large quantity of general image-text pairs. Consequently, we observe substantial improvements on OCRBench after integrating features from an additional ViT post-trained on text-rich images. Among the five strategies, incorporating more visual instruction tuning datasets via Individual Select yields the most significant improvements. This observation aligns with existing works (Chen et al., 2024c; Li et al., 2024; Tong et al., 2024), underscoring the importance of selecting effective datasets during the visual instruction tuning stage.

Table 1: Ablation of each component used to build the strong baseline (CF = CapFusion, DHR = Dynamic High Resolution, CATTY = Consistent Aspect Ratio Dynamic High Resolution, DVE = Dual Vision Encoder, IS = Individual Select; MMB = MMBench, MV = MathVista, HB = HallusionBench, OCR = OCRBench). Each row adds one more of these components to the vanilla baseline.

| MMB | MV | HB | OCR | AI2D | MMVet | MMStar | MMMU | Overall |
| 72.1 | 43.8 | 35.7 | 48.9 | 73.2 | 40.0 | 51.5 | 36.2 | 50.2 |
| 74.8 | 44.8 | 35.5 | 49.6 | 74.5 | 41.2 | 51.8 | 36.4 | 51.1 |
| 75.1 | 45.1 | 38.0 | 55.6 | 75.0 | 42.2 | 52.6 | 37.5 | 52.6 |
| 75.9 | 45.7 | 39.1 | 56.9 | 75.8 | 43.1 | 52.4 | 37.9 | 53.4 |
| 77.3 | 49.2 | 42.3 | 60.3 | 76.0 | 44.5 | 54.3 | 40.1 | 55.5 |
| 80.1 | 57.4 | 44.2 | 69.2 | 76.4 | 47.2 | 54.5 | 43.3 | 59.0 |

Table 3: Ablation on the size of the vision-language pre-training dataset; ✓ in the perplexity column indicates the subset selected by lowest perplexity.

| #num | perplexity | Overall |
| 5M  |   | 59.0 |
| 20M |   | 58.8 |
| 1M  | ✓ | 59.6 |

Table 4: Dual Vision Encoder (DVE) versus a single vision encoder (SVE) pre-trained with an additional OCR dataset.

| DVE | SVE | OCR | Overall |
| ✓ |   | 69.2 | 59.0 |
|   | ✓ | 67.3 | 57.2 |

Table 5: Model soup over different learning rates (lr) versus over different datasets (ds).

| lr | ds | Model Soup | Overall |
|    |    | baseline | 59.0 |
| ✓  |    | Greedy Soup | 59.2 |
|    | ✓  | Maximum Soup | 61.0 |
|    | ✓  | Average Soup | 61.2 |
|    | ✓  | Greedy Soup | 61.8 |

Pre-train Dataset

As shown in Table 3, scaling up the dataset size (constructed by CapFusion) from 5M to 20M leads to degraded performance, similar to the observations in Liu et al. (2024c). Some works also achieve promising performance using relatively small pre-training datasets rather than a huge amount of data during the pre-training stage (Li et al., 2024; Liu et al., 2024a). We believe the possible reasons are: i) The vision encoder of most existing vision-language models is initialized from a pre-trained model that has already been trained on a large quantity of image-text pairs. It is highly likely that most of the data used in the vision-language pre-training stage has already been seen by the vision encoder, so scaling up the vision-language pre-training dataset brings only marginal or even negative impact. ii) Existing large-scale web-crawled datasets, e.g., LAION-5B and COYO-700M (Byeon et al., 2022), are quite homogeneous. We plot the distribution of the main entity of each image for a subset of LAION-5B in the appendix and find that this distribution is long-tailed and concentrated on a few objects, e.g., person. Thus, indiscriminately pre-training the model on such datasets brings only limited benefits. As shown in the third row of Table 3, we can improve performance by pre-training the model on merely 1M samples, taken from the top 20% of the 5M data in the first row with the lowest perplexity. This result suggests that excessively exposing the model to obscure and scarce knowledge during this stage is detrimental to its learning. Furthermore, compared to fusing features from a separate OCR-enhanced vision encoder, introducing a large OCR dataset directly has two obvious drawbacks: i) during the pre-training stage, the model has to align both general features and OCR-related features, which may cause conflicts (Wei et al., 2023); ii) since the dataset used in vision-language pre-training is relatively small, a large OCR dataset may overwhelm the learning process and hinder the learning of other kinds of knowledge. Thus, introducing features from a separate OCR ViT yields superior performance, as shown in Table 4.

Comparison with proprietary and open-source models on extensive benchmarks (MMB = MMBench, MV = MathVista, HB = HallusionBench, OCR = OCRBench, SCI = ScienceQA, RWQ = RealWorldQA, Wild = LLaVA-Wild).

| Methods | MMB | MV | HB | OCR | AI2D | MMVet | MMStar | MMMU | SCI | MME | RWQ | Wild |
| Proprietary models |
| GPT-4o-0513 | - | 61.3 | 55.0 | 736 | 84.6 | 69.1 | 63.9 | 69.2 | 90.7 | 2310.3 | 75.4 | 102.0 |
| Claude3.5-Sonnet | - | 61.6 | 49.9 | 788 | 80.2 | 66.0 | 62.2 | 65.9 | 88.9 | 1920.0 | 60.1 | 81.0 |
| Gemini-1.5-Pro | - | 57.7 | 45.6 | 754 | 79.1 | 64.0 | 59.1 | 60.6 | 85.7 | 2110.6 | 64.1 | 95.3 |
| Open-source models |
| Cambrian-34B | 81.4 | 50.3 | 41.6 | 591 | 79.5 | 53.2 | 54.2 | 50.4 | 85.6 | 2049.9 | 67.1 | 82.0 |
| Ovis1.5-LLaMA3-8B | - | 63.0 | 45.0 | 744 | 82.5 | 50.9 | 57.3 | 48.3 | 88.8 | 1948.5 | 64.2 | 79.9 |
| InternVL2-8B | - | 58.3 | 45.0 | 794 | 83.6 | 54.3 | 61.5 | 51.2 | 97.1 | 2215.1 | 64.2 | 73.3 |
| IXC-2.5 | - | 63.7 | 43.1 | 686 | 81.6 | 49.3 | 59.9 | 42.9 | 96.6 | 2233.1 | 67.8 | 70.2 |
| OneVision | 80.8 | 63.2 | - | - | 81.4 | 57.5 | 61.7 | 48.8 | 96.0 | 1998.0 | 66.3 | 67.8 |
| Ours |
| POINTS-9B | 83.2 | 60.7 | 48.0 | 70.6 | 78.5 | 50.0 | 56.4 | 46.9 | 92.9 | 2017.8 | 65.9 | 69.3 |

Improving Performance with Model Soup over Different Datasets

As described in previous sections, adding more instruction tuning datasets often reaches a plateau, where further increasing the number of datasets yields minimal improvement. However, by incorporating model soup across different datasets, we observe substantial enhancements, as shown in Table 5, with the overall score increasing from 59.0 to 61.2. We also compare the benefits of various model soup strategies. Among them, greedy soup achieves the best performance, outperforming maximum soup and average soup by 0.6 and 0.4 points, respectively. Unless otherwise specified, we use greedy soup by default in subsequent experiments. Additionally, we include the results of conducting model soup over different hyper-parameters, e.g., different learning rates. As shown in Table 5, model soup over hyper-parameters brings only marginal improvement; detailed results are provided in the appendix. Furthermore, we verify in the appendix that model soup consistently improves performance regardless of the Base Set used.

4.5 Comparison with Other Works

In addition to the eight benchmarks used in the ablation studies above, we further include ScienceQA (Lu et al., 2022), MME (Yin et al., 2023), LLaVA-Wild (Liu et al., 2024b), and RealWorldQA to compare the performance of different models. As shown in the comparison table above, POINTS achieves performance comparable to existing state-of-the-art models of similar size and even surpasses much larger models, such as Cambrian-34B. Additionally, compared to the models listed in the table, POINTS uses a much smaller pre-training dataset (1M samples), fewer visual instruction tuning datasets, and only publicly available data. This makes it more affordable for the community to adopt the strategies proposed in this paper. Furthermore, each aspect of POINTS is clearly presented and thoroughly analyzed, making the effectiveness of every strategy employed in our model evident.

5 Conclusion

Vision-language models have achieved significant progress in recent years. Following this trend (Chen et al., 2024c; Li et al., 2024; Liu et al., 2024b; Zhang et al., 2024; Tong et al., 2024), we first establish a strong baseline by integrating various advancements proposed in recent works (Liu et al., 2024a; Yu et al., 2024; Wei et al., 2023; Liu et al., 2024c) for further experiments. We delve into the details of these advancements and propose effective refinements, such as Consistent Aspect Ratio Dynamic High Resolution, and we conduct extensive experiments to verify the effectiveness of each component in constructing the strong baseline. Secondly, we propose using perplexity to filter the pre-training dataset, retaining only the top 20% of data with the smallest perplexity values during the pre-training stage; this filtering brings significant improvements. Model soup (Wortsman et al., 2022) has shown promising potential to further enhance performance by averaging the weights of models fine-tuned with different hyper-parameters, but we find that conducting model soup over different dataset settings yields even more substantial improvements. In this paper, we therefore propose conducting model soup across models fine-tuned with varying dataset quantities and diversities; the improvements brought by quantity and diversity are orthogonal to each other and combine into joint enhancement.

References

  • Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • Albalak etal. (2024)Alon Albalak, Yanai Elazar, SangMichael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, etal.A survey on data selection for language models.arXiv preprint arXiv:2402.16827, 2024.
  • Anthropic (2024)Anthropic.Claude 3 haiku: Our fastest model yet, 2024.URL https://www.anthropic.com/news/claude-3-haiku.
  • Bai etal. (2023a)Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023a.
  • Bai etal. (2023b)Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou.Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023b.
  • Byeon etal. (2022)Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim.Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022.
  • Cai etal. (2024)Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, etal.Internlm2 technical report.arXiv preprint arXiv:2403.17297, 2024.
  • Chen etal. (2024a)Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, YuQiao, Dahua Lin, etal.Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024a.
  • Chen etal. (2024b)Pengcheng Chen, Jin Ye, Guoan Wang, Yanjun Li, Zhongying Deng, Wei Li, Tianbin Li, Haodong Duan, Ziyan Huang, Yanzhou Su, etal.Gmai-mmbench: A comprehensive multimodal evaluation benchmark towards general medical ai.arXiv preprint arXiv:2408.03361, 2024b.
  • Chen etal. (2024c)Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, etal.How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites.arXiv preprint arXiv:2404.16821, 2024c.
  • Chowdhery etal. (2023)Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, HyungWon Chung, Charles Sutton, Sebastian Gehrmann, etal.Palm: Scaling language modeling with pathways.Journal of Machine Learning Research, 24(240):1–113, 2023.
  • Contributors (2023)OpenCompass Contributors.Opencompass: A universal evaluation platform for foundation models.https://github.com/open-compass/opencompass, 2023.
  • Dong etal. (2024a)Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, YuQiao, Dahua Lin, and Jiaqi Wang.Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model.arXiv preprint arXiv:2401.16420, 2024a.
  • Dong etal. (2024b)Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, YuQiao, Dahua Lin, and Jiaqi Wang.Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd.arXiv preprint arXiv:2404.06512, 2024b.
  • Duan etal. (2024)Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, etal.Vlmevalkit: An open-source toolkit for evaluating large multi-modality models.arXiv preprint arXiv:2407.11691, 2024.
  • Dubey etal. (2024)Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, etal.The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024.
  • Fang etal. (2024)Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen.Mmbench-video: A long-form multi-shot benchmark for holistic video understanding.arXiv preprint arXiv:2406.14515, 2024.
  • Fu etal. (2023)Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, etal.A challenger to gpt-4v? early explorations of gemini in visual expertise.arXiv preprint arXiv:2312.12436, 2023.
  • Gu etal. (2022)Jiaxi Gu, Xiaojun Meng, Guansong Lu, LuHou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, etal.Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark.Advances in Neural Information Processing Systems, 35:26418–26431, 2022.
  • Jiang etal. (2023)AlbertQ Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, DevendraSingh Chaplot, Diego delas Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, etal.Mistral 7b.arXiv preprint arXiv:2310.06825, 2023.
  • Kembhavi etal. (2016)Aniruddha Kembhavi, Michael Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi.A diagram is worth a dozen images.ArXiv, abs/1603.07396, 2016.URL https://api.semanticscholar.org/CorpusID:2682274.
  • Laurençon etal. (2024)Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh.What matters when building vision-language models?arXiv preprint arXiv:2405.02246, 2024.
  • Li etal. (2024)BoLi, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li.Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024.
  • Li etal. (2022)Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi.Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.In International conference on machine learning, pp. 12888–12900. PMLR, 2022.
  • Li etal. (2023)Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia.Mini-gemini: Mining the potential of multi-modality vision language models.arXiv:2403.18814, 2023.
  • Lin etal. (2023)Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, etal.Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models.arXiv preprint arXiv:2311.07575, 2023.
  • Liu etal. (2023a)Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou.Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models.arXiv preprint arXiv:2310.14566, 2023a.
  • Liu etal. (2023b)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee.Improved baselines with visual instruction tuning, 2023b.
  • Liu etal. (2024a)Haotian Liu, Chunyuan Li, Yuheng Li, BoLi, Yuanhan Zhang, Sheng Shen, and YongJae Lee.Llava-next: Improved reasoning, ocr, and world knowledge, January 2024a.URL https://llava-vl.github.io/blog/2024-01-30-llava-next/.
  • Liu etal. (2024b)Haotian Liu, Chunyuan Li, Qingyang Wu, and YongJae Lee.Visual instruction tuning.Advances in neural information processing systems, 36, 2024b.
  • Liu etal. (2023c)Yuan Liu, Haodong Duan, Yuanhan Zhang, BoLi, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, etal.Mmbench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023c.
  • Liu etal. (2024c)Yuan Liu, LeTian, Xiao Zhou, and Jie Zhou.Rethinking overlooked aspects in vision-language models.arXiv preprint arXiv:2405.11850, 2024c.
  • Liu etal. (2023d)Yuliang Liu, Zhang Li, Biao Yang, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, and Xiang Bai.On the hidden mystery of ocr in large multimodal models.arXiv preprint arXiv:2305.07895, 2023d.
  • Lu etal. (2024a)Haoyu Lu, Wen Liu, BoZhang, Bingxuan Wang, Kai Dong, BoLiu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Yaofeng Sun, etal.Deepseek-vl: towards real-world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024a.
  • Lu etal. (2022)Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan.Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • Lu etal. (2023)Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao.Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023.
  • Lu etal. (2024b)Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye.Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024b.
  • Marion etal. (2023)Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, and Sara Hooker.When less is more: Investigating data pruning for pretraining llms at scale.arXiv preprint arXiv:2309.04564, 2023.
  • OpenAI (2022)OpenAI.Chatgpt, 2022.URL https://openai.com/blog/chatgpt.
  • OpenAI (2023)OpenAI.Gpt-4 technical report.OpenAI, 2023.
  • Qiao etal. (2024)Yuxuan Qiao, Haodong Duan, Xinyu Fang, Junming Yang, Lin Chen, Songyang Zhang, Jiaqi Wang, Dahua Lin, and Kai Chen.Prism: A framework for decoupling and assessing the capabilities of vlms.arXiv preprint arXiv:2406.14544, 2024.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Schuhmann etal. (2022)Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, etal.Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • Su etal. (2022)Hui Su, Xiao Zhou, Houjin Yu, Xiaoyu Shen, Yuwen Chen, Zilin Zhu, Yang Yu, and Jie Zhou.Welm: A well-read pre-trained language model for chinese.arXiv preprint arXiv:2209.10372, 2022.
  • Team etal. (2023)Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM Dai, Anja Hauth, etal.Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023.
  • Tong etal. (2024)Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, SaiCharitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, etal.Cambrian-1: A fully open, vision-centric exploration of multimodal llms.arXiv preprint arXiv:2406.16860, 2024.
  • Wei etal. (2023)Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang.Vary: Scaling up the vision vocabulary for large vision-language models.arXiv preprint arXiv:2312.06109, 2023.
  • Wortsman etal. (2022)Mitchell Wortsman, Gabriel Ilharco, SamirYa Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, AriS Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, etal.Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time.In International conference on machine learning, pp. 23965–23998. PMLR, 2022.
  • Yang etal. (2024)AnYang, Baosong Yang, Binyuan Hui, BoZheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, etal.Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024.
  • Yao etal. (2024)Yuan Yao, Tianyu Yu, AoZhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, etal.Minicpm-v: A gpt-4v level mllm on your phone.arXiv preprint arXiv:2408.01800, 2024.
  • Yin etal. (2023)Shukang Yin, Chaoyou Fu, Sirui Zhao, KeLi, Xing Sun, Tong Xu, and Enhong Chen.A survey on multimodal large language models.arXiv preprint arXiv:2306.13549, 2023.
  • Young etal. (2024)Alex Young, Bei Chen, Chao Li, Chengen Huang, GeZhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, etal.Yi: Open foundation models by 01. ai.arXiv preprint arXiv:2403.04652, 2024.
  • Yu etal. (2024)Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, and Jingjing Liu.Capsfusion: Rethinking image-text data at scale.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14022–14032, 2024.
  • Yu etal. (2023)Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang.Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023.
  • Yue etal. (2024)Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, GeZhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, etal.Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9556–9567, 2024.
  • Zhang etal. (2024)Pan Zhang, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Rui Qian, Lin Chen, Qipeng Guo, Haodong Duan, Bin Wang, Linke Ouyang, etal.Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output.arXiv preprint arXiv:2407.03320, 2024.
  • Zhu etal. (2023)Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny.Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023.