The Data Daily

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

by Junke Wang, et al.

This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose decoupled joint pretraining of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions, obtaining a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.
1 Introduction
Vision-language pretraining has been demonstrated to be a promising direction for building foundation models that can support a broad range of downstream AI tasks. By pretraining on web-scale noisy image-text data, the pioneering works Radford et al. (2021); Hu et al. (2021); Yuan et al. (2021) suggest that a unified model can be equipped with unprecedented capabilities (e.g., zero-shot classification) and achieve outstanding performance on various tasks, thus significantly reducing the cost of designing task-specific models. Following this thread, further works Singh et al. (2021); Wang et al. (2022a); Alayrac et al. (2022); Li et al. (2022); Zhu et al. (2021) have been proposed to support more tasks. There are also efforts Fu et al. (2021); Zellers et al. (2021) studying video-language pretraining to solve video-related multi-modal tasks.
Table 1: A system-level comparison between OmniVL and existing vision-language pretraining and foundation models. "IL" and "VL" denote image-language and video-language pretraining, "Non-Gen" denotes non-generative tasks (e.g., visual-only classification, cross-modal alignment), while "Gen" denotes multi-modal generation tasks (e.g., image/video question answering, captioning). "I-L, V-L" and "I-T, V-T" denote image/video-label and image/video-text data, respectively.
In this paper, we take a step forward and aim to design an omni-vision-language foundation model, OmniVL, to support both image-language and video-language pretraining and the corresponding downstream tasks¹, including visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning) simultaneously. To the best of our knowledge, this is the first time one model has been shown to benefit both image and video tasks bidirectionally, as opposed to the conventional single-directional way, i.e., using image (/image-language) to help video (/video-language).

¹Here we regard models like Yuan et al. (2021); Yu et al. (2022) as image-language only, as they only pretrain on image-language and either naively regard video as independent frames without temporal modeling or need heavy adaptation to video.
To support both image and video inputs, OmniVL adopts a unified transformer-based visual encoder to extract visual representations, where video inputs share most transformer layers with images except for the 3D patch tokenizer and temporal attention blocks Bertasius et al. (2021). Similar to existing vision-language models, OmniVL has a separate text encoder to extract language representations. To support multi-task learning within the same architecture, OmniVL follows an encoder-decoder structure with two visual-grounded decoders: one designed with bidirectional attention for visual-text semantic alignment, and the other equipped with causal attention for text generation. We pretrain OmniVL with image-language and video-language data in a decoupled joint way, which differs from existing works Li et al. (2022); Yu et al. (2022); Yuan et al. (2021); Radford et al. (2021); Zellers et al. (2021) that apply image-language-only pretraining, video-language-only pretraining, or their joint pretraining from scratch. More specifically, we first pretrain on image-language data to focus on spatial representation learning, and then conduct joint pretraining with video-language data to learn the temporal dynamics incrementally while preserving and polishing the well-learned spatial representations. We believe this not only makes the learning more efficient, moving from spatial to temporal dimensions, but also makes the two stages complementary to each other. This bidirectional help has not been unraveled in prior works, and is important in pushing one foundation model to boost performance on both image and video tasks.
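The decoupled schedule can be sketched as a simple data plan (a hypothetical illustration; the function and batch names are ours, not from the paper, and the actual mixing ratio between image and video batches in the joint stage is a training detail the text does not specify):

```python
def decoupled_joint_pretrain(image_batches, video_batches, stage1_steps, stage2_steps):
    """Return a (modality, batch) schedule: image-only first, then joint.

    Stage 1 uses image-text batches only (spatial learning); stage 2
    continues from the same weights and alternates image and video batches
    (incremental temporal learning).
    """
    schedule = []
    for step in range(stage1_steps):  # stage 1: spatial learning on image-text
        schedule.append(("image", image_batches[step % len(image_batches)]))
    for step in range(stage2_steps):  # stage 2: joint image + video pretraining
        if step % 2 == 0:
            schedule.append(("image", image_batches[(step // 2) % len(image_batches)]))
        else:
            schedule.append(("video", video_batches[(step // 2) % len(video_batches)]))
    return schedule

plan = decoupled_joint_pretrain(["img_a", "img_b"], ["vid_a"],
                                stage1_steps=2, stage2_steps=4)
assert all(m == "image" for m, _ in plan[:2])  # stage 1 is image-only
```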
Moreover, OmniVL is motivated by the unified contrastive learning Yang et al. (2022) used in Florence Yuan et al. (2021), and extends its scope to cover video-text and video-label (e.g., video action recognition) data. The underlying consideration lies in two aspects: 1) as mentioned above, we aim to leverage as much supervised (or noisily supervised) pretraining corpus as possible; 2) as shown in Yang et al. (2022), manually-labeled data (e.g., ImageNet Deng et al. (2009)) can help to derive more discriminative representations and benefit transfer learning tasks (e.g., image classification), while webly-crawled vision-language data cover broader visual concepts and benefit cross-modal and multi-modal tasks. This simple extension allows us to enjoy both advantages.
We call our foundation model OmniVL, since it unifies three dimensions: modality (i.e., image-language and video-language pretraining), functionality (i.e., non-generative and generative tasks), and data (i.e., image-text, video-text, image-label, and video-label data), as demonstrated in Table 1. With similar model size and data scale, OmniVL achieves new state-of-the-art or at least competitive results on a wide scope of downstream tasks. For example, when using a ViT-Base scale model pretrained on a moderate data scale (e.g., ∼14M image-text and ∼2.5M video-text pairs), we achieve state-of-the-art performance on image-text retrieval (82.1/64.8 R@1 on COCO for image-to-text/text-to-image), image captioning (39.8 BLEU@4 on COCO), text-to-video retrieval (47.8 R@1 on MSRVTT), and video question answering (51.9% accuracy on MSVD).
2 Related Work
Vision-Only Pretraining.
Large-scale pretraining plays a key role in the recent success of deep neural networks. In the computer vision field, supervised pretraining He et al. (2016, 2019); Kolesnikov et al. (2020); Dong et al. (2022) is the most classical setting. Recently, BiT Kolesnikov et al. (2020) showed that supervised pretraining on larger-scale datasets with larger models offers better transfer ability. In parallel, self-supervised pretraining has also been extensively studied in the literature; dominant methods include contrastive learning approaches Chen et al. (2020a); Li et al. (2021c); He et al. (2020) and BERT-style pretraining strategies Dong et al. (2021); Bao et al. (2021); Wang et al. (2022b). Despite their great success, these methods focus on unimodal pretraining and fail to support cross-modal or multi-modal tasks.
Vision-Language Pretraining. Vision-language pretraining (VLP) Lu et al. (2019); Tan and Bansal (2019); Sun et al. (2019); Chen et al. (2020b); Su et al. (2020); Radford et al. (2021); Jia et al. (2021) has attracted surging attention in the vision-language community; it aims to learn generic multi-modal representations to solve various tasks, e.g., image captioning, image-text retrieval, and video question answering. Depending on the modality of the input data and the targeted downstream tasks, existing VLP approaches can be roughly divided into two categories: image-language pretraining methods Chen et al. (2020b); Li et al. (2020c); Zhou et al. (2020); Wang et al. (2022c), which learn a joint distribution over visual and linguistic representations from image-text pairs, and video-language methods Li et al. (2020a); Lei et al. (2021); Li et al. (2021a); Fu et al. (2021); Bain et al. (2021); Miech et al. (2020); Alayrac et al. (2020); Akbari et al. (2021); Patrick et al. (2020), which model the semantic associations between video frames and texts from video-text pairs. Among them, some recent works Bain et al. (2021); Fu et al. (2021) also explore image-language and video-language joint pretraining to improve video-language tasks. Instead, OmniVL aims to integrate image-language and video-language within one foundation model. Moreover, inspired by the observation in BEVT Wang et al. (2022b) that decoupling spatial and temporal learning is better than direct joint spatial-temporal learning, we introduce a decoupled joint pretraining paradigm, which first learns spatial visual representations with image-language data and then conducts joint pretraining. With such a design, we demonstrate for the first time that the two can help each other bidirectionally. Moreover, as a foundation model, we enable more unification in terms of functionality and pretraining corpus.
Vision Foundation Models. Automating the understanding of our multi-modal world with machines requires the development of foundation models that work across different modalities and domains Bommasani et al. (2021); Lu et al. (2021). CLIP Radford et al. (2021) and ALIGN Jia et al. (2021) are typically regarded as the pioneering explorations of foundation models. By pretraining on web-scale noisy image-text pair data, they excel at cross-modal alignment and zero-shot classification tasks. Florence Yuan et al. (2021) further extends the scope of foundation models to cover the Space-Time-Modality space and performs better, especially on vision-only tasks, with unified contrastive learning. Despite their success, all the above approaches do not naturally support multi-modal generation tasks (e.g., visual question answering and captioning). To address this limitation, some recent works like FLAVA Singh et al. (2021) and CoCa Yu et al. (2022) design one image-language foundation model to support both cross-modal alignment tasks and multi-modal generation tasks. While such image-language foundation models can be extended to support video-language tasks in the fine-tuning stage, they either need heavy task-specific adaptors or simply treat a video as independent frames. In contrast, OmniVL is designed to support both image-language and video-language starting from the pretraining stage, without any extra adaptors.
3 Methodology
3.1 Overall Framework
The overall framework of OmniVL is illustrated in Figure 1; it follows an encoder-decoder-like structure. OmniVL consists of a unified visual encoder to extract the representations for both images and videos, a text encoder to obtain text representations, and two visual-grounded decoders for semantic alignment and open-ended text generation, respectively. Below we briefly introduce each component and leave the detailed structure to the supplementary material.
Unified Visual Encoder. We unify images and videos in a transformer-based visual encoder by converting both of them into a series of tokens, where independent 2D/3D convolution-based patch tokenizers are used for images and videos, respectively. Accordingly, spatial and temporal positional encodings are added to the input tokens to incorporate positional information. For the transformer structure, we follow TimeSformer Bertasius et al. (2021) and employ decoupled spatial-temporal attention, which individually models the static spatial appearance and the temporal dynamics in visual data. Specifically, within each transformer block, we sequentially perform temporal self-attention and spatial self-attention. The temporal self-attention blocks are automatically skipped for image inputs. The final visual representation v_cls is obtained from the [CLS] token of the last block. Note that we share the model weights for image and video inputs except for the temporal self-attention.
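A rough numpy sketch of this decoupled attention order (single-head, no projections or residuals; shapes and names are illustrative and not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    d = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def decoupled_block(tokens):
    """tokens: (T, S, D) -- T frames, S spatial patches, D channels.

    Temporal self-attention first (each patch attends across frames),
    then spatial self-attention (each frame attends across patches).
    For images (T == 1) the temporal step is skipped, mirroring the
    weight sharing between image and video inputs.
    """
    T, S, D = tokens.shape
    if T > 1:  # temporal self-attention, automatically skipped for images
        t_in = tokens.transpose(1, 0, 2)                 # (S, T, D)
        tokens = self_attention(t_in).transpose(1, 0, 2)  # back to (T, S, D)
    return self_attention(tokens)                         # spatial self-attention

video = np.random.randn(4, 16, 8)  # 4 frames, 16 patches, dim 8
image = np.random.randn(1, 16, 8)  # single frame: temporal attention skipped
assert decoupled_block(video).shape == (4, 16, 8)
assert decoupled_block(image).shape == (1, 16, 8)
```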
Figure 1: An overview of OmniVL. We unify the pretraining corpus (human-annotated data and webly-crawled data), modality (image, video, and language), and functionality (multi-modal understanding and generation tasks, visual classification tasks) in one universal framework.
Text Encoder. We adopt BERT Devlin et al. (2019) as the text encoder, which transforms the input text into a sequence of token embeddings. The embedding of the [CLS] token, w_cls, is used as the language representation.
Visual-grounded Alignment Decoder. Even though the above unimodal encoders can support cross-modal alignment like CLIP Radford et al. (2021), we employ an extra visual-grounded alignment decoder to further facilitate the learning and enhance alignment accuracy, as in Li et al. (2022); Fu et al. (2021). It takes the text and the output visual features from the unified visual encoder as input, and fuses the information of both modalities with stacked transformer blocks. Each block contains a self-attention layer, a cross-attention layer, and a feed-forward layer. Additionally, a task-specific [ENC] token is added to the input text, whose output embedding is used as the fused cross-modal representation.
Visual-grounded Generation Decoder. We empower our model with multi-modal generation capability by attaching a visual-grounded text generation decoder. It adopts a similar architecture to the above alignment decoder, but replaces the bidirectional self-attention with causal self-attention. A [DEC] token and an [EOS] token are added to indicate the task type and to signal the end of generation, respectively.
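The structural difference between the two decoders' self-attention can be illustrated with their masks (a minimal sketch; the additive-mask convention is a common implementation choice, not taken from the paper):

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Additive attention mask: 0 where attention is allowed, -inf where blocked.

    The alignment decoder uses a bidirectional mask (all zeros), while the
    generation decoder uses a causal mask so each token only attends to
    itself and earlier tokens.
    """
    if not causal:
        return np.zeros((seq_len, seq_len))
    mask = np.full((seq_len, seq_len), -np.inf)
    mask[np.tril_indices(seq_len)] = 0.0  # lower triangle (incl. diagonal) allowed
    return mask

causal = attention_mask(4, causal=True)
assert causal[0, 1] == -np.inf                     # token 0 cannot see token 1
assert causal[2, 1] == 0.0                         # token 2 can see token 1
assert (attention_mask(4, causal=False) == 0).all()  # bidirectional: no blocking
```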
3.2 Pre-training Objectives
We jointly optimize OmniVL with the following three objectives:
Unified Vision-Language Contrastive (UniVLC) Loss. UniCL Yang et al. (2022) introduces a novel paradigm for visual representation learning by unifying supervised learning from image-label data and contrastive learning from natural language supervision. In this paper, we extend its scope to the unified visual domain, which incorporates both image and video data for cross-modal pretraining via a joint visual-label-text space.
More specifically, we define manually-annotated image/video-label data and web-crawled image/video-text data in a triplet format S = (x, y, t), where x ∈ X is the image/video data, y ∈ Y is the unique label indicating the index of the grouped language description in the whole pretraining dataset, and t ∈ T is its corresponding language description. For image/video-label data, t is generated with the same prompt strategy used in CLIP Radford et al. (2021) and ActionCLIP Wang et al. (2021a) (i.e., filling the class names into prompt templates). Note that in this joint visual-label-text space, visual data from a manually-annotated dataset belonging to the same category share a common textual description.
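A small Python sketch of how such triplets could be assembled (the helper, template string, and toy data are hypothetical; they only illustrate that same-class label data share one label and text, while web-crawled pairs each get a unique label):

```python
def build_triplets(label_data, text_data, template="a photo of a {}."):
    """Build S = (x, y, t) triplets over a joint visual-label-text space.

    label_data: (visual, class_name) pairs from manually-annotated datasets.
    text_data:  (visual, caption) pairs from web-crawled corpora.
    """
    triplets, class_to_label = [], {}
    next_label = 0
    for x, class_name in label_data:
        if class_name not in class_to_label:  # same class -> shared label and text
            class_to_label[class_name] = next_label
            next_label += 1
        triplets.append((x, class_to_label[class_name], template.format(class_name)))
    for x, caption in text_data:              # web pairs: each gets a fresh label
        triplets.append((x, next_label, caption))
        next_label += 1
    return triplets

trips = build_triplets(
    label_data=[("img0", "dog"), ("img1", "dog"), ("img2", "cat")],
    text_data=[("img3", "a puppy runs on the beach")],
)
assert trips[0][1] == trips[1][1]          # same class shares one label
assert trips[0][2] == "a photo of a dog."  # prompted caption for label data
assert trips[3][1] == 2                    # web pair gets its own label
```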
Based on this, given the visual embedding of image/video x_i and the language embedding of its text t_i in a batch B, we follow CLIP to apply a linear projection and normalization layer on them to obtain the latent visual vector v_i and text vector w_i. To enjoy a large effective batch size for contrastive learning, we maintain three memory banks as in He et al. (2020); Li et al. (2021b) to store the most recent M visual vectors {v_m}_{m=1}^M and text vectors {w_m}_{m=1}^M from the momentum encoders, together with the corresponding labels {y_m}_{m=1}^M. Then we calculate the vision-to-text and text-to-vision contrastive losses as:

L_v2t = −∑_{i∈B} (1/|P(i)|) ∑_{k∈P(i)} log [ exp(v_i·w_k/τ) / ∑_{m=1}^M exp(v_i·w_m/τ) ],
L_t2v = −∑_{i∈B} (1/|P(i)|) ∑_{k∈P(i)} log [ exp(w_i·v_k/τ) / ∑_{m=1}^M exp(w_i·v_m/τ) ],

where P(i) = {k | k ∈ M, y_k = y_i} is the set of positives for sample i, and τ is a learnable temperature parameter. Finally, the unified vision-language contrastive loss is defined as:

L_UniVLC(θ_ve, θ_te) = L_v2t + L_t2v,

where θ_ve and θ_te denote the parameters of the unified visual encoder and the text encoder.
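A minimal numpy sketch of this label-aware contrastive objective, computed in-batch for brevity (the momentum encoders and M-sized memory banks are omitted, so the normalization runs over the batch; all names are illustrative):

```python
import numpy as np

def univlc_loss(v, w, labels, tau=0.07):
    """Label-aware contrastive loss (in-batch sketch of UniVLC).

    v, w: L2-normalized visual / text vectors, shape (N, D).
    labels: (N,) -- the positives P(i) are all j with labels[j] == labels[i],
    so prompted samples of the same class count as positives for each other.
    """
    sim = v @ w.T / tau  # (N, N) similarity logits
    log_p_v2t = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # row softmax
    log_p_t2v = sim - np.log(np.exp(sim).sum(axis=0, keepdims=True))  # column softmax
    pos = labels[:, None] == labels[None, :]  # positive-pair mask
    l_v2t = -(log_p_v2t * pos).sum(axis=1) / pos.sum(axis=1)
    l_t2v = -(log_p_t2v * pos).sum(axis=0) / pos.sum(axis=0)
    return l_v2t.mean() + l_t2v.mean()

rng = np.random.default_rng(0)
def l2norm(x): return x / np.linalg.norm(x, axis=1, keepdims=True)
v, w = l2norm(rng.normal(size=(6, 16))), l2norm(rng.normal(size=(6, 16)))
labels = np.array([0, 0, 1, 2, 3, 4])  # first two samples share a class
loss = univlc_loss(v, w, labels)
assert np.isfinite(loss) and loss > 0
```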
Vision-Language Matching (VLM) Loss. The VLM loss encourages the model to learn aligned visual and text representations. Specifically, we randomly replace the text t_i for x_i with a text t_j from a different image/video in the same batch B, and feed them into the unified visual encoder and the visual-grounded alignment decoder, respectively. A linear layer is then applied to the output of the visual-grounded alignment decoder to produce a two-class probability p_vlm, which measures whether the input pair is matched. Finally, we optimize the parameters of the unified visual encoder θ_ve and of the visual-grounded alignment decoder θ_ad with the VLM loss:

L_VLM(θ_ve, θ_ad) = −∑_{i∈B} [ y_vlm log p_vlm + (1 − y_vlm) log(1 − p_vlm) ],

where y_vlm = 1 if j ∈ B and y_j = y_i, and y_vlm = 0 otherwise.
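The matching objective reduces to a cross-entropy over the decoder's two-class head; a toy numpy sketch (names and probability values are illustrative):

```python
import numpy as np

def vlm_loss(p_match, is_match):
    """Cross-entropy for vision-language matching (sketch).

    p_match: (N, 2) probabilities from the alignment decoder's linear head,
    where class 1 means the visual input and text form a true pair.
    is_match: (N,) in {0, 1}; 0 marks the randomly swapped negative texts.
    """
    return -np.log(p_match[np.arange(len(is_match)), is_match]).mean()

p = np.array([[0.1, 0.9],    # confidently "matched"
              [0.8, 0.2]])   # confidently "not matched"
y = np.array([1, 0])
assert vlm_loss(p, y) < 0.25      # correct predictions -> small loss
assert vlm_loss(p, 1 - y) > 1.0   # flipped labels -> large loss
```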
Language Modeling (LM) Loss. Previous works indicate that LM facilitates the model to develop better text-induced generalization ability Wang et al. (2022c). Therefore, we optimize the output of the visual-grounded generation decoder with a cross-entropy loss, which directly maximizes the likelihood of the input text sequence in an autoregressive manner:

L_LM = −∑_{i∈B} ∑_l log P(t_{i,l} | t_{i,<l}, x_i),

where t_{i,l} denotes the l-th token of text t_i.
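The autoregressive objective is a per-position negative log-likelihood of the next token; a toy numpy sketch (the visual conditioning through cross-attention is omitted; names are illustrative):

```python
import numpy as np

def lm_loss(logits, targets):
    """Autoregressive language-modeling loss (sketch).

    logits: (L, V) -- decoder outputs, one next-token distribution per position.
    targets: (L,) -- ground-truth next-token ids.
    Returns the mean negative log-likelihood of the target sequence.
    """
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

V = 5
targets = np.array([2, 0, 4])
good = np.full((3, V), -10.0)
good[np.arange(3), targets] = 10.0     # sharply peaked on the correct tokens
assert lm_loss(good, targets) < 0.01   # near-perfect prediction -> near-zero loss
assert lm_loss(np.zeros((3, V)), targets) > 1.0  # uniform logits -> loss = log(5)
```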
