Attention and Vision in Language Processing
This volume provides a comprehensive overview of the nature of attentional and visual processes involved in language comprehension. Key concerns include how linguistic and non-linguistic processes jointly determine language comprehension and production and how the linguistic system interfaces with perceptual systems and attention.
Language scientists have traditionally considered language in isolation from other cognitive and perceptual systems such as attention, vision and memory. In recent years, however, it has become increasingly clear that language comprehension must be studied within interaction contexts. The study of multimodal interactions and attentional processes during language processing has thus become an important theoretical focus that guides many research programs in psycholinguistics and related fields.
Ramesh Kumar Mishra has a Ph.D. from the University of Delhi and is currently Associate Professor and Head at the Center for Neural and Cognitive Sciences, University of Hyderabad. He earlier taught at the Centre of Behavioural and Cognitive Sciences, University of Allahabad, India. Dr. Mishra has published widely in psycholinguistics and the cognitive science of language and has edited books on language and cognition. His current research focuses on how language interacts with other important cognitive processes such as attention and vision. Dr. Mishra serves on the editorial boards of journals such as the Journal of Theoretical and Artificial Intelligence and Frontiers in Cognition, and he co-edits the International Journal of Brain, Culture and Cognition.
Falk Huettig is a Senior Investigator at the Max Planck Institute for Psycholinguistics in Nijmegen, Netherlands, and a Visiting Professor at the University of Hyderabad, India. He received a BSc and an MSc from the University of Edinburgh and a PhD in psychology from the University of York, UK. His main research interest is multimodal cognition. His other main interests include the effect of cultural inventions such as reading on general cognition in children, illiterate adults, and individuals with reading impairments, as well as predictive language processing. Falk Huettig is an editorial board member of the Journal of Memory and Language and editor-in-chief of Brain, Cognition and Culture.
In contrast, models such as VisualBERT [3], VL-BERT [7], and UNITER [10] encode both modalities within the same module. For example, VisualBERT combines image regions and language in a single transformer so that self-attention can discover alignments between them. In essence, it adds a visual embedding to the standard BERT architecture. The visual embedding is the sum of a visual feature representation of each image region (produced by an object detector), a segment embedding indicating that the token is visual rather than textual, and a position embedding.
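A minimal sketch of how such a visual embedding can be assembled is shown below; the dimensions, class name, and the assumption that region features come from a 2048-d object detector are illustrative placeholders, not the exact VisualBERT configuration.

```python
import torch
import torch.nn as nn

class VisualEmbedding(nn.Module):
    """Sketch of a VisualBERT-style visual embedding: each detected image
    region is projected into the text embedding space and summed with a
    segment embedding (marking it as visual) and a position embedding."""

    def __init__(self, region_dim=2048, hidden_dim=768, max_regions=36):
        super().__init__()
        self.feature_proj = nn.Linear(region_dim, hidden_dim)         # region feature -> hidden
        self.segment_embed = nn.Embedding(2, hidden_dim)              # 0 = text, 1 = image
        self.position_embed = nn.Embedding(max_regions, hidden_dim)   # region position

    def forward(self, region_features):
        # region_features: (batch, num_regions, region_dim) from an object detector
        _, num_regions, _ = region_features.shape
        positions = torch.arange(num_regions, device=region_features.device)
        segment = torch.ones(num_regions, dtype=torch.long, device=region_features.device)
        return (self.feature_proj(region_features)
                + self.segment_embed(segment)
                + self.position_embed(positions))

# The resulting (batch, num_regions, hidden_dim) tensor is concatenated with the
# word embeddings and fed through the standard BERT transformer stack.
visual_tokens = VisualEmbedding()(torch.randn(2, 36, 2048))
```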
The joint understanding of vision and language has recently been gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements, such as regions and words, proper reduction functions are needed to transform a set of encoded elements into a single response, like a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality, employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices on the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
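The paper's exact cross-modal reduction is not reproduced here, but the sketch below illustrates the general idea of a learnable, attention-style reduction that scores every encoded element and pools the set into a single vector; the class and parameter names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Generic attention-style reduction: score every element of a set of
    encoded features, normalize the scores with a softmax, and take the
    weighted sum. A simplified stand-in for the cross-modal reduction
    described above, not the authors' exact formulation."""

    def __init__(self, dim=512):
        super().__init__()
        self.score_proj = nn.Linear(dim, 1)   # one learnable score per element

    def forward(self, features):
        # features: (batch, num_elements, dim), e.g. region or word encodings
        scores = self.score_proj(features)       # (batch, num_elements, 1)
        weights = torch.softmax(scores, dim=1)   # attention distributed over the set
        return (weights * features).sum(dim=1)   # (batch, dim): a single response vector

# Reduce 36 region encodings and 20 word encodings to one vector each,
# e.g. before computing an image-text similarity score.
pool = AttentivePooling(dim=512)
image_vec = pool(torch.randn(4, 36, 512))
text_vec = pool(torch.randn(4, 20, 512))
```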
Processing images to generate text, such as image captioning and visual question answering, has been studied for years. Traditionally, such systems rely on an object detection network as a vision encoder to capture visual features and then produce text via a text decoder. Given the large amount of existing literature, in this post I would like to focus on only one approach to solving vision-language tasks: extending pre-trained, generalized language models so that they can consume visual signals.
SimVLM (Simple Visual Language Model; Wang et al. 2022) is a simple prefix language model, where the prefix sequence is processed with bi-directional attention like BERT, while the main input sequence only has causal attention like GPT. Images are encoded as prefix tokens so that the model can fully consume the visual information and then generate the associated text in an autoregressive manner.
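Below is a minimal sketch of the PrefixLM attention pattern, not SimVLM's actual implementation: positions inside the prefix (here, the image tokens) attend to each other bidirectionally, while the remaining text positions attend causally.

```python
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Boolean attention mask for a prefix LM: True means 'may attend'.
    Positions < prefix_len (e.g. image tokens) see the whole prefix
    bidirectionally; later positions (text tokens) see the prefix plus
    all earlier text tokens, i.e. causal attention."""
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True   # bidirectional attention within the prefix
    return mask

# 4 image-prefix tokens followed by 6 text tokens.
print(prefix_lm_mask(prefix_len=4, total_len=10).int())
```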
Inspired by prefix or prompt tuning, both Frozen (Tsimpoukelli et al. 2021) and ClipCap (Mokady, Hertz & Bermano, 2021) only update the parameters of the vision module during training to produce image embeddings that can work with a pretrained, frozen language model. Both are trained on aligned image-caption datasets to produce the next token of the caption, conditioned on the image and the previous text tokens. The powerful language capability is retained by freezing the LM parameters. In addition, even though such a setup is trained with limited image-caption data, the models can still rely on the encyclopedic knowledge of the language model at test time.
ClipCap relies on CLIP (Radford et al. 2021) for vision encoding, but the CLIP output needs to be processed by a lightweight mapping network $F$ so that the image embedding vectors are translated into the same semantic space as the pre-trained LM. The network $F$ maps a CLIP embedding into a sequence of $k$ embedding vectors, each with the same dimension as a word embedding in GPT2. Increasing the prefix size $k$ helps improve performance. Both the CLIP vision encoder and the LM are frozen during training, and only the mapping network $F$ is learned. They found that when the LM is frozen, $F$ should be a transformer with 8 multi-head self-attention layers of 8 heads each, but when the LM can be fine-tuned, an MLP is enough.
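Here is a sketch of the simpler MLP variant of the mapping network $F$. The 512-d CLIP embedding, 768-d GPT2 word embedding, prefix length of 10, and the hidden-layer size are assumptions for illustration; this is not the released ClipCap code.

```python
import torch
import torch.nn as nn

class ClipCapStyleMapper(nn.Module):
    """MLP variant of the mapping network F: turns one CLIP image embedding
    into k prefix vectors living in the word-embedding space of the LM."""

    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        hidden = (clip_dim + lm_dim * prefix_len) // 2
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, lm_dim * prefix_len),
        )

    def forward(self, clip_embedding):
        # clip_embedding: (batch, clip_dim) from the frozen CLIP vision encoder
        prefix = self.mlp(clip_embedding)                      # (batch, lm_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.lm_dim)   # (batch, k, lm_dim)

# The k prefix vectors are prepended to the caption's word embeddings and the
# frozen GPT2 decodes the caption; only the mapper's parameters receive gradients.
prefix_tokens = ClipCapStyleMapper()(torch.randn(2, 512))
```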
To fuse visual information into different layers of the language model more efficiently, we can consider a specially designed cross-attention fusion mechanism that balances text generation capability against the incoming visual information.
Each video $\mathcal{V}$ is split into multiple segments $\{\boldsymbol{s}_t\}$, with each segment $\boldsymbol{s}_t$ containing an image frame $\mathbf{I}_t$ extracted from the middle timestep and $L=32$ associated word tokens. Images are encoded by a learned image encoder and words are encoded using a learned embedding. Then both are encoded together within a joint vision-language transformer.
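A small sketch of how such segments might be assembled is given below; the `Segment` container and the assumption that frames and transcript tokens are roughly aligned in time are illustrative, not the exact preprocessing pipeline of any particular model.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Segment:
    """One video segment s_t: the frame from the middle of the span plus its words."""
    middle_frame: Any       # placeholder for an image array
    token_ids: List[int]    # L = 32 word tokens aligned with this span

def split_video(frames: List[Any], token_ids: List[int], tokens_per_segment: int = 32) -> List[Segment]:
    # Divide the transcript into chunks of 32 tokens and pair each chunk with the
    # frame taken from the middle of the corresponding time span.
    num_segments = max(1, len(token_ids) // tokens_per_segment)
    frames_per_segment = max(1, len(frames) // num_segments)
    segments = []
    for t in range(num_segments):
        chunk = token_ids[t * tokens_per_segment:(t + 1) * tokens_per_segment]
        middle_idx = min(len(frames) - 1, t * frames_per_segment + frames_per_segment // 2)
        segments.append(Segment(middle_frame=frames[middle_idx], token_ids=chunk))
    return segments

# A 10-frame clip with a 96-token transcript becomes 3 segments of 32 tokens each.
print(len(split_video(frames=list(range(10)), token_ids=list(range(96)))))
```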
Flamingo (Alayrac et al. 2022) is a visual language model that accepts text interleaved with images/videos and outputs free-form text. Flamingo connects a pretrained LM and a pretrained vision encoder (i.e. a CLIP image encoder) via a transformer-based mapper. To incorporate vision signals more efficiently, Flamingo adopts a Perceiver-based architecture that produces a few hundred visual tokens from a large number of visual input features and then uses cross-attention layers interleaved with the LM layers to fuse visual information into the language decoding process. The training objective is an autoregressive NLL loss.
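The following is a sketch of a Perceiver-style resampler, illustrating the idea of compressing an arbitrary number of visual features into a fixed number of tokens via learned latent queries; the dimensions and the single-block structure are simplifying assumptions rather than Flamingo's exact Perceiver Resampler.

```python
import torch
import torch.nn as nn

class PerceiverStyleResampler(nn.Module):
    """A small, fixed set of learned latent queries cross-attends to an
    arbitrary number of visual features and returns a fixed number of
    visual tokens for the language model to consume."""

    def __init__(self, dim=1024, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual_features):
        # visual_features: (batch, num_features, dim), e.g. many frame/patch features
        queries = self.latents.unsqueeze(0).expand(visual_features.shape[0], -1, -1)
        attended, _ = self.cross_attn(queries, visual_features, visual_features)
        return attended + self.ffw(attended)   # (batch, num_latents, dim), fixed size

# Two videos with 1960 patch features each are compressed to 64 visual tokens each.
visual_tokens = PerceiverStyleResampler()(torch.randn(2, 1960, 1024))
```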
Similar to ClipCap, both pretrained models are frozen during training, and thus Flamingo is only trained to harmoniously connect existing, powerful language and vision models together. The main difference between ClipCap and Flamingo is that the former treats the image embedding as a simple prefix for the LM, while the latter uses gated cross-attention-dense layers to fuse the image information. In addition, Flamingo incorporates a lot more training data than ClipCap.
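Below is a sketch of a gated cross-attention-dense block of the kind described above; the dimensions are placeholders and the block is simplified (no masking over interleaved images), but the tanh gating initialized at zero follows the idea reported for Flamingo.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Inserted between frozen LM layers: text tokens cross-attend to visual
    tokens, and tanh gates (initialized at zero) control how much visual
    information is mixed in, so the frozen LM is unchanged at the start of training."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.attn_gate = nn.Parameter(torch.zeros(1))   # tanh(0) = 0: no visual signal at init
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (batch, text_len, dim); visual_tokens: (batch, vis_len, dim)
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        x = text_tokens + torch.tanh(self.attn_gate) * attended
        x = x + torch.tanh(self.ffw_gate) * self.ffw(x)
        return x   # handed on to the next (frozen) LM self-attention layer

# Fuse 64 visual tokens into a 20-token text sequence.
fused = GatedCrossAttentionBlock()(torch.randn(2, 20, 768), torch.randn(2, 64, 768))
```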
Transformer models have become the de facto standard in Natural Language Processing (NLP). For example, the popular ChatGPT AI chatbot is a transformer-based language model. Specifically, it is based on the GPT (Generative Pre-trained Transformer) architecture, which uses self-attention mechanisms to model the dependencies between words in a text.
While the Transformer architecture has become the standard for tasks involving Natural Language Processing (NLP), its applications in Computer Vision (CV) remain comparatively few. In many computer vision tasks, attention is either used in conjunction with convolutional networks (CNNs) or used to replace certain components of convolutional networks while keeping their overall structure intact. Popular image recognition architectures include ResNet, VGG, YOLOv3, and YOLOv7.
In the following, we highlight some of the most significant vision transformers that have been developed over the years. They are based on the transformer architecture, which was originally proposed for natural language processing (NLP) in 2017.
A transformer in machine learning is a deep learning model that uses attention mechanisms to differentially weight the significance of each part of the input sequence. Transformers are composed of multiple self-attention layers and are primarily used in the AI subfields of natural language processing (NLP) and computer vision (CV).
However, recent advances in the transformer architecture, which was originally introduced for natural language processing (NLP), have shown great promise in achieving competitive results on image classification tasks.
An example is CrossViT, a cross-attention Vision Transformer for Image Classification. Computer vision research indicates that when pre-trained with a sufficient amount of data, ViT models are at least as robust as ResNet models.
Hence, self-attention is a computational primitive that quantifies pairwise entity interactions and thereby helps a network learn the hierarchies and alignments present in its input data. Attention has proven to be a key element for vision networks to achieve higher robustness.
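To make the primitive concrete, here is a minimal scaled dot-product self-attention computation in plain PyTorch; the projection matrices are random placeholders rather than trained weights.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, dim):
    every element is compared with every other element, and the resulting pairwise
    weights are used to mix the value vectors into contextualized outputs."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise interaction strengths
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ v                        # (seq_len, dim) contextualized outputs

# Five tokens (or image patches) with 16-dimensional embeddings.
dim = 16
x = torch.randn(5, dim)
out = self_attention(x, torch.randn(dim, dim), torch.randn(dim, dim), torch.randn(dim, dim))
```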