Visual Question Answering (VQA)
Based on the hierarchical co-attention model of Lu et al. (2016).
Overview
To correctly answer visual questions about an image, the machine needs to understand both the image and the question. A model that jointly reasons about image and question attention can improve the state of the art on the VQA problem, so I decided to study the paper and experiment with this mechanism myself. This repository implements only the parallel co-attention mechanism, which generates image and question attention simultaneously.
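The core of parallel co-attention is an affinity matrix between question and image features, from which attention weights over image locations and question tokens are computed at the same time. Below is a minimal PyTorch sketch of this step, following the formulation in Lu et al. (2016); the module name, tensor layout, and dimensions are my own assumptions and not necessarily those of this repository's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Minimal sketch of parallel co-attention (Lu et al., 2016).

    V: image features    (batch, d, N) -- N spatial locations
    Q: question features (batch, T, d) -- T tokens (word/phrase/question level)
    Returns attended image and question vectors of shape (batch, d).
    """
    def __init__(self, d, k):
        super().__init__()
        # Plain weight matrices to stay close to the paper's equations
        self.W_b = nn.Parameter(torch.randn(d, d) * 0.01)
        self.W_v = nn.Parameter(torch.randn(k, d) * 0.01)
        self.W_q = nn.Parameter(torch.randn(k, d) * 0.01)
        self.w_hv = nn.Parameter(torch.randn(k, 1) * 0.01)
        self.w_hq = nn.Parameter(torch.randn(k, 1) * 0.01)

    def forward(self, V, Q):
        # Affinity matrix C = tanh(Q W_b V), shape (batch, T, N)
        C = torch.tanh(Q @ self.W_b @ V)
        # Attention maps for each modality, conditioned on the other
        H_v = torch.tanh(self.W_v @ V + (self.W_q @ Q.transpose(1, 2)) @ C)                  # (batch, k, N)
        H_q = torch.tanh(self.W_q @ Q.transpose(1, 2) + (self.W_v @ V) @ C.transpose(1, 2))  # (batch, k, T)
        a_v = F.softmax(self.w_hv.t() @ H_v, dim=-1)   # (batch, 1, N)
        a_q = F.softmax(self.w_hq.t() @ H_q, dim=-1)   # (batch, 1, T)
        # Attended features: weighted sums over image locations / question tokens
        v_hat = (a_v @ V.transpose(1, 2)).squeeze(1)   # (batch, d)
        q_hat = (a_q @ Q).squeeze(1)                   # (batch, d)
        return v_hat, q_hat
```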
Architecture
- STEP 1: Extract image features from a pre-trained CNN (VGG19 is used here).
- STEP 2: Compute word-level, phrase-level, and question-level embeddings of the question
- STEP 3: Calculate co-attended image and question features at all three levels (word, phrase, question)
- STEP 4: Use a multi-layer perceptron (MLP) to recursively encode the attended features and predict the answer (see the sketch after this list)
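For STEP 4, a rough sketch of how the recursive MLP might combine the attended features from the word, phrase, and question levels is shown below (PyTorch assumed; the class name, hidden size, and input layout are illustrative assumptions, not the repository's actual code):

```python
import torch
import torch.nn as nn

class RecursiveAnswerMLP(nn.Module):
    """Sketch of the recursive answer encoder from Lu et al. (2016).

    Each level contributes the sum of its attended question and image
    features (q_hat + v_hat); levels are fused bottom-up:
    word -> phrase -> question, then classified over the answer set.
    """
    def __init__(self, d, hidden, num_answers=1000):
        super().__init__()
        self.fc_w = nn.Linear(d, hidden)            # word level
        self.fc_p = nn.Linear(d + hidden, hidden)   # phrase level, concatenated with h_w
        self.fc_s = nn.Linear(d + hidden, hidden)   # question (sentence) level
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, feats):
        # feats: dict mapping level name -> (q_hat, v_hat), each of shape (batch, d)
        q_w, v_w = feats["word"]
        q_p, v_p = feats["phrase"]
        q_s, v_s = feats["question"]
        h_w = torch.tanh(self.fc_w(q_w + v_w))
        h_p = torch.tanh(self.fc_p(torch.cat([q_p + v_p, h_w], dim=1)))
        h_s = torch.tanh(self.fc_s(torch.cat([q_s + v_s, h_p], dim=1)))
        return self.classifier(h_s)   # logits over the answer vocabulary
```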
Dataset
I evaluate the model on the VQA v2 dataset. The dataset contains 443,757 training questions, 214,354 validation questions, 447,793 test questions, and a total of 6,581,110 question-answer pairs. Answers fall into three categories: yes/no, number, and other. Each question has 10 free-response answers. The paper uses the top 1000 most frequent answers as the possible outputs; this set covers 87.36% of the train+val answers. For testing, I train the model on VQA train+val and report the test-dev and test-standard results from the VQA evaluation server, as in the paper.
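For reference, building the top-1000 answer vocabulary from the train+val annotations could look roughly like this; the helper name and input format are hypothetical and not taken from this repository:

```python
from collections import Counter

def build_answer_vocab(answers, top_k=1000):
    """Keep the top_k most frequent answers as the output classes.

    `answers` is assumed to be an iterable of ground-truth answer strings
    collected from the train+val annotations.
    """
    counts = Counter(a.strip().lower() for a in answers)
    vocab = [a for a, _ in counts.most_common(top_k)]
    answer_to_idx = {a: i for i, a in enumerate(vocab)}
    return vocab, answer_to_idx

# Questions whose answers fall outside this set are typically dropped
# during training, since they cannot be predicted by the classifier.
```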
Results
| Model | Yes/No | Number | Other | All |
|-------|--------|--------|-------|-----|
| VGG | 66.61 | 31.39 | 33.74 | 47.02 |
| ResNet | 69.08 | 34.58 | 38.45 | 50.73 |
Some example predictions on the test-standard split: