Google publishes Wikipedia-based image text dataset (WIT)



Google recently released the Wikipedia-Based Image Text (WIT) dataset, a large multimodal dataset created by extracting multiple text selections associated with each image from Wikipedia articles and Wikimedia image links. Rigorous filtering was then applied to retain only high-quality image-text sets.

The WIT dataset is available for download on GitHub.

As part of its initiatives to fill knowledge gaps in Wikipedia, Wikimedia Research, in partnership with Google and external collaborators such as EPFL, Naver Labs Europe and Hugging Face, is organizing a competition on Kaggle using the WIT dataset.

Check out the Wikipedia – Image / Caption Match Challenge details here.

Meeting the challenge of real-world datasets

To model the relationship between text and images, multimodal visio-linguistic models rely on rich datasets. Traditionally, these datasets have been created either by manually captioning images or by crawling the web and extracting the alt text as the caption.

The first approach tends to produce higher-quality data, but the intensive manual annotation limits the amount of data that can be created. The automated extraction approach leads to larger datasets, but these require either careful heuristics and filtering to ensure data quality, or scaled-up models to achieve strong performance.

Another shortcoming of existing datasets is their limited coverage of languages other than English. To address these issues, Google researchers developed the WIT dataset, with the goal of creating a large, high-quality, multilingual dataset with a variety of content.

WIT vs other datasets

As explained in “WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning,” presented at SIGIR 2021, the effort resulted in a curated set of 37.5 million entity-rich image-text examples with 11.5 million unique images across 108 languages. SIGIR is a leading scientific conference in the field of information retrieval.

WIT has increased language coverage and large size compared to previous datasets.

(Source: Google)

Here are some of the advantages of the WIT dataset:

  • Size: WIT is one of the largest publicly available multimodal datasets of image-text examples.
  • Multilingual: With 108 languages, WIT covers 10 times or more languages than any other dataset.
  • Contextual information: WIT includes a lot of contextual information at the page and section level, unlike typical multimodal datasets, which only have one caption per image.
  • Real world entities: As a large knowledge base, Wikipedia is rich in real world entities that are represented in WIT.
  • Challenging test set: All state-of-the-art (SOTA) models demonstrated significantly lower performance on WIT than on standard evaluation sets (for example, a drop of about 30 points in recall).
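
For readers unfamiliar with how retrieval quality is usually scored, the following is a minimal sketch of a recall@k computation. It assumes each image has exactly one correct caption, which the article does not state; the function names and toy data are illustrative only and are not taken from the WIT paper.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """Return 1.0 if the correct caption is among the top-k ranked ids, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(rankings, relevant, k):
    """Average recall@k over all queries (one ranking per image)."""
    scores = [recall_at_k(r, rel, k) for r, rel in zip(rankings, relevant)]
    return sum(scores) / len(scores)


# Toy example: three images, each with a ranked list of candidate caption ids.
rankings = [
    ["c1", "c2", "c3"],  # correct caption c1 ranked first
    ["c3", "c2", "c1"],  # correct caption c1 ranked last
    ["c2", "c3", "c1"],  # correct caption c3 ranked second
]
relevant = ["c1", "c1", "c3"]

for k in (1, 2, 3):
    print(f"recall@{k} = {mean_recall_at_k(rankings, relevant, k):.3f}")
```

A "30 point drop" in this sense would mean that a model's recall@k falls by roughly 0.30 when moving from a standard evaluation set to WIT.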

The ideation of WIT

Google researchers said the main goal was to create a large dataset without compromising on quality or on the coverage of concepts. They therefore started with the largest online encyclopedia available today: Wikipedia.

As an example, consider the Wikipedia page for “Half Dome (Yosemite National Park, CA).” As shown below, the article contains a variety of text captions and contextual information relevant to the image, including the page title, the main page description, and other contextual information and metadata.

The Wikipedia page for Half Dome (Source: Google / Wikipedia page for Half Dome; photo by DAVID ILIFF)

This is how they did it

The researchers said they started by selecting Wikipedia pages that contain images, then extracted various image-text associations and their surrounding contexts. To refine the data further, they applied a rigorous filtering process to ensure data quality. This included text-based filtering to ensure caption availability, length and quality (for example, removing generic default filler text); image-based filtering to ensure that each image is of a certain size and carries an allowable license; and image-and-text-based filtering to ensure suitability for research (such as excluding examples classified as hate speech).
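
The three filtering stages described above could be sketched roughly as follows. This is an illustration only, not the researchers' pipeline: the record fields, the thresholds, the list of filler captions and the license names are all assumptions made for the example, and the `flagged` field stands in for the output of some content classifier.

```python
from dataclasses import dataclass


@dataclass
class Example:
    caption: str
    width: int
    height: int
    license: str
    flagged: bool  # hypothetical: True if a classifier marked it unsuitable


# All constants below are assumed values for illustration.
FILLER_CAPTIONS = {"click to enlarge image", "refer to the caption"}
ALLOWED_LICENSES = {"cc-by", "cc-by-sa", "public-domain"}
MIN_CAPTION_WORDS = 3
MIN_SIDE_PIXELS = 100


def passes_text_filter(ex):
    """Caption must exist, be long enough, and not be generic filler."""
    text = ex.caption.strip()
    return (len(text.split()) >= MIN_CAPTION_WORDS
            and text.lower() not in FILLER_CAPTIONS)


def passes_image_filter(ex):
    """Image must be large enough and carry an allowable license."""
    return (min(ex.width, ex.height) >= MIN_SIDE_PIXELS
            and ex.license in ALLOWED_LICENSES)


def passes_suitability_filter(ex):
    """Drop anything flagged as unsuitable for research."""
    return not ex.flagged


def filter_examples(examples):
    return [ex for ex in examples
            if passes_text_filter(ex)
            and passes_image_filter(ex)
            and passes_suitability_filter(ex)]


examples = [
    Example("Half Dome seen from Glacier Point", 800, 600, "cc-by-sa", False),
    Example("Click to enlarge image", 800, 600, "cc-by-sa", False),   # filler text
    Example("A granite dome at sunset", 50, 40, "cc-by", False),      # too small
    Example("View of the valley floor", 800, 600, "unknown", False),  # license
    Example("Some flagged caption text", 800, 600, "cc-by", True),    # flagged
]
kept = filter_examples(examples)
print([ex.caption for ex in kept])
```

Keeping each stage as a separate predicate makes it easy to report how many examples each filter removes.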

In addition, the researchers randomly sampled image-caption sets for evaluation by human editors, who verified that 98% of the samples had good image-caption alignment.

Kaggle contest

The competition is an image-text retrieval task. Given a set of images and text captions, participants must retrieve the appropriate caption(s) for each image.
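
One common way to approach such a task is to embed images and captions in a shared vector space and rank captions by similarity. The sketch below assumes such embeddings already exist and uses cosine similarity; the vectors and caption ids are made up for illustration, and this is not a description of the competition's baseline or scoring.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_captions(image_vec, caption_vecs):
    """Return caption ids sorted from most to least similar to the image."""
    return sorted(caption_vecs,
                  key=lambda cid: cosine(image_vec, caption_vecs[cid]),
                  reverse=True)


# Toy 3-dimensional embeddings (hypothetical values).
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "half_dome": [1.0, 0.0, 0.0],
    "eiffel_tower": [0.0, 1.0, 0.0],
    "sahara": [0.0, 0.0, 1.0],
}

ranking = rank_captions(image_vec, caption_vecs)
print(ranking)
```

In practice the embeddings would come from trained image and text encoders, and the ranking would be computed over many thousands of candidate captions.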

To enable research in this area, Wikipedia has made available images at 300-pixel resolution and ResNet-50-based image embeddings for most of the training and test data. Kaggle will host all of this image data in addition to the WIT dataset itself, and will provide Colab notebooks.

Participants will also have access to a discussion forum on Kaggle to share code and collaborate, so that anyone interested in multimodality can get started and run experiments easily.


Google believes the WIT dataset will help researchers build better multimodal multilingual models and identify better representation techniques, leading to improved machine learning models in real-world tasks over visio-linguistic data.

Amit Raja Naik

Amit Raja Naik is Editor-in-Chief at Analytics India Magazine, where he dives deep into the latest technological innovations. He is also a professional bass player.
