Playing with OpenAI CLIP on multiple datasets
In my last article, we examined how OpenAI CLIP can classify an image among multiple options of provided text (prompts) using a pre-trained model, thus providing zero-shot image classification. We also examined how Roboflow.ai helps you automate your data ingestion pipeline by providing a large variety of data preprocessing and augmentation techniques and the ability to export the dataset in multiple formats.
Thanks to this nice notebook provided by Roboflow.ai, one can now take a shot at understanding the potential of OpenAI CLIP inference on multiple datasets, including the Anki Vector Robot dataset. While this article will consider only the Anki Vector dataset, you can try the same approach with the other public datasets available from Roboflow. Here are the steps:
1. Make a clone of the Colab notebook provided by Roboflow.ai in your Google Drive. A simple “Save a copy in Drive” under the File menu would work.
2. Go to the Anki Vector Robot dataset. Click on “Open AI CLIP Classification” to get a downloadable link to the dataset in CLIP format. The following screenshot should help you. You will need to use this link in the third cell of the notebook, in the “Download Classification Data” section; a rough sketch of this download step appears after the class list below. (Note: the notebook talks about the public flowers dataset, which you are welcome to try as well, but we will limit our discussion to the Anki Vector dataset.)
3. Roboflow generates some standard tokenizations for the dataset, which are in the file ./test/_tokenization.txt. The number of tokens depends on the number of labelled objects. In our case, we have two distinct types of images: ones with no robot, and ones with a Vector in them. The list of classes in Cell 4 will show you the two classes:
['vector', 'empty']
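Before we go further, here is a rough, purely illustrative sketch of what the “Download Classification Data” step (step 2 above) amounts to. The notebook has its own download cell; the placeholder URL below is hypothetical and must be replaced with the export link Roboflow generates for you.

# A sketch of the download step, not the notebook's exact cell.
import urllib.request
import zipfile

DATASET_URL = "PASTE_YOUR_ROBOFLOW_CLIP_EXPORT_LINK_HERE"  # hypothetical placeholder

urllib.request.urlretrieve(DATASET_URL, "roboflow.zip")  # download the archive
with zipfile.ZipFile("roboflow.zip") as zf:
    zf.extractall(".")  # unpacks the CLIP-format export, including ./test/_tokenization.txt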
We will examine whether CLIP can distinguish between the two without any training on this dataset. Because CLIP does not need such task-specific training before inference, it belongs to the class of methods known as zero-shot learning.
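To make the mechanism concrete, here is a minimal sketch of zero-shot scoring with the openai/CLIP package. This is not the notebook’s code; the image path and prompts are just illustrative.

# Minimal zero-shot scoring sketch with the openai/CLIP package
# (typically installed via: pip install git+https://github.com/openai/CLIP.git)
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a robot", "empty"]  # the prompts play the role of class labels
text = clip.tokenize(prompts).to(device)
image = preprocess(Image.open("test/vector/example.jpg")).unsqueeze(0).to(device)  # hypothetical path

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)  # probability distribution over the prompts

print(dict(zip(prompts, probs[0].tolist())))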
You can overwrite the tokenizations in Cell 6, which lets you edit the prompts. Remember that the success of CLIP relies heavily on these prompts: essentially, CLIP generates a probability distribution across the prompts using a model trained on a vast collection of image–text pairs from the Internet. This exercise will focus on testing how the performance of CLIP changes with different prompts. Let us try overwriting the prompts with the following.
#edit your prompts as you see fit here
%%writefile ./test/_tokenization.txt
An example picture of a robot which is the Vector manufactured by Anki
An example picture with no robot
Now you can run all the cells. The easiest way to do this is by going to Runtime and clicking “Run all”. Also make sure that your notebook is configured to use a GPU (provided for free by Google Colab): go to Edit -> Notebook Settings and make sure GPU is selected.
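If you want to confirm that the runtime actually sees the GPU, a quick check (not part of the notebook) is:

import torch
print(torch.cuda.is_available())  # should print True when the GPU runtime is active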
Let us examine the result in the final cell. I get the following result:
accuracy on class vector is :0.7407407407407407
accuracy on class empty is :0.0
accuracy on all is : 0.6666666666666666
which implies that while CLIP was somewhat successful in detecting the images of Vector, it completely failed to classify the pictures with no Vector in them.
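As an aside, these fractions are consistent with a test split of 27 Vector images and 3 empty images; that split is my inference from the numbers, not something the notebook states. A quick sanity check:

# Assumed split: 27 "vector" and 3 "empty" test images, inferred from the fractions above.
n_vector, n_empty = 27, 3
acc_vector, acc_empty = 0.7407407407407407, 0.0
overall = (acc_vector * n_vector + acc_empty * n_empty) / (n_vector + n_empty)
print(overall)  # ≈ 0.6667, matching the “accuracy on all” line above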
Remember that the performance of CLIP depends on the prompts provided. Let us try editing the prompts to see whether that gives us a better result. We try the following prompts…
#edit your prompts as you see fit here
%%writefile ./test/_tokenization.txt
a robot
empty
Now re-run all the subsequent cells. And the result is…
accuracy on class vector is :0.8518518518518519
accuracy on class empty is :1.0
accuracy on all is : 0.8666666666666667
Voila! CLIP is now 100% successful with the empty images and 85% successful with the images with Vector in them.
Now, let us try a different prompt…
#edit your prompts as you see fit here
%%writefile ./test/_tokenization.txt
A picture of a robot
A picture without a robot
Now rerun all subsequent cells, and the result…
accuracy on class vector is :0.9259259259259259
accuracy on class empty is :1.0
accuracy on all is : 0.9333333333333333
That’s a bit better: we roughly halved the incorrect classifications on the Vector images (from about 15% misclassified in the previous example to about 7.5% now).
This is about the best we could get. CLIP is great, but as we have seen, it also needs careful calibration before deployment. For example, just a slight change of prompts could make it perform poorly.
#edit your prompts as you see fit here
%%writefile ./test/_tokenization.txt
An example picture of a robot which has a display, wheels, and a forklift.
An example picture without a robot.
And the result is…
accuracy on class vector is :0.7777777777777778
accuracy on class empty is :0.0
accuracy on all is : 0.7
We regress considerably.
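If you want to probe this prompt sensitivity more systematically than by re-running the notebook for each prompt pair, a rough sweep along the following lines could help. This is not the notebook’s code; it assumes the CLIP-format export unpacks test images into test/<class_name>/ folders, so adjust paths and file extensions to your actual export.

# A rough prompt-sweep sketch (assumed layout: test/<class_name>/*.jpg).
import glob
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["vector", "empty"]  # prompt i corresponds to class_names[i]
prompt_pairs = [
    ["a robot", "empty"],
    ["A picture of a robot", "A picture without a robot"],
]

for prompts in prompt_pairs:
    text = clip.tokenize(prompts).to(device)
    for label_idx, class_name in enumerate(class_names):
        paths = glob.glob(f"test/{class_name}/*.jpg")
        hits = 0
        for path in paths:
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            with torch.no_grad():
                logits_per_image, _ = model(image, text)
            hits += int(logits_per_image.argmax(dim=-1).item() == label_idx)
        print(prompts, "|", class_name, "accuracy:", hits / max(len(paths), 1))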
Hope you enjoy playing with the notebook and the Roboflow.ai public datasets. If you have any questions or thoughts, please leave them in the comments below. Please follow my publication, “Programming Robots”, for more interesting articles. I also have an online course to teach AI with the help of Vector, available at https://robotics.thinkific.com. I would feel honored to have you as a student.