Integrating Vector with OpenAI CLIP
OpenAI rang in the new year with a major announcement: two revolutionary pieces of research. The first is DALL-E, which can generate images from text; the second is CLIP, which provides zero-shot image classification without requiring you to train a model. This article focuses on CLIP: specifically, how the Vector robot can classify the objects it sees, as long as you provide a list of text sequences describing the objects you expect it to encounter.
First, what is the big deal about CLIP? One of the major challenges in deep learning is the labelled dataset required to train a model. While many public datasets are available to help you learn how to train neural networks, building a labelled dataset for the specific requirements of your own application is a major challenge. Worse, a model trained on such a dataset is good only for the use case for which the dataset was generated. This limits the use of machine learning and deep learning to those who can afford to build a labelled dataset for their use case. As an example, if I needed to train Vector to understand all the objects in my room, I would have to generate a dataset of many pictures of my room taken by Vector's camera, and then label the objects in those images. You can easily see that this approach lacks scalability and has high manual costs. Hence the application of deep learning is still largely limited to big companies that can absorb the manual effort behind this process. It also shows why services such as Amazon Mechanical Turk are so popular.
Contrastive Language–Image Pre-training (CLIP) addresses this problem by training a model on annotated images gathered from the Internet, blending image classification with Natural Language Processing (NLP). Because it is trained on 400 million publicly available image–text pairs, it is not biased towards any particular dataset, yet it can match the best results on many commercial benchmarks. If you are interested in the details of CLIP, I recommend the following resources, besides the paper and the introduction on OpenAI's website.
Roboflow's guide on "How to use OpenAI CLIP". Roboflow hosts the Vector recognition dataset as a public repository and provides tools to download it in a format that CLIP can use.
An interesting tutorial on how you can use CLIP
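Conceptually, CLIP's zero-shot classification boils down to encoding the image and each candidate label with the same model and comparing the embeddings. If you want to try CLIP on its own before involving Vector, a minimal sketch using OpenAI's clip package looks something like this (the image path and the labels are placeholders you would replace with your own):

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "toy.jpg" and the labels below are placeholders; use any photo and label list.
image = preprocess(Image.open("toy.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a plush giraffe", "a toy bus", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize the embeddings and take cosine similarities: the label whose
    # embedding is closest to the image embedding is CLIP's prediction.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)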
Now, let's get to the meat of this article, in which we explore how to integrate CLIP with Vector so that Vector gains a better understanding of its surroundings. The full program is available in my Git repository.
The integration is actually pretty simple. Here is the main part of the code: an image taken by Vector is combined with the input text sequences, and CLIP returns the probability that the image matches each of those text sequences.
import clip
import torch

# Module-level setup assumed by this function: load a CLIP model and its preprocessing transform (the full program may use a different variant).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def detectPicture(robot, choices):
    """Captures a single image with the help of Vector, and asks CLIP to classify it based on the choices provided."""
    vectorImage = robot.camera.capture_single_image()
    # Convert the camera frame into the tensor format CLIP expects.
    image = preprocess(vectorImage.raw_image).unsqueeze(0).to(device)
    with torch.no_grad():
        # Per-modality embeddings (not needed for the probabilities printed below).
        image_features = model.encode_image(image)
        text = clip.tokenize(choices).to(device)
        text_features = model.encode_text(text)
        # A single forward pass scores the image against every text choice.
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
    for choice, prob in zip(choices, probs[0]):
        print(choice + ":" + "{:.2f}".format(prob))
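To reproduce the command-line runs shown below, the function above needs a small wrapper that parses the -c argument and connects to the robot. The full program is in the repository; the sketch below is only an illustration of how such a wrapper could look, using argparse and the anki_vector SDK (the option name and the comma-splitting logic are assumptions, not necessarily the repository's exact code):

import argparse
import anki_vector

def main():
    parser = argparse.ArgumentParser(description="Ask CLIP to classify what Vector sees.")
    parser.add_argument("-c", "--choices", required=True,
                        help="Comma-separated text descriptions to score the image against.")
    args = parser.parse_args()
    choices = [choice.strip() for choice in args.choices.split(",")]

    # Connect to Vector and run a single classification pass.
    with anki_vector.Robot() as robot:
        detectPicture(robot, choices)

if __name__ == "__main__":
    main()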
Now, let us examine the power of CLIP in a few settings. I wanted Vector to detect objects in my kid's room, which are mainly plush toys. I placed a plush giraffe in front of Vector. Here is an example image.
Now, let's run the program with the following inputs:
python detectPicture.py -c "a giraffe, a toy, a bus, a train, a plush giraffe, a baby's plush giraffe, a baby"
And here is the output…
a giraffe:0.08
a toy:0.00
a bus:0.00
a train:0.00
a plush giraffe:0.50
a baby's plush giraffe:0.41
a baby:0.00
It's impressive that CLIP can distinguish between a "real" giraffe and a "plush giraffe".
Let us try one more example…
Now, let's run the program with the following inputs:
python detectPicture.py -c "a bus, a toy, a train, a toy bus to play with, toddler's toy bus, a toddler"
a bus:0.00
a toy:0.00
a train:0.00
a toy bus to play with:0.08
a toddler's toy bus:0.92
a toddler:0.00
Here again, CLIP was effective at distinguishing the alternatives and classifying the object as a toddler's toy bus.
While CLIP is impressive, its authors have also done a thorough job of evaluating its deficiencies. CLIP's results clearly vary with the wording of the input strings (a simple mitigation is sketched below). CLIP is not good at detecting the number of objects in an image, and it also fails to understand the distances between objects. In spite of these limitations, it is clear that CLIP is a remarkable breakthrough in the field of deep learning and image classification. Stay tuned as we explore more features and deficiencies of CLIP.
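One practical way to reduce the sensitivity to input strings is the prompt-engineering trick described in the CLIP paper: wrap each raw label in a template such as "a photo of a …" before scoring. A tiny illustrative sketch (the labels here are made up):

# Illustrative only: templated prompts often stabilize CLIP's zero-shot scores.
labels = ["plush giraffe", "toy bus", "train"]
choices = ["a photo of a " + label for label in labels]
# detectPicture(robot, choices)  # score Vector's image against the templated prompts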
Please follow my publication, "Programming Robots", for more interesting tips. I also have an online course that teaches AI with the help of Vector, available at: http://robotics.thinkific.com. I would be honored to have you as a student.