Gemini Robotics On-Device
Three months ago, we discussed the first Vision Language Action (VLA) model from Google DeepMind, called Gemini Robotics. This week, Google DeepMind released a variant of this model called Gemini Robotics On-Device which, as the name suggests, can run directly on a robot. Almost all existing VLA models run on backend machines or a laptop, which means the robot is constantly communicating with a backend server to receive the next action chunk to execute (an action chunk is a short sequence of actions for the robot to perform). This makes the robot's behavior sensitive to both VLA inference speed and network latency and jitter: the robot plays delayed actions and becomes slow to respond. This slowness is often the critical difference between a robot grasping a cup neatly and spilling its contents.
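To make the latency argument concrete, here is a minimal back-of-the-envelope sketch. All numbers are hypothetical (the control frequency, chunk size, and latencies are assumptions for illustration, not measurements of Gemini Robotics): it simply tallies how long the robot sits idle waiting for the next action chunk when inference happens remotely versus on-device.

```python
CONTROL_HZ = 50   # hypothetical control-loop frequency (actions per second)
CHUNK_SIZE = 8    # hypothetical number of actions per chunk

def run_episode(inference_latency_s, num_chunks=5):
    """Simulate a control loop that requests an action chunk, then
    plays it back at CONTROL_HZ. Returns (time stalled waiting on
    inference, time spent actually executing actions)."""
    stalled = 0.0
    for _ in range(num_chunks):
        # The robot is idle while the VLA model computes the next chunk
        # (model inference time plus any network round trip).
        stalled += inference_latency_s
    playback = num_chunks * CHUNK_SIZE / CONTROL_HZ
    return stalled, playback

# Remote server: assume ~50 ms inference + ~100 ms network round trip.
remote_stall, playback = run_episode(0.150)
# On-device: same assumed inference time, no network hop.
local_stall, _ = run_episode(0.050)

print(f"chunk playback per episode: {playback:.2f}s")
print(f"stalled (remote): {remote_stall:.2f}s, stalled (on-device): {local_stall:.2f}s")
```

Under these made-up numbers, the remote setup spends almost as much time stalled as it does moving, while the on-device setup cuts that dead time by the full network round trip on every chunk, and removes jitter entirely since no network is involved.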
Solving the inference delay
Many research groups and companies such as Physical Intelligence (Pi) ha…