Profiling Neural Networks to improve model training and inference speed
Part 1: Learn the basics of performance engineering on neural networks
In a previous post, we studied how to teach the Anki Vector robot to recognize human sign language. Specifically, we trained a custom Convolutional Neural Network (CNN) on a labelled dataset of 8500 images of human signs taken from Vector’s camera. We demonstrated how the trained CNN can be used to detect human signs in this video. We also explored the tradeoffs between a small custom-built CNN model and a large, well-known RESNET model. Other researchers have made similar efforts, such as teaching Anki Cozmo to recognize human sign language.
In this post, we will look for ways to optimize this model in terms of the time it takes to train it (speed of training) and the time it takes to classify an image with it (speed of inference). We can break this process down into three steps.
1. Profile your existing model, and find opportunities for improvement.
2. Make corresponding changes in your model to achieve the desired improvement.
3. Rerun your training pipeline, re-profile, and measure whether you got the improvements.
In this part, we will focus on Step 1, profiling the performance of your model. In future parts, we will study how to improve the performance of your model.
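Before reaching for a full profiler, the measure, change, re-measure loop described above can be approximated with a plain wall-clock timer. The following is a minimal sketch, not the method used in this series: `time_calls` and `fake_inference` are hypothetical names, and `fake_inference` is only a stand-in for whatever call you want to time (for example, classifying one image).

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], repeats: int = 20, warmup: int = 3) -> dict:
    """Call fn repeatedly and return wall-clock latency statistics in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Warm-up calls are not timed; the first calls often include one-off setup cost.
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }


def fake_inference() -> int:
    # Hypothetical stand-in for a real call such as classifying one image.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = time_calls(fake_inference)
    print(f"mean {stats['mean_ms']:.3f} ms, median {stats['median_ms']:.3f} ms")
```

Running the same measurement before and after a change gives a rough first answer to whether the change helped. One caveat: GPU work can run asynchronously, so a timer like this is only meaningful when the timed call returns after the device has finished. A profiler gives a more reliable and more detailed picture.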
Profiling for performance
If your project is based on TensorFlow, the easiest way to profile your model is to use the TensorBoard Profiler. Similar performance profilers are available for PyTorch. This article walks you through the steps of optimizing GPU performance. In our specific case, we will profile the RESNET model while it is trained to recognize human signs. This Colab notebook demonstrates how to run the TensorBoard Profiler while training RESNET. In short, we trained RESNET for 5 epochs and achieved an accuracy of 94%; at the same time, we collected profiles to understand the performance of training the model.
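As a rough illustration of how profiling can be switched on during Keras training, here is a minimal sketch. It is not the code from the notebook. It relies on the `profile_batch` argument of `tf.keras.callbacks.TensorBoard`, which accepts a range of batches to profile. The helper names `profile_batch_range` and `make_profiler_callback`, the `logs/profile` directory, and the warm-up and window sizes are all assumptions made for illustration.

```python
def profile_batch_range(warmup_steps: int, num_steps: int) -> tuple[int, int]:
    """Return the (first, last) batch numbers to profile, skipping warm-up batches."""
    if warmup_steps < 0 or num_steps < 1:
        raise ValueError("warmup_steps must be >= 0 and num_steps must be >= 1")
    first = warmup_steps + 1
    return first, first + num_steps - 1


def make_profiler_callback(log_dir: str, warmup_steps: int = 10, num_steps: int = 10):
    """Build a TensorBoard callback that profiles a short window of training batches."""
    # Imported lazily so the helper above can be used without TensorFlow installed.
    import tensorflow as tf

    return tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        profile_batch=profile_batch_range(warmup_steps, num_steps),
    )


# Hypothetical usage; `model` and `train_ds` stand for your own Keras model and dataset:
#   model.fit(train_ds, epochs=5, callbacks=[make_profiler_callback("logs/profile")])
```

Profiling only a short window after a few warm-up batches keeps the trace small and avoids measuring one-off startup costs. Viewing the result in TensorBoard typically also requires the profiler plugin to be installed.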
Here is a snapshot from the profiler.
A few important observations can be made from the above profile.
Each step of training the model took 631 ms. A step consists of one iteration of updating the parameters of the neural network based on one batch of training data. For each of these steps, the CPU needs to send the batch of training data to the GPU and fire off the computations. In other words, one round of updates to the model parameters took 631 ms.
The majority of the time (~590 ms) falls under the category of device compute time (see the light green curve). This is the time the GPU spends on the matrix multiplications needed to compute the error and the gradient of the error with respect to the model parameters (it's okay if you do not understand the details of how a CNN is trained for this exercise).
About 23 ms (3.7%) falls in the category of kernel launch time. This is the time the CPU spends launching computations (kernels) on the GPU. The TensorBoard Profiler notes an optimization that can be made in this step. We will discuss this optimization later.
The TensorBoard Profiler notes that none of the computations use 16-bit floating point (FP16) arithmetic (noted in the black circle). This is ripe ground for improvement, and in the next part we will discuss how to make this improvement and the tradeoff it presents.
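To make the step-time arithmetic in the observations above concrete, the sketch below turns the quoted timings into shares of the 631 ms step. The function name `step_time_breakdown` is hypothetical, and the inputs are the rounded figures quoted in this post, so the computed percentages may differ slightly from the ones the profiler reports.

```python
def step_time_breakdown(total_ms: float, parts_ms: dict[str, float]) -> dict[str, float]:
    """Return each category's share of the step time in percent, plus the remainder."""
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    shares = {name: 100.0 * ms / total_ms for name, ms in parts_ms.items()}
    # Whatever the listed categories do not account for is reported as "other".
    shares["other"] = 100.0 * (total_ms - sum(parts_ms.values())) / total_ms
    return shares


if __name__ == "__main__":
    # Rounded figures quoted from the profile in this post.
    shares = step_time_breakdown(
        631.0, {"device_compute": 590.0, "kernel_launch": 23.0}
    )
    for name, pct in shares.items():
        print(f"{name:>15}: {pct:5.1f}%")
```

A breakdown like this shows where optimization effort is likely to pay off: when device compute dominates the step, changes that make the GPU computations themselves cheaper matter far more than shaving the launch overhead.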
Thank you for reading. Please subscribe for free to my newsletter to get updates on the next parts of this series.