Chapter 2 of Introduction to MLOps: Creating a great dataset
Successful ML depends on high-quality data
The success of a supervised ML approach depends on the quality of the data it is trained on. While many high-volume, high-quality datasets exist in public repositories such as Kaggle, they are of little use if your goal is to bootstrap an ML application for your own setting.
In this chapter, we discuss how to create a great computer vision dataset with the help of Roboflow.com. We start with how to collect images specific to your environment, which in my case is a home robot. We discuss the aspects to keep in mind when building a diverse and representative dataset. We then go through the different strategies for labeling images, including how you can use ML to assist the labeling with Roboflow Label Assist. We have found that Label Assist reduces the labeling effort by roughly a factor of 10, and we strongly recommend it.
We also discuss ways to post-process and augment the dataset while keeping an eye on dataset health.
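To make the idea of dataset health a little more concrete, here is a minimal sketch of one common health signal: class balance. It is not Roboflow's implementation. The function name class_balance and the example labels are hypothetical, and the sketch assumes you have exported one class name per annotated object.

```python
from collections import Counter


def class_balance(labels):
    """Count objects per class and report how lopsided the dataset is.

    `labels` is assumed to hold one class name per annotated object.
    Returns (counts, ratio), where ratio is the most frequent class count
    divided by the least frequent one, or None for an empty dataset.
    """
    counts = Counter(labels)
    if not counts:
        return {}, None
    most = max(counts.values())
    least = min(counts.values())
    return dict(counts), most / least


# Hypothetical labels from a small home-robot dataset
labels = ["cup", "cup", "shoe", "cup", "cable", "shoe"]
counts, ratio = class_balance(labels)
print(counts)
print(ratio)
```

A ratio far above 1 suggests that some classes are under-represented, which is a hint to collect or augment more images of those classes before training.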
I am still developing all the chapters of this course, so any feedback or suggestions for improvement would be greatly appreciated. Many thanks for listening to the videos.