The goal of this blog series is to cover the key topics to consider when operationalizing machine learning and to provide a practical guide for navigating the modern tools available along the way. To that end, subsequent posts will dive into more detailed architecture concepts and help you apply them to your own model pipelines.
This blog series will not explain machine learning concepts themselves but will instead tackle the auxiliary challenges: dealing with large data sets, computational requirements and optimizations, and deploying models and data into large software systems.
Most classical software applications are deterministic: the developer writes explicit lines of code that encapsulate the logic for the desired behavior.
ML applications, by contrast, are probabilistic: the developer writes more abstract code and lets the computer produce the rest in a human-unfriendly language, i.e. the weights or parameters required for the ML model. Andrej Karpathy has written in detail about this same difference in his blog.
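To make the contrast concrete, here is a minimal sketch using a toy, purely hypothetical spam-detection feature: in the first function the developer writes the rule explicitly; in the second, the "rule" ends up encoded in learned weights.

```python
# Software 1.0: the developer writes the decision logic explicitly.
def is_spam_rule_based(num_exclamations: int) -> bool:
    return num_exclamations > 3  # explicit, human-readable rule

# Software 2.0: the developer writes abstract training code; the "logic"
# lives in learned weights that are not human-readable.
from sklearn.linear_model import LogisticRegression

X = [[0], [1], [5], [8]]   # toy feature values (hypothetical)
y = [0, 0, 1, 1]           # toy labels: 0 = not spam, 1 = spam
model = LogisticRegression().fit(X, y)
print(model.coef_, model.intercept_)  # the weights: the "code" the computer wrote
```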
This requires new approaches to gathering data, cleaning, training, and deployment, since apart from the code we also have weights and data that keep changing.
One of the first steps in any machine learning project is to gather data, clean it, and make it ready for experimentation and model building.
These steps may start out as manual work, but without automated pipelines to operationalize these ETL processes, technical debt accumulates over time.
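As a rough sketch of what a single automated step in such a pipeline might look like, assuming a pandas-based workflow (the file paths and column names below are hypothetical):

```python
import pandas as pd

def clean_raw_events(input_path: str, output_path: str) -> pd.DataFrame:
    """One step of a hypothetical ETL pipeline: load, clean, and persist a data set."""
    df = pd.read_csv(input_path)

    # Drop exact duplicates and rows missing the (hypothetical) label column.
    df = df.drop_duplicates()
    df = df.dropna(subset=["label"])

    # Normalize a timestamp column so downstream steps see a consistent schema.
    df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")

    df.to_csv(output_path, index=False)
    return df

# In an orchestrated pipeline (Airflow, Prefect, etc.) this function would be one task.
# cleaned = clean_raw_events("data/raw_events.csv", "data/clean_events.csv")
```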
In addition, there needs to be a way to store large data sets, whether on cloud storage or a file storage system. Storage also implies proper tooling for gathering and labeling the data and for making data access scalable.
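As one example, assuming the data lives in AWS S3 (the bucket and key names below are hypothetical), a pipeline step could push the cleaned data set to object storage so other jobs can read it:

```python
import boto3

# Hypothetical bucket and paths; credentials are assumed to be configured
# through the standard AWS mechanisms (environment variables, IAM role, etc.).
s3 = boto3.client("s3")

# Upload a cleaned data set so it is accessible to training jobs elsewhere.
s3.upload_file(
    Filename="data/clean_events.csv",
    Bucket="my-ml-datasets",
    Key="events/clean_events.csv",
)

# Download it later from a training environment.
s3.download_file(
    Bucket="my-ml-datasets",
    Key="events/clean_events.csv",
    Filename="/tmp/clean_events.csv",
)
```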
Finally, as the data is transformed, it is key to track data versions so that downstream, when the data is used for experimentation, training, or testing, each run can be associated with a specific, traceable version of the data.
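A bare-bones way to get started, before adopting a dedicated data versioning tool such as DVC, is to content-hash each data file and record it in a manifest. The sketch below makes that assumption; the file names are hypothetical:

```python
import hashlib
import json
from datetime import datetime, timezone

def register_data_version(path: str, manifest_path: str = "data_versions.json") -> str:
    """Record a content-addressed version of a data file in a simple JSON manifest."""
    # Hash the file contents so any change to the data produces a new version id.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    try:
        with open(manifest_path) as f:
            manifest = json.load(f)
    except FileNotFoundError:
        manifest = []

    manifest.append({
        "path": path,
        "version": digest[:12],
        "registered_at": datetime.now(timezone.utc).isoformat(),
    })
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

    return digest[:12]

# A training or experiment run can log this version id alongside its results,
# so every run stays traceable to the exact data it was produced from.
# version_id = register_data_version("data/clean_events.csv")
```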
Once data is gathered and explored, it is time to perform feature engineering and modeling. While some methods require strong domain knowledge to make sensible feature engineering decisions, others can learn the relevant features largely from the data. Models such as logistic regression, random forests, or deep learning architectures are then trained on the resulting features.
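A common way to keep these steps reproducible is to bundle feature engineering and the model into a single pipeline object. The sketch below assumes scikit-learn and uses hypothetical column names:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; replace with the features of your own data set.
numeric_features = ["age", "num_purchases"]
categorical_features = ["country"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_features),
    ("encode", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Bundling feature engineering and the model means they are versioned,
# trained, and deployed together as one artifact.
pipeline = Pipeline([
    ("features", preprocess),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])

# pipeline.fit(X_train, y_train)
# pipeline.predict(X_test)
```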
There are multiple steps involved here, and keeping track of experiment versions is essential for governance and for reproducing previous experiments.
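As an illustration, a tracking tool such as MLflow can record the parameters, data version, and metrics of each run. The experiment name, parameters, and metric values below are hypothetical:

```python
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run():
    # Log what went into the run: model choice, hyperparameters, and data version.
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("data_version", "a1b2c3d4e5f6")  # e.g. the id from the manifest above

    # ... train and evaluate the model here ...

    # Log what came out of the run, so it can be compared and reproduced later.
    mlflow.log_metric("validation_auc", 0.91)  # hypothetical metric value
```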