In modern computing solutions, the concept of a DAG, or Directed Acyclic Graph, is central. The term has become something of a buzzword, but understanding what DAGs are, how they are used in computing, and where they show up in data science and machine learning is more than buzz. In short, a DAG describes the sequence of execution steps in a complex, non-recurring computation.
How often do you come across the need to create a DAG in machine learning?
Machine learning can be constructively defined to be “the art of building DAGs to treat and transform data using a sequence of advanced mathematical transforms to generate a repeatable formulation that solves a problem for new data.” In Data Science and Machine Learning, a pipeline or workflow is nothing but a DAG. Note that this is not the only place where DAGs are found in Data Science/Machine Learning.
The point is, as you build your ML code, you need to orchestrate your workflow. There is little reason to do this manually, but not everyone is aware of the tools out there. A first step is to understand what you are building, a DAG, and this article aims to help you down that path. So let’s get started!
So, why do we call them DAGs? A DAG is a Directed Acyclic Graph — a mathematical abstraction of a pipeline. Let’s break this down a bit, though.
A graph is a collection of vertices (or points) and edges (or lines) that indicate connections between the vertices.
A directed graph is a graph in which each edge points in a direction from one vertex to another. Two edges may exist between the same pair of vertices, but in that case they point in opposite directions. A directed graph is often referred to as a digraph.
A graph is cyclic if it contains one or more cycles, where a cycle is a path along edges that leaves a vertex and returns to that same vertex without using any edge twice. In a directed graph, the path must follow the direction of the edges. A graph is acyclic when it contains no cycles.
Therefore, a directed acyclic graph or DAG is a directed graph with no cycles. A rather simple concept once you can put some definition to it. And look at that — you came here to read about DAGs in ML Pipelines, and you got a mini-lesson in Graph Theory!
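To make the definitions concrete, here is a minimal sketch of a cycle check on a directed graph. The adjacency-dictionary representation and the helper name `has_cycle` are assumptions made for illustration; they are not taken from any particular library.

```python
from typing import Dict, List

# Hypothetical representation: each vertex maps to the vertices its edges point to.
Graph = Dict[str, List[str]]


def has_cycle(graph: Graph) -> bool:
    """Return True if the directed graph contains at least one cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {v: WHITE for v in graph}

    def visit(v: str) -> bool:
        color[v] = GRAY
        for nxt in graph.get(v, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # this edge leads back onto the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)


dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(has_cycle(dag))     # False
print(has_cycle(cyclic))  # True
```

The first graph is a DAG: two paths lead from `a` to `d`, but nothing leads back. The second is not, because following the edges from `a` eventually returns to `a`.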
So, in the end, the reason that the DAG terminology is used is not just because it is cool to say DAG, but because it describes the nature of a workflow in cases where you NEVER look back to previous steps.
Now, you may be saying… “We loop back all the time in machine learning; the model training step is fraught with it when you are optimizing, recurrent neural networks loop back on themselves, and so on!” — You are correct. A DAG is not a universal descriptor of a pipeline when you zoom in, but it is a descriptor at the high level of execution; it describes the steps being taken — aka the pipeline.
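As a rough illustration of that point, the sketch below describes a pipeline as a DAG of steps while the training step loops internally. The step names and the toy optimizer are hypothetical, and `graphlib` assumes Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "build_features": {"clean_data"},
    "train_model": {"build_features"},
    "evaluate_model": {"train_model", "build_features"},
}


def train_model(steps: int = 100, lr: float = 0.1) -> float:
    """Toy optimizer: minimize (w - 3)^2 by gradient descent."""
    w = 0.0
    for _ in range(steps):  # this loop lives inside a single DAG node
        w -= lr * 2 * (w - 3)
    return w


# The high-level order of execution never revisits a step.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
# ['load_data', 'clean_data', 'build_features', 'train_model', 'evaluate_model']
```

The loop inside `train_model` is not an edge in the graph. From the pipeline's point of view, training is one step that starts once and finishes once.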
Now, while some folks (and by some, I mean many or most) still build their pipelines manually, doing so is labor-intensive, time-consuming, and hard to repeat!
Many tools can build your pipeline for you. A few of them are:
These are tools that help you build compute pipelines as DAGs. They each have their own learning curves and cost/benefit tradeoffs. Choosing what is right for you truly depends on your operating environment, enterprise strategy, level of user expertise required, and whether you lean towards Open Source Solutions or commercial solutions.
Now, beyond building pipelines, and just as a matter of curiosity, DAGs are also present in many other computing solutions:
And in so many more places. It turns out that knowing this concept will help you understand the execution of many processes (not just those related to machine learning pipelines).
Now, go out and read more, learn more, and explore the many tools out there. Look through your code: are you creating DAGs without being aware of it? If you don’t want to use one of the solutions mentioned, you could even build your own home-grown solution, e.g., by utilizing the networkx library in Python. But guess what? If you are doing that, then consider using Prefect instead; it is low-level and meant to be used at that programmatic level.
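For a sense of what a home-grown solution involves, here is a minimal sketch of a DAG runner. It uses the standard-library `graphlib` module (Python 3.9 or later) rather than networkx, purely to stay dependency-free; the `run_dag` helper and the task names are hypothetical, and a real orchestrator adds retries, scheduling, logging, and much more.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Any, Callable, Dict, Tuple

# Hypothetical task format: name -> (function, upstream task names).
# Each function receives the results of its upstream tasks, in order.
Task = Tuple[Callable[..., Any], Tuple[str, ...]]


def run_dag(tasks: Dict[str, Task]) -> Dict[str, Any]:
    """Run every task once, after all of its upstream tasks have finished."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    unknown = {d for deps in graph.values() for d in deps} - set(tasks)
    if unknown:
        raise KeyError(f"undefined upstream tasks: {sorted(unknown)}")

    results: Dict[str, Any] = {}
    # static_order() raises CycleError if the graph is not acyclic.
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[d] for d in deps))
    return results


tasks = {
    "load": (lambda: [1, 2, 3, 4], ()),
    "double": (lambda xs: [2 * x for x in xs], ("load",)),
    "total": (lambda xs: sum(xs), ("double",)),
    "report": (lambda xs, t: f"{len(xs)} rows, total {t}", ("load", "total")),
}

results = run_dag(tasks)
print(results["report"])  # 4 rows, total 20
```

If two tasks depend on each other, the graph is no longer a DAG, and this sketch fails with `CycleError` before running anything past the cycle.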
Are you overwhelmed with the concept of a DAG? Which DAG-focused orchestration tool should you adopt, and are you using the right tool for training and deploying your ML pipelines? Hashmap can help you here. Our machine learning and MLOps experts are here to help you on your journey and to bring you and your organization to the next level. Let us help you get ahead of your competition and become truly efficient in your data analytics.
If you’d like assistance along the way, then please contact us.
Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, data science, MLOps, and various other technology consulting services.