The 4 Steps to Branching in Git that Data Scientists Should Know


I believe in using version control, no matter how small the project is. It’s especially important when you are doing any kind of rapid, agile, or iterative development. Training a model and hyperparameter tuning are prime examples of situations where you’ll want snapshots of multiple experiments and milestones.

If you are new to Git, you may be wondering what a branch is. The easiest way to visualize how branches work is to think of the plot of a time travel movie. There is the main timeline, which in Git is called master. When you want to do something new, you can branch off of the main timeline into an alternate timeline. The primary and alternate timelines keep moving forward in parallel, but at any point, you can jump back into the main timeline. When you do this in Git, you can choose whether to merge your changes.

It looks something like this:

All jokes aside, understanding how and when to branch is critical to working with Git. I’ll walk you through the four-step process I use on my projects.

As I mentioned earlier, when you create a new repo, you are automatically working in the master branch. You can see this by typing git branch in the command line.
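As a minimal sketch, the commands below build a throwaway repository in a temporary directory and list its branches. The setup lines, the identity values, and the commit message are assumptions made for illustration. One detail worth knowing: git branch lists nothing until the repository has at least one commit.

```shell
# Throwaway repository in a temporary directory (illustrative setup, not from the article).
cd "$(mktemp -d)" && git init -q
# Name the initial branch master in case the local Git is configured with a different default.
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
# An empty first commit, so that there is a branch to list.
git commit -q --allow-empty -m "Initial commit"

# List local branches; the current one is prefixed with an asterisk.
git branch
# prints: * master
```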

This should return all of the branches in your project. The one you are currently working in is marked with an asterisk and, in most terminals, shown in green.

At this point, we should take a moment to discuss when you should create a new branch. Let’s say you are getting ready to work on a new feature or refactor something. You don’t want to worry about checking in broken code while you figure things out, so it’s a good idea to make a sandbox to experiment in. Think of a branch as a copy of your main project where it’s safe to make changes without affecting the main codebase.

I use a specific naming convention for my branches. To create a new branch, you’ll use the previous command but add the name of your branch to the end of it.
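A minimal sketch of creating a branch, again in a throwaway repository. The branch name v0.1.0-dev is a hypothetical placeholder chosen to look like a version number, and the setup lines are assumptions for illustration.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

# Create a branch from the current commit. The name is a hypothetical placeholder.
git branch v0.1.0-dev

# Both branches are now listed, but master is still the one checked out.
git branch
# prints:
# * master
#   v0.1.0-dev
```

Creating the branch does not switch to it, which is why the checkout step exists.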

As you can see, I use a software versioning convention to name my branches. I start at and work my way up. So the next branch would be until I reach a new full version like . Sometimes, you need to make an incremental change to an existing version. In this case, I would create a branch called .

If you work in a group, you may want to designate one person as the repo owner and make them the only person who can create a new feature branch. Everyone on the team could work directly in that branch or create a sub-branch of their own. In this scenario, I would create a personal branch called or add a specific feature name to it like .

Over time you will have a complete history of all the major feature releases of your project. It could look something like this:

Creating a branch is just part of the equation. If you want to work in the new branch, you need to do a checkout first. Checking out a branch is done with git checkout followed by the name of the branch.

When the checkout is complete, you will see a confirmation in the console that Git has switched to the branch.

Now, if you type git branch in the console, it will highlight the new branch you are working in.
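The checkout sequence described above can be sketched as follows. The branch name is a hypothetical placeholder and the repository is a throwaway one created only for this example.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"
git branch v0.1.0-dev        # hypothetical branch name

# Switch the working directory to the new branch.
git checkout v0.1.0-dev
# prints: Switched to branch 'v0.1.0-dev'

# The asterisk has moved to the branch that is now checked out.
git branch
# prints:
#   master
# * v0.1.0-dev
```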

There are a few things you should keep in mind when checking out branches.

The best thing to do before switching branches is to use git status and make sure everything is good to go before using git checkout.
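Here is a small sketch of that habit, assuming a throwaway repository, a hypothetical branch name, and a hypothetical params.yaml file standing in for uncommitted work. It checks the status, commits what is pending, and only then switches branches.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"
git branch v0.1.0-dev                      # hypothetical branch name
echo "learning_rate: 0.01" > params.yaml   # hypothetical uncommitted work

# Check for uncommitted changes; --short prints one line per changed file.
git status --short
# prints: ?? params.yaml

# Commit the pending work so nothing is carried along by accident, then switch.
git add params.yaml
git commit -q -m "Add training parameters"
git checkout -q v0.1.0-dev
```

Because params.yaml was committed on master after the branch was created, it is not present in the working directory once the other branch is checked out.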

Finally, if you are working on a project you downloaded from an online repo, you may not have local access to all of the branches. You can use git fetch to pull down references to the remote origin’s branches, then check out the one you want.
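To make this concrete without a real server, the sketch below uses two throwaway repositories on disk: one named origin that stands in for the online repo, and one named clone that stands in for your local copy. The directory names and the branch name are hypothetical.

```shell
# Two throwaway repositories (illustrative setup).
work="$(mktemp -d)"
git init -q "$work/origin" && cd "$work/origin"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"
git clone -q "$work/origin" "$work/clone"

# A branch appears on the remote after the clone was made (hypothetical name).
git branch v0.1.0-dev

cd "$work/clone"
# Download references to the remote's branches, then list them.
git fetch -q origin
git branch -r

# Checking out the name creates a local branch that tracks the remote one.
git checkout -q v0.1.0-dev
```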

So after you have been working in a new branch long enough, it’s time to save everything you have done. While each commit has already saved your work incrementally, at some point you’ll want to merge the changes back into the master branch. When you are done making changes in the working branch and the code is stable, check out the master branch with git checkout master.

Once you are in the master branch, you can merge the feature branch changes with git merge followed by the name of the feature branch.

You always want to be in the branch that will receive the merge before you use the git merge command. Assuming everything merged correctly, you should now have all of the code from your feature branch inside of the master branch.
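A minimal sketch of the merge step, assuming a throwaway repository, a hypothetical feature branch named v0.1.0-dev, and a hypothetical train.py file committed on it.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

# Do some work on a feature branch (hypothetical branch and file names).
git branch v0.1.0-dev && git checkout -q v0.1.0-dev
echo "print('training')" > train.py
git add train.py && git commit -q -m "Add training script"

# Switch to the branch that will receive the merge, then merge the feature branch into it.
git checkout -q master
git merge -q v0.1.0-dev

# The file committed on the feature branch is now on master.
ls
# prints: train.py
```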

If you run into a conflict, the console will tell you where the issues are. You can work through these issues using a variety of strategies that are a bit out of scope right now.

The final part of this process is to tag the changes you made. Tagging is similar to branching. The tag represents a snapshot of the project at a given time. Now that you have merged a feature branch into the master branch, it would be a good time to create a tag of that version. So while you are still in the master branch, type git tag followed by the name of the tag into the command line.

Notice I don’t add to the tag’s name. This is to help denote a finished and stable point in the project. So when someone checks out the project, they can immediately build the project in the latest stable state from the master branch. Likewise, if you want them to only use a specific state of the codebase, like what was shipped on a specific date, they can switch to the corresponding tag.

You can see all of the tags in a project at any time by simply typing git tag into the console.

Just like with git branch, you will see a list of all of the tags you have locally in the current repo.
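The tagging commands can be sketched like this. The tag name v0.1.0 is a hypothetical placeholder, and the repository is a throwaway one.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

# Tag the current commit on master. The name is a hypothetical placeholder.
git tag v0.1.0

# List all local tags.
git tag
# prints: v0.1.0

# Switch to the snapshot the tag points to. This leaves HEAD detached,
# because a tag is a fixed snapshot rather than a branch you commit to.
git checkout -q v0.1.0
```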

To sum everything up, these are the four essential commands when it comes to branching in Git: git branch, git checkout, git merge, and git tag.
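Strung together, the four commands might look like the sketch below. The branch, tag, and file names are hypothetical, the repository is a throwaway one, and the numbering in the comments is just one reasonable way to map the commands to steps.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"

# 1. Create a branch (hypothetical name).
git branch v0.1.0-dev
# 2. Check it out and commit some work (hypothetical file).
git checkout -q v0.1.0-dev
echo "epochs: 10" > config.yaml
git add config.yaml && git commit -q -m "Add training config"
# 3. Return to master and merge the branch into it.
git checkout -q master
git merge -q v0.1.0-dev
# 4. Tag the merged result (hypothetical name).
git tag v0.1.0
```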

Now you can do steps 1–4 to keep creating new branches to work in. Remember, while working in a branch, you can commit just like you usually would. Use those commits and their comments to track the incremental progress you make towards a stable release.

The last thing to point out is that you should never commit code to a tag. The proper way to patch a tag would be to check out the branch it was derived from and create a new branch from there. For example, tag corresponds to branch . This means you would add the patch code to a new branch and then merge those changes back into and tag it as .
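As a rough sketch of that patching flow, assume a hypothetical tag v0.1.0 that was cut from a hypothetical branch v0.1.0-dev. The patch branch v0.1.1-dev, the new tag v0.1.1, and the patch.txt file are likewise placeholders, and the repository is a throwaway one.

```shell
# Throwaway repository in a temporary directory (illustrative setup).
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User" && git config user.email "user@example.com"
git commit -q --allow-empty -m "Initial commit"
# State before the patch: a release branch and the tag cut from it.
git branch v0.1.0-dev
git tag v0.1.0

# Start from the branch the tag was derived from and create a patch branch there.
git checkout -q v0.1.0-dev
git checkout -q -b v0.1.1-dev
echo "fix" > patch.txt
git add patch.txt && git commit -q -m "Apply critical fix"

# Merge the patch back into master and give it its own tag.
git checkout -q master
git merge -q v0.1.1-dev
git tag v0.1.1

git tag
# prints:
# v0.1.0
# v0.1.1
```

The original tag is never modified; the fix ships under a new tag.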

If the master branch is ahead of the feature branch you are patching, you need to decide if you want to tag from the current branch or merge to master and include those changes in the next release’s tag.

While you may want to merge the changes to the master branch, they may need to be merged into additional feature branches as well. So to keep things simple, I would only patch a tag if there is a critical need to update production that can’t wait until the next release.

The only hard rule I have is that no code or output from the project goes to production without a tag. It’s critical to know what version is live and have a snapshot of it in case something goes wrong. This starts getting into the DevOps side of things. If you’d like a primer on how DevOps evolved over the past few decades, be sure to check out my article on “The Evolutions of DevOps Culture.”

I hope this is helpful if you are new to Git or looking for some help when it comes to working with branches. I am happy to answer any questions you may have if you leave a comment below.
