The Data Daily

3 Great Design Patterns for Data Scientists


When writing code as a data scientist, your goal is often to write things quickly so that you can vet whether or not something is a good idea before you get too far down the road. Nobody likes to spend months working on a project only to find out that it’s garbage.

So you write your code as quickly as possible when prototyping. But what happens when your just-get-it-working-for-now code isn’t cutting it anymore, and your code needs to be more robust and maintainable? This is where design patterns come in handy.

To put it simply, design patterns are common solutions to common problems when writing software. What makes them so great is that they’re so universally applicable, but you have to know how to apply them. You can learn more in-depth about some common design patterns here.

I can think of a couple of reasons that I love using them.

So, without further ado, let’s get into 3 great design patterns for data science workflows.

The builder pattern is a flexible way of creating complex objects, especially when these objects share a lot of structure but have a lot of optional parameters. The builder pattern takes the construction logic out of the object itself and moves it into a separate builder, which assembles the object step by step, often by using method chaining. The key to enabling method chaining is to return the builder itself from each method used to build the object you want, so that chained methods keep modifying the same object.

I write a ton of SQL queries day to day, and found that there’s a lot of similarity in structure to most of my queries. However, writing them by hand is a fairly error-prone process and creates a lot of duplicated code. So rather than writing dozens of individual queries, I use the builder pattern to generate queries for me. This also comes in handy a lot when I write big, nasty queries with nested select statements and multiple joins, where it’s easy to get lost in the weeds and make mistakes when writing queries by hand. Not to mention this method is easily testable, whereas writing SQL queries by hand is harder to test!

Let’s write a simple query builder to illustrate how this pattern can be useful.

I first initialize the builder with the base table from which I’ll be selecting tuples. Then I can add columns to select, ‘group by’ clauses, joins, and ‘where’ clauses as I need them. This is overkill for a simple “SELECT * FROM foo” type of query, but these building blocks make it easier to build more and more complex queries.

Here’s an example of using the builder pattern to make a simple SQL query generator:
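A minimal sketch of such a builder might look like the following. The `QueryBuilder` class, its method names, and the `orders`/`customers` tables are all hypothetical, chosen for illustration rather than taken from the author's own code. Conditions are passed as raw strings here, so a real version would need parameterized values before accepting untrusted input.

```python
class QueryBuilder:
    """Hypothetical builder that assembles a SELECT statement piece by piece."""

    def __init__(self, table):
        # The base table is the only required piece; everything else is optional.
        self._table = table
        self._columns = []
        self._joins = []
        self._wheres = []
        self._group_bys = []

    def select(self, *columns):
        self._columns.extend(columns)
        return self  # returning self is what enables method chaining

    def join(self, table, on):
        self._joins.append(f"JOIN {table} ON {on}")
        return self

    def where(self, condition):
        self._wheres.append(condition)
        return self

    def group_by(self, *columns):
        self._group_bys.extend(columns)
        return self

    def build(self):
        # Fall back to "*" when no columns were requested.
        columns = ", ".join(self._columns) if self._columns else "*"
        parts = [f"SELECT {columns}", f"FROM {self._table}"]
        parts.extend(self._joins)
        if self._wheres:
            parts.append("WHERE " + " AND ".join(self._wheres))
        if self._group_bys:
            parts.append("GROUP BY " + ", ".join(self._group_bys))
        return " ".join(parts)


query = (
    QueryBuilder("orders")
    .select("customers.name", "COUNT(*) AS n_orders")
    .join("customers", on="orders.customer_id = customers.id")
    .where("orders.total > 100")
    .group_by("customers.name")
    .build()
)
print(query)
```

Because `build()` returns a plain string, each clause can be checked in a unit test without touching a database.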

In its simplest form, dependency injection is when you insert the thing you’re depending on as an argument. Don’t know which database class to use? Your function doesn’t need to know how the database class works, just that it does. Passing in the database class instance as an argument makes it easier to maintain — you can use any kind of database class that follows the same interface.

Without dependency injection, each function creates or imports its own database class, so changing or replacing critical infrastructure like database classes means touching every function that uses it.

One other great benefit of using dependency injection is that your code is much easier to write tests for. Just write a mock class (e.g. a mock database class) and use that in your tests, rather than having to use code that runs HTTP requests and slows down tests, for example.

My team uses both SQL Server and Cosmos DB, as well as other data sources. Passing in the database class as an argument makes it easy to swap out different databases for different ideas, and makes writing testable code a lot easier, since database classes are easy to mock.

Here’s a simple example of using dependency injection:
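As a rough sketch, assume every database class exposes a single `run_query` method. That interface, the `Database` protocol, the `InMemoryDatabase` mock, and the `count_rows` function are all assumptions made for this illustration, not a real client library.

```python
from typing import Protocol


class Database(Protocol):
    """Assumed interface: anything with a run_query method qualifies."""

    def run_query(self, query: str) -> list: ...


class InMemoryDatabase:
    """Hypothetical mock that returns canned rows instead of hitting a server."""

    def __init__(self, rows):
        self._rows = rows
        self.queries = []  # record what was asked, which is handy in tests

    def run_query(self, query: str) -> list:
        self.queries.append(query)
        return list(self._rows)


def count_rows(database: Database, query: str) -> int:
    # The database is injected as an argument; this function never
    # constructs a client or knows which backend it is talking to.
    return len(database.run_query(query))


mock_db = InMemoryDatabase(rows=[{"id": 1}, {"id": 2}])
print(count_rows(mock_db, "SELECT * FROM foo"))
```

Swapping in a SQL Server or Cosmos DB wrapper would only require that wrapper to provide the same `run_query` method; `count_rows` itself would not change.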

The decorator pattern is useful when you want to do something before and/or after a function, but don’t want to modify the function itself. Essentially, what you’re doing is capturing some state before your function runs, then capturing some state after it’s done. The benefit is most apparent when you have dozens of functions to modify in the same way, but can’t afford to change them individually.

Things that I’ve found useful to capture are how long the function runs, the function’s name, and sometimes properties of the output. Thankfully, Python functions are first-class objects, so you can pass one function to another, and the ‘@’ decorator syntax makes this pattern concise. All you need to do is create a function that wraps an inner function, then place @my_decorator_name on the line above the function you want to decorate.

It’s easier to see an example than to explain it with plain English :)

I won’t get too deep into how decorators work in Python, but RealPython has a great article I highly recommend as a primer.

Reusing some of the code from the dependency injection example, we can time how long our database transaction would take:
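One way this could look is sketched below. The `timed` decorator, the `SlowDatabase` stand-in (which sleeps briefly to simulate a round trip), and `fetch_rows` are hypothetical names invented for this example; only `functools.wraps` and `time.perf_counter` come from the standard library.

```python
import functools
import time


def timed(func):
    """Print how long each call to the wrapped function takes."""

    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()           # state captured before the call
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start  # state captured after the call
        print(f"{func.__name__} took {elapsed:.4f} seconds")
        return result

    return wrapper


class SlowDatabase:
    """Hypothetical database class that simulates a slow query."""

    def run_query(self, query):
        time.sleep(0.01)
        return [{"id": 1}]


@timed
def fetch_rows(database, query):
    return database.run_query(query)


rows = fetch_rows(SlowDatabase(), "SELECT * FROM foo")
```

The body of `fetch_rows` is untouched; the timing logic lives entirely in `timed` and can be applied to any other function with a single line.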

Design patterns make for very reusable code, and you can put pieces together like building blocks to make your work a lot easier as a data scientist. For example, I’ll often combine all three of these patterns to write queries to a database and see how long the query took in order to know if I need to optimize.
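To make the combination concrete, here is a compact sketch that puts the three patterns together. Every name in it (`timed`, `Query`, `FakeDatabase`, `run`) is hypothetical and kept deliberately small; it illustrates the shape of the idea, not the author's actual setup.

```python
import functools
import time


def timed(func):
    """Decorator: report how long the wrapped call took."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")

    return wrapper


class Query:
    """Builder: each method returns self so calls can be chained."""

    def __init__(self, table):
        self.table, self.columns, self.conditions = table, [], []

    def select(self, *columns):
        self.columns.extend(columns)
        return self

    def where(self, condition):
        self.conditions.append(condition)
        return self

    def build(self):
        sql = f"SELECT {', '.join(self.columns) or '*'} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql


class FakeDatabase:
    """Injected dependency; a real client would expose the same method."""

    def run_query(self, sql):
        return [{"sql": sql}]


@timed
def run(database, query):
    # The database is injected, the query is built, and the call is timed.
    return database.run_query(query.build())


result = run(FakeDatabase(), Query("foo").select("a", "b").where("a > 1"))
```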

In this article, I’ve shown three ways to use design patterns as a data scientist for more robust, maintainable code. When you use design patterns in data science, your code quality goes up, your maintenance is easier, and your results are easier to reproduce and share.