There are plenty of blogs, articles, and books on data science techniques, but much less on how one actually goes about setting up a data science team. What kind of skills are needed? How should they fit together? Who should do what? In this post I’ll sketch out the crucial elements of a data science setup, introduce a cast of characters, and give an overview of how they work together.
In this context, I’ll be talking about data science teams that build algorithms into software, and so have to worry about code performance, stability, operations etc. This doesn’t apply to data scientists who do analytics or statistics, where the algorithms may be just as complex but the outcome is more a decision, report, or dashboard.
Terms such as “product manager”, “engineer”, and “data scientist” don’t have consistently defined meanings, so if you’re reading this and find yourself going “wait, that’s not a data scientist’s job!” then please, bear with me. This is less about the job title of the person who needs to, say, write unit tests to ensure the code is stable, and more about pointing out that someone needs to do this, and whilst that someone is doing that, they aren’t able to do other things like model development or feature exploration.
Also, I’m not implying any kind of organizational design. There are a number of ways to organize teams on data science initiatives, from the squad model championed by Spotify to the more traditional vertical where all of these roles would report up into a single overall lead. The “right” setup will be a function of the organizational culture, resources, maturity and local constraints, and whilst I do have some opinions here I’ll save those for another time.
Here I’ll refer to the piece of software we’re building algorithms for as the “product”. Products are rarely driven by a single algorithm, rather being built out of a number of algorithms, business rules, user interfaces, APIs etc. For example, let’s say we’re building a tool to optimize routing taxis around a city. This will likely be made of a number of elements:
Now, not all of these have to be in the same product. My point is rather that most data science driven products involve multiple algorithms and components working together, so one has to think more broadly than optimizing a single algorithm or function.
Before we get to the team proper, there’s one vital group of people that we need to talk about: stakeholders. They come in a range of guises: customers, executive leadership, other internal teams, even other data scientists. In essence, a stakeholder is any group that either you want to influence, or wants to influence you.
Now you may be thinking: that’s simple! It’s whoever we’re building the algorithm for! But that would be a naive mistake. Any large scale data science project has a range of stakeholders, and properly understanding who they are, what they want, and how much influence they have (or should have) is crucial to the success of any data science, or other, product.
For example, back to our taxi routing product, who are your stakeholders? Well the obvious two are:
But is that it? Well, likely not. Let’s say you check your email on Monday after releasing the first version of your routing product and you see a few more stakeholders coming into view:
Welcome to your network of stakeholders! Not one of these people or groups can be entirely ignored, but likely they can’t all get everything they want either. So how do you trade off all these stakeholders against each other? Well, that leads us nicely to our first team member…
The product manager is the team quarterback, or captain if you prefer English football to the American kind. They’re in charge of making sure the team is working on the right things at the right time, and managing the complex web of stakeholder needs.
Firstly they act as a triage point. Businesses are often fast moving environments with large numbers of vocal stakeholders: exposing the team directly to that can be hugely distracting. Product management does the vital job of working out which stakeholder asks need to be worked on immediately, which can wait, and who should do the work. This latter part is especially important with Machine Learning/AI as they’re much hyped technologies that everyone wants to be using. In my experience at least half the asks for an AI solution will be better solved by more prosaic approaches such as improving user interfaces, analytics, or data engineering.
Next they take the broad, unstructured requests from stakeholders and scope them into discrete tasks for the team to digest and work on. Stakeholders don’t know the inner workings of the algorithms and code, so can only point to aspects of the product that they don’t like or feel should be added. Translating these broad asks into a specific alteration that makes sense, works in tandem with all the other moving parts, and fully answers the need articulated is the nuanced art of scoping. This is a difficult but vital role and doing it well is crucial to the health of the product. In my experience, most project failures are scoping failures.
Finally, great product managers aren’t just sausage factories for taking in a blend of stakeholder requests and churning out prioritized, scoped work, but have their own vision of how the product should evolve. You may roll your eyes at the hackneyed term “vision”, but the ability to think ahead and say “this is what this product should become” in a way that actually makes people want to get out of bed and make it happen is a crucial element of success. Without vision, products tend to turn into a Frankenstein’s monster of what different stakeholder groups want the product to be, doing a bunch of things acceptably but none of them excellently. A good test: if you cleared all the requests from stakeholders away, would you still have a clear view of what you want to do?
You’re probably surprised that I’ve taken nearly 1,000 words to get to the data scientists. Whilst I should probably apologize for that I won’t, as it illustrates how much careful thinking needs to be done before one starts writing code.
Their first responsibility is selecting the right algorithm objective. For example, should our taxi routing aim to maximize the number of trips, the number of customers, the total revenue of a given taxi, or something else? Are some customers or journeys more important than others (e.g. female customers at night)? Though the objective can often seem simple at first, for any problem there’s a large number of possible objectives and careful selection of the right one is a key part of the data scientist role.
Once an objective is defined, that naturally leads to how the data scientist will assess accuracy against that objective. Again, this can take a number of forms. Our taxi data scientist could choose the average error in trip time prediction, either in absolute terms (which will focus the algorithm on longer journeys, where errors in minutes tend to be largest) or as a percentage of actual trip time (which will focus the algorithm on shorter journeys, where a small error in minutes is a large percentage). It’s also important to declare the accuracy criteria before starting, to avoid the less-than-scrupulous approach of just generating a large number of statistics and picking the ones that look best.
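To make the absolute versus percentage distinction concrete, here is a minimal sketch. The trip times and the two models are invented purely for illustration, and the function names are my own rather than anything from a particular library.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average error in minutes, regardless of trip length."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def mean_absolute_percentage_error(actual, predicted):
    """Average error as a fraction of the actual trip time."""
    return mean(abs(a - p) / a for a, p in zip(actual, predicted))


# Hypothetical trip times in minutes: two short trips and one long one.
actual = [5.0, 10.0, 60.0]
model_a = [8.0, 13.0, 60.0]  # 3 minutes off on the short trips, exact on the long one
model_b = [5.0, 10.0, 69.0]  # exact on the short trips, 9 minutes off on the long one

mae_a = mean_absolute_error(actual, model_a)              # 2.0 minutes
mae_b = mean_absolute_error(actual, model_b)              # 3.0 minutes
mape_a = mean_absolute_percentage_error(actual, model_a)  # roughly 0.30
mape_b = mean_absolute_percentage_error(actual, model_b)  # roughly 0.05
```

On these made-up numbers the absolute metric prefers model A, because the single 9 minute miss on the long trip dominates, whilst the percentage metric prefers model B, because being 3 minutes out on a 5 minute trip is a 60% error. Same predictions, opposite conclusions, which is exactly why the metric needs to be chosen up front.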
For any algorithm to give good results the training data must be carefully curated. Algorithms are trained to reflect the patterns and relationships present in the data we feed them, so poorly thought through training data can cause real problems for a data science product. For example, if our routing training data-set is only generated from private vehicles, and we’re designing an algorithm for taxis which can use bus lanes, we may set the right objective and get great accuracy on the hold out sample, but poor results in the field.
Lastly, the algorithm approach, which includes model selection, design and feature engineering. These are the mainstay of data science and often where most of the time is spent. There’s already a lot of literature on this topic, so I won’t add to that here apart from calling out a point that often gets missed: it’s important to think through how stable you need your model outputs to be. For example, if our routing model gives a slightly different route each time you ask it to get you from A to B that probably isn’t an issue, but if our pricing algorithm gives a different price every time you ask the same question that’s going to cause you a lot of problems!
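One simple way to think about stability is to ask the same question many times and check that you get one answer. The sketch below is an assumption-laden toy: the pricing formula, the noise term and the function names are all hypothetical, and deriving the random seed from the request is just one of several ways a team might make outputs repeatable.

```python
import random


def quote_price_unstable(distance_km):
    # Hypothetical pricing model whose noise term is re-drawn on every call.
    return round(2.0 + 1.5 * distance_km + random.gauss(0, 0.5), 2)


def quote_price_stable(distance_km, seed=42):
    # Same hypothetical model, but the random generator is seeded from the
    # request itself, so identical questions always get identical answers.
    rng = random.Random(f"{seed}:{distance_km}")
    return round(2.0 + 1.5 * distance_km + rng.gauss(0, 0.5), 2)


def is_stable(price_fn, distance_km, repeats=20):
    # Ask the same question repeatedly and check that only one answer comes back.
    return len({price_fn(distance_km) for _ in range(repeats)}) == 1
```

A check like `is_stable(quote_price_stable, 10.0)` is cheap enough to run as an automated test before each release, for whichever outputs you have decided must be repeatable.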
Once we’ve an algorithm ready it’s time to get it in front of some users, and for that we’ll need…
A truly great data science product can only be built on the foundation of great engineering. Your math can be perfect, but if the model is down half the time or full of bugs then it won’t be much use. Engineering are therefore crucial players, and have a number of roles.
In an ideal world, the data needed to train and build models would already exist, in a nice clean format with all the relevant attributes stored in a single, easily accessible dataset. Sadly, data nirvana is always out of reach: even in the best data environments, the needs from data constantly evolve as the business and its context evolve. Therefore, you’re going to need engineering help in data creation. This can involve a fair mix of detective work, ETL creation and, especially for large datasets, creativity around cost-effective storage and retrieval solutions.
Once you have your data and model you’ll need to string them together into a piece of software that can execute reliably when needed, or a “data pipeline” to use the jargon. This can range from a simple weekly “batch” run that drops a small text file in a shared folder to massive real-time “streaming” pipelines involving huge amounts of complex data. Also, if the algorithmic output is being consumed by another service (e.g. an app) then further engineering may be needed to ensure the data pipeline meets the standards of the service consuming it.
Finally, once this is all built there’s operations. Is the system up and running when it needs to be? Are all the downstream components and datasets correct and up to date? Do the outputs make sense? Keeping the ship running in a way people can rely on may not be the sexiest part of data science, but you’re nowhere without it. Now, any of my engineering colleagues reading this will be raising their eyebrows at the notion that good operations are just their job, and they’re right: it’s everyone’s job. But engineering is often the first line of defense in analyzing and triaging bugs, and building resilience into the system.
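The three operational questions above (is it running, is it up to date, do the outputs make sense) can each be turned into an automated check. What follows is only a sketch: the function name, the 24 hour freshness window and the 0 to 180 minute plausible range for trip times are all assumptions I’ve picked for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_pipeline_health(last_run, outputs, now,
                          max_age=timedelta(hours=24),
                          valid_range=(0.0, 180.0)):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []

    # Is the system up to date? Flag runs older than the allowed window.
    if now - last_run > max_age:
        problems.append("stale: last successful run is too old")

    # Did the run actually produce anything?
    if not outputs:
        problems.append("empty: pipeline produced no outputs")

    # Do the outputs make sense? Here, predicted trip times in minutes.
    low, high = valid_range
    out_of_range = [x for x in outputs if not low <= x <= high]
    if out_of_range:
        problems.append(
            f"out of range: {len(out_of_range)} of {len(outputs)} values"
        )

    return problems


# Usage with hypothetical values: a run from two days ago with one bad prediction.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
report = check_pipeline_health(
    last_run=now - timedelta(hours=48),
    outputs=[12.0, -3.0, 45.5],
    now=now,
)
```

Returning a list of problems rather than raising on the first one is a deliberate choice here: whoever is triaging gets the full picture in one go.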
Finally, any large scale data science powered product generates questions: is the product meeting its objectives? How do we know the new version of the algorithm is better than the old one? How well are the algorithms responding in specific scenarios? Answering these questions is the job of analytics.
Firstly, any data science driven product is built around specific targets. These are usually linked to the algorithm objectives, but can often be broader. For example, our taxi routing product might have targets such as a certain number of customers or a certain level of revenue. Tracking how the product is doing against those targets, and explaining over or under performance, is a key analytical role.
Next up, any product can generate an almost limitless number of questions from stakeholder groups. How fuel efficient is our taxi routing? How do waiting times vary by city? Are we over or under-serving different customer profiles? Whilst some of these will require the data scientist to dig into the guts of the algorithm, in most cases intimate algorithm knowledge isn’t necessary and you’re better off utilizing the rapid querying & data visualization skills of an analyst.
No product starts out great, so to really beat and stay ahead of the competition, testing is crucial. Websites and apps often have the luxury of randomized A/B testing, but even without that, creative analysts with good statistical knowledge can find ways to design tests to confirm that an amazing idea or new algorithm feature really is everything it’s hoped to be.
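Where a randomized A/B test is available, the core calculation is small. Below is a sketch of a standard two-proportion z-test using only the Python standard library. The scenario (share of ride requests that end in a completed trip under the old and new routing algorithm) and the counts are hypothetical, and a real analysis would also need to think about sample size, test duration and multiple comparisons.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: completed trips out of ride requests in each group.
z, p_value = two_proportion_z_test(
    success_a=200, n_a=1000,  # control: old routing algorithm
    success_b=250, n_b=1000,  # treatment: new routing algorithm
)
```

A positive `z` means the treatment group converted at a higher rate than control, and the `p_value` is what you compare against the significance threshold you declared before the test started.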
I’ve presented the team in a sequential order from stakeholder > product manager > data scientist > engineer > analyst, and whilst there’s some truth to that, in reality all team members need to be involved in all parts of the project life-cycle to some degree. Product managers can’t craft the right solution without understanding the data science, data scientists need to build algorithms that will actually run given the engineering realities, and engineering needs to keep a close eye on the product objectives when building out infrastructure.
Most importantly, a culture of shared success and mutual cooperation is vital to making the team effective. Victories need to be shared by all team members, and problems need to be something the entire team thinks through and overcomes together. If bumps in the road are met with finger pointing then you’ll quickly find that problems stay hidden and decisions are taken to protect egos, not drive results. Ultimately, customers and stakeholders aren’t going to care who exactly screwed up or deployed this or that feature: they just see a product that either does or doesn’t do what they want.
Building a great algorithm is hard, but building a great data science product can be downright fiendish. The outline I’ve provided should give you a view on the key pieces that need to be in place for you to have a fighting chance. A good team setup won’t guarantee your success, but a bad one will guarantee your failure.