Building up a Data Science Team from Scratch


There are plenty of reasons for companies to incorporate data science and machine learning into their business. It can allow you to better understand and predict customer behavior, automate repetitive manual tasks, detect errors and anomalies faster, evaluate business decisions with data instead of mere intuition, get an edge over competitors, give marketing campaigns more punch, check a box for investors, attract more talent for the organization, or brag about it at conferences. Many companies face the challenge of building up a data science team from scratch and it can be hard to figure out how to start.

In 2016, I was the first hire of a new data science team, with little infrastructure or strategy in place. Over the years, we had many different challenges to solve and mistakes to learn from as the team matured. This post is about what I learned about building up a data science team, from both my own experience over the past years and conversations with other data scientists in similar situations.

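In a nutshell, these are the points that I will argue for:

1. Hire generalists rather than specialists.
2. Give the team as much autonomy as possible.
3. Go for the simple and boring projects first.
4. Set the right expectations across the company.
5. Do not just test the waters; if you invest in data science, go all-in.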

I am not claiming that these are golden rules for success that work for everyone, since every company has different goals and requirements that need to be considered. Instead, think of them as a checklist of topics on which I think companies should make informed strategic choices, regardless of whether they reach the same conclusions as I do.

Especially in the early stages of a company, you will not be able to predict exactly which use cases you will encounter and which skills you will need. If you have a choice between a candidate with deeper knowledge and one with broader knowledge, your organization will probably benefit more from the one with broader knowledge. Someone who knows a little bit about a broad range of topics (e.g. hypothesis testing, data cleaning, natural language processing, image classification, anomaly detection, clustering, time series forecasting, neural networks, object-oriented programming, databases, unit testing, containerization, REST APIs, cloud computing, distributed systems or data visualization) will have an easier time in a new data science team than someone who is an expert in exactly one of those things but naive about the others.

Generalists might need more time to get to a good solution for a project, but they will be more flexible, have a better eye for new business opportunities, and be able to communicate with stakeholders with more context. Even if the stars align and you happen to be a specialist in exactly what a project needs, you usually still cannot jump straight to the implementation. Just like the generalist, you typically still need an extensive research and learning phase to make sure that your knowledge is not outdated, because the state of the art in data science and the frameworks used to implement it change very quickly. What requires a specialist today could be automated away by an elegant new library tomorrow. Besides, data science projects usually have very different requirements that demand unique solutions. So the overall benefit of being a specialist is typically lower than expected, because even specialists will not be able to provide answers on demand without more research on a particular problem.

There are two pitfalls to avoid when looking to hire generalists. First, nobody will be an expert on everything, so do not try to look for a data science unicorn. Candidates should be able to explain the basic problem and approach behind a broad range of concepts, but you cannot expect them to have detailed knowledge of everything. Be relieved to hear “I have no idea” as an answer at some point in an interview, because it shows that candidates know about their knowledge gaps and are willing to admit them. This separates them from the many candidates who try to hide their weaknesses by name-dropping a series of arbitrary technical terms when they are lost. Second, even though generalists will not know all the details of every topic, they should have deeper knowledge about something. Ask them to explain the specifics of past projects, implementation details of the technique they are most familiar with, or what the last complicated bug they fixed was. Otherwise you have no way of differentiating candidates who just talk about data science from the ones with hands-on experience.

There are a lot of things to figure out before a data science team can start working: Which projects should they work on first? What data is available and how can it be accessed? Does the team build software that runs in production or does it generate insights that guide business decisions? How should the results of the work be exposed to users and stakeholders? Which languages, frameworks or cloud providers can be used? The list goes on and on, and getting the right answers and setting up the first basic infrastructure requires a lot of communication and support from other teams in the company.

Aligning your plans with the needs and goals of other teams is crucial. But once you know what you need to build, the team should have as much autonomy as possible to figure out how to build it, because hard dependencies on other teams can severely block the progress and slow down the momentum of a new data science team.

The most important factor in making a team autonomous is to build a team structure with a broad and balanced skill set that covers data science projects from end to end. If you have a team that consists only of data scientists (with the core expertise of analyzing data to identify the best statistical models), then there will be a lot of friction with other teams, since the data scientists have to ask other departments every time they need help with something outside of their domain (getting access to new data, setting up databases, making software architectures more scalable, deploying applications to production, etc.). This will block the progress of a new data science team too much, because it will need that help often and cannot count on getting it quickly and consistently. Instead, it is increasingly common for companies to adopt the framework that Spotify is known for and combine all the necessary roles within a single autonomous team. For example, if the main output of the team is software, the team could consist of:

It is not important that the resumes of the team members neatly fit into these job titles, but it is important that these skills are covered within the team to maximize autonomy. And as I mentioned in the previous section, this will work better by hiring generalists with a broader understanding instead of throwing several specialized domain experts with little common ground into one team.

Autonomous and cross-disciplinary teams not only help to make progress smoother and avoid external blockers, they are also a great way to enable a productive learning culture in which team members can share knowledge and teach each other new skills. In the long run, this will, for example, give software engineers a better understanding of what they need to optimize for, and data scientists a better understanding of how to structure their work in a scalable way from the beginning.

Apart from the team structure, another important aspect of autonomy is to give the team as much freedom as possible to choose the tools they are going to work with. Do not make it a company policy to only use TensorFlow and Python on AWS. Standardization can have several advantages, like making deployments and code reviews easier, but the landscape of data science tools is still very scattered and there is no one framework that can do everything. Having more freedom to switch between different tools gives the team more flexibility and avoids situations in which a framework has to be forced on a problem for which it was not designed. Also, especially at the beginning of a new team, it helps a lot if team members can choose tools they are already familiar with to get to first results faster.

Ideally, before the first team member is even hired, you already have a clear strategy in place for how exactly you will leverage data science for your business and which concrete projects are the most important ones. More often than not, this is hard to determine a priori and you will have to revise your data science strategy iteratively along the way. But even if your plans are not set in stone yet, you will need better reasons than “I heard it works” or “everyone else seems to be doing it”. Talk to the data science enthusiasts in the company and discuss what the biggest pain points in the company are and what kind of data you collect. If there is an overlap and data science seems like an obvious approach to improve a process, this is a good foundation for a potential data science project.

Once you have some project ideas, you need to prioritize them and decide what the team should work on first. Say you have a choice between:

What makes for a better first project? For a new data science team, it is very important to go for the simple and boring projects first.

Nobody will be impressed by it, your data scientists will say that this is not what they went to school for, your marketing and sales team will be disappointed and the project will not have a big impact. But the project will allow you to learn as fast as possible what infrastructure and processes the team needs to complete data science projects from end to end. Figuring out how to bring any project over the finish line will already be incredibly complex and full of unexpected stumbling blocks. Do not increase the complexity even more by using the most advanced machine learning methods in the most experimental applications. You do not want to spend three months tuning hyperparameters for a deep reinforcement learning model with a custom Q-function, only to find out afterwards that you cannot actually deploy this model without another six months of data engineering work and that the use case is actually not quite as you thought it was. Instead, look for cases where a vanilla logistic regression, basic summary statistics or a simple visualization will already be useful.
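To make the "basic summary statistics" option concrete, here is a minimal sketch in Python that uses only the standard library. The daily order counts are made up for illustration, and the helper names `summarize` and `flag_outliers`, as well as the threshold of two standard deviations, are my own assumptions rather than a recommendation for any particular use case.

```python
from statistics import mean, median, stdev


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def flag_outliers(values, threshold=2.0):
    """Return (index, value) pairs that lie more than `threshold`
    sample standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical daily order counts (made-up numbers for illustration only).
daily_orders = [120, 118, 125, 122, 119, 121, 40, 123, 117, 124]

stats = summarize(daily_orders)
outliers = flag_outliers(daily_orders)

print(stats)
print(outliers)  # flags index 6, the day with only 40 orders
```

A check like this is deliberately unsophisticated, but it is the kind of thing that can be put in front of stakeholders within days, which is the point of a first project.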

In the best case scenario, you will find a project that is both simple and highly valuable to the business. But if you have to choose, go for simplicity. The main goal of the first project is not to revolutionize the business, but to get a feel for what the company needs, who the stakeholders are, which skills are still lacking in the team and which infrastructure is required. To get to this point faster, make sure that the team is not trying to over-optimize the first solution. Data scientists tend to focus on improving model accuracy and paying attention to statistical details. Try to emphasize that simple implementations and fast iterations are more important than accuracy at this point. The earlier you finish your first project, the sooner you will be able to figure out what the team actually needs to be successful.

Data science, machine learning and AI are still highly controversial topics and people have vastly different opinions on them. Common views are:

As someone who has already read this post up to here, chances are that you are more on the optimistic and realistic side of the discussion. Before you can convince anyone else, you have to be clear about your own expectations. If someone asks me why companies need data science, my pitch usually goes something like this:

No matter what your pitch is like and where you stand in this debate, there will always be people in your company who are more pessimistic than you, more optimistic than you or who simply do not know what data science is about. That is why it is very important to set the right expectations. Otherwise it will be hard to work together, and misunderstandings and conflicts will follow.

Discuss with as many stakeholders as possible what you want to achieve and what they can expect from data science in your organization. Explain that it can take a very long time to build up a data infrastructure before there will be any results. Make it clear that projects need a lot of experimentation and can always fail, and that you cannot know in advance how accurate a model will be or how long it will take to build it. Debunk myths about the magical powers of AI and stress the importance of data quality. Convince skeptics of the long-term value of becoming a more data-driven organization. Be aware that different groups of people (e.g. engineers, managers, marketers or lawyers) will care about different aspects, and that you need to tailor your communication to them to get everyone on the same page.

Data science is still an extremely vague term to most people, which makes it a breeding ground for misconceptions. Try to prevent this and do not let all the technical issues on your plate distract you from the importance of communication and setting the right expectations. It is far better to over-communicate than to under-communicate, especially at the beginning of a new team.

Companies that are just getting started with data science sometimes try to test the waters by hiring one or two data scientists, letting them do their thing and seeing what happens. The motivation behind this is understandable, because there is a lot of uncertainty around the question of how data science initiatives will work out, so companies do not want to risk too much at first. They want to invest a little bit, get a little bit of value out of it and learn from the experiences before investing more.

Unfortunately, this is rarely a good idea. It will more likely give you zero business value instead of a little, and you will have a hard time learning anything substantial. Your data scientists will not be able to properly finish projects, they will get stuck in endless prototyping, and it will be impossible to set up an effective feedback cycle to iterate on and learn from. The only thing you can really learn at that point is that data science does not work without a solid foundation.

Stop thinking about data scientists like Swiss Army knives who can do everything that is needed for data projects. Being good at statistics and machine learning is only a small part of the equation and you will need to fill other roles too. Depending on your goals, you might need dedicated software engineers, data engineers, product managers, DevOps engineers or UX designers. Apart from that, do not let the team become an isolated island in the company landscape, where nobody really knows what they are doing, but old men with grey beards sometimes tell stories by the fireplace about the magical things that are happening there. Instead, integrate the team with core processes of the company and let cross-departmental information flow in and out frequently. Get other teams involved as much as possible, facilitate communication and work towards shared goals.

Not every single company needs to invest in data science. But if you do think that it is important for your industry, then go all-in and make sure you do it right. Data science is not a plug-in module that you can just add to a company by reserving a few seats in the building for data scientists. That will not lead to any value and will only delay your progress toward becoming a more data-driven organization.

Of course there are many more topics to consider, like how to choose tools and frameworks, how data scientists and software engineers can best work together, or how to establish a continuous learning culture. I will save those for a potential part two.

It is one thing to have a plan for how to build up a team, but putting it into practice effectively is an entirely different beast. I am still learning new things about this process every week and I am still far away from where I would like to be. The data science industry is still in a Wild West state and it will take a while until we have found a more robust recipe for effective data science teams and established more standardized processes.

Thanks for reading! I am looking forward to hearing about your experiences.
