

Distributed Computing for Data Scientists

An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management.

Solving simultaneously for all sides of this problem is a cultural and political challenge as much as a technical one. This is the problem that we’re passionate about solving at Coiled, and that we recently spoke about in our PyCon 2020 talk.

In this post, we’ll discuss the data scientists’ pain points and the tooling you need around the core distributed computing technology to address them. In future posts, we’ll do the same for IT and for management.

The end users of these cloud platforms or on-premise clusters are typically data engineers, data scientists, or analysts. These data professionals are used to working on their laptop, where they have full control and full visibility over their system, can modify it at will, and use familiar tools.

They switch to a distributed system because they have to, often because of scale, not because they want to. To be successful, we need to give them an experience that is as flexible and comfortable as what they had before; otherwise, they find ways to stay on their laptops.

We often see this reduce to three main challenges: managing software environments, sharing cluster resources, and finding and accessing data.

Solving these problems for data scientists is particularly difficult because data science workloads are less predictable than the data engineering workloads for which many of today’s cluster resources are currently optimized. We’ll call out these challenges in each of the sections below.

How can I pip/conda install the latest scikit-learn onto the cluster?

Data scientists change their software environments several times a day. They tend to use a combination of conda environments, Python packages installed with pip, custom source code in a local /src directory, and functions defined in Jupyter notebooks, each of which changes rapidly, and many of which depend on bleeding-edge software. Data scientists live on the edge of software development.

However, most distributed infrastructure today is designed around slower-moving data engineering or production workloads, which typically use Docker images updated on a weekly or monthly basis. The Docker image workflow doesn’t make sense for data scientists: there is too much friction, and Docker is not a familiar tool for many of them.

Data scientists need a system that lets them modify their Python packages and then automatically builds and deploys Docker containers for them.
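As a point of reference, the Dask ecosystem already offers a small piece of this puzzle: the PipInstall worker plugin in dask.distributed can push a pip package onto every worker of a running cluster. The sketch below assumes an existing Dask cluster; the scheduler address is a placeholder.

```python
# A minimal sketch, assuming dask.distributed is installed and the
# scheduler address below is replaced with a real one.
from dask.distributed import Client, PipInstall

client = Client("tcp://scheduler-address:8786")  # placeholder address

# Install the latest scikit-learn on every worker and restart them so
# the new version is importable in subsequent tasks.
plugin = PipInstall(packages=["scikit-learn"], restart=True)
client.register_worker_plugin(plugin)
```

This covers quick package tweaks during a session; a fuller solution still has to rebuild the underlying environment or image so that newly launched workers start with the right software.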

Is anyone using the cluster? Is it ok if I take over all of the GPU nodes for the evening?

Groups transitioning from ad-hoc in-house clusters may be familiar with emailing colleagues to ask if they can take over the cluster for a while. This works well for smaller groups but falls over in larger organizations. Fortunately, there are mature and robust resource management systems, such as YARN, Kubernetes, HPC job schedulers like PBS/SGE/SLURM/LSF, and various cloud APIs, that can manage sharing resources between different groups.

However, the policies around these systems are typically not well tuned for data science workloads, which are interactive and bursty rather than steady and scheduled.

Supporting these highly volatile workloads is critical to get right. When it fails, data scientists either sit waiting for resources or hold onto nodes they aren’t using, and many eventually lose trust in the shared system and retreat to their laptops.
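One concrete pattern that helps with volatile workloads is adaptive scaling, where the cluster requests resources only while work is queued and hands them back when the user goes idle. Below is a hedged sketch using dask-jobqueue on a SLURM system; the queue name, resource sizes, and walltime are illustrative assumptions, not recommendations.

```python
# A minimal sketch, assuming dask-jobqueue and access to a SLURM queue;
# the queue name, core/memory sizes, and walltime are placeholders.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="gpu",            # hypothetical partition name
    cores=8,
    memory="32GB",
    walltime="02:00:00",
)

# Scale between 0 and 20 workers based on queued work, so idle sessions
# hand resources back to the shared scheduler instead of holding them.
cluster.adapt(minimum=0, maximum=20)

client = Client(cluster)
```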

Where is my data?

As data scientists working on our laptops, we’re accustomed to managing data files on our hard drives. We have tools to discover datasets, move them around, download them from the web, and so on. Tools like web browsers and file explorers are familiar to us from decades of personal computing use.

However, when our data moves out to some remote storage, we’re asked to learn a whole new set of tools. Those tools are rarely as ergonomic as the operating system on our laptop, and, even worse, they tend to be very different depending on where we get our data (S3 looks different from HDFS, which looks different from a database).
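Libraries like fsspec narrow this gap by giving several remote storage systems a common, filesystem-like interface from Python. The bucket and paths below are made-up placeholders.

```python
# A small sketch of a uniform interface over remote storage with fsspec
# (here backed by s3fs); the bucket and path are hypothetical.
import fsspec

fs = fsspec.filesystem("s3")  # could also be "hdfs", "gcs", "file", ...
print(fs.ls("my-company-datalake/weblogs/"))  # list remote objects like a directory

with fs.open("my-company-datalake/weblogs/2020-05-01.log") as f:
    first_line = f.readline()
```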

As a result we often miss out on critical data that our company has painstakingly collected for our benefit, or we rely on the same data subsets that our peers have previously downloaded. Ease encourages centralization.

Data science workloads are particularly challenging here (relative to data engineering or analysis workloads) because data scientists are very often asked to fuse many disparate data sources into novel analyses. For example, we might be asked to join web logs stored as Parquet on S3 with the customer database stored in Snowflake. Knowing how to find and authenticate against many systems is the first and largest roadblock we face.
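To make that fusion concrete, here is a hedged Dask sketch that joins Parquet web logs on S3 with a customer table pulled from a SQL database. The bucket, table, column names, and connection URI are all illustrative placeholders; a generic SQL source stands in for Snowflake.

```python
# A minimal sketch with dask.dataframe; the bucket, table, columns, and
# connection URI are hypothetical placeholders.
import dask.dataframe as dd

# Web logs stored as Parquet on S3 (requires s3fs).
logs = dd.read_parquet("s3://my-company-datalake/weblogs/2020/*.parquet")

# Customer records from a SQL database (requires sqlalchemy and a driver).
customers = dd.read_sql_table(
    "customers",
    "postgresql://user:password@db-host/analytics",
    index_col="customer_id",
)

# Fuse the two sources into one analysis-ready dataframe.
joined = logs.merge(customers, left_on="customer_id", right_index=True)
```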

These are the types of data science problems we’re building products to solve here at Coiled. The truth is, we’re really excited to be building products for scaling data science in Python to larger datasets and larger models, particularly for data scientists and teams that want a seamless transition from small data to big data. If the challenges we’ve outlined resonate with you, we’d love it if you got in touch with us to discuss our product development.

— Hugo, Matt, and the whole Coiled team.
