Propelling Python Into the Next Decade: Anaconda's OSS…

This year marks Anaconda’s 10th anniversary, and we’ve been taking time to reflect on all we’ve accomplished as a company and as part of the Python open-source development community—and to think about where we are going from here. There is still so much to do, and in this blog post I’ll try to provide a snapshot of where we see some new challenges for Python and where Anaconda’s open-source development work is heading.

Before jumping into an extensive vision document like this, it is important to acknowledge its limited perspective. I want to talk about the problems that Anaconda open-source software (OSS) developers are focusing on, but this is not a proclamation that these are the problems everyone in the Python ecosystem should focus on. OSS works best when there is a diversity of perspectives, priorities, and skills operating asynchronously to explore the space of possible software. Only one of us has to uncover a good idea, craft a useful bit of code, or write an excellent document for us all to benefit. The goal should be to find a place where each of our perspectives and skill sets can have a positive impact, and this blog post is a reflection of where we think Anaconda is well situated to help move open-source Python forward.

Before talking about what we want to see in the next 10 years of Python, I first want to stop and discuss why we want Python to continue to be a popular language. Python has come to dominate a wide range of programming use cases for a variety of reasons, all of which ultimately come down to the ideal of a computational tool that anyone can use anywhere.

Of course, there are ways in which Python falls short of this ideal, but it has been successful enough at meeting it to become one of the top languages in the world. We want to lean into these strengths and keep pushing Python to become a computational tool that anyone can use anywhere. This is a challenge, as there are large segments of “anyone” and “anywhere” that Python will miss out on without exploring some new directions.

The goal of making Python into something “anyone can use anywhere” is becoming harder to achieve as we are presented with more and more computing options. For so much of Python’s existence, the vast majority of developers only had to worry about a small number of chip vendors and operating systems, new platforms were slow to emerge, and more exotic system architectures were generally only available to a handful of expert developers and users. A number of technology trends have permanently broken this status quo, and while many Python developers have worked hard to adapt Python to the new reality, there is still a lot of work to do.

Through much of the history of the microcomputer, we could rely on hardware performance to increase year over year, without requiring any fundamental changes to the software we were writing. This ceased to be true in the mid-2000s, when multi-core central processing units (CPUs) went from exotic to essentially mandatory for continued improvement in general-purpose performance. Since then, we have seen the rise of chip specialization to squeeze more performance out of a fixed power budget, including vector instruction sets, graphics processing units (GPUs), AI accelerators, on-chip hardware-accelerated video codecs, and an explosion of silicon startups trying to bring other novel ideas to market.

In one sense, Python, as a high-level interpreted language, is amazing for being able to run code on many different chip architectures, as long as someone has done the heavy lifting to port the Python interpreter and recompile all the needed extension modules. However, that same interpreter has many assumptions baked in that make use of multi-core CPUs challenging, and it offers no direct help for taking advantage of the multitude of specialized accelerator technologies (like GPUs) that are available. That work has fallen to a booming ecosystem of Python modules, but each has had to build its own compiler infrastructure essentially from scratch to support these new forms of computing. For the majority of smaller Python projects, adding accelerated computing support is still very daunting.

For most of us, the thing we program is a single “computer,” which we imagine as an operating system running on a CPU with access to data in one of three locations: RAM, “disk” (SSD or otherwise), and the network. RAM is much faster than disk, and disk in turn is much faster than the network. For compute-bound workloads, we often ignore the speed of the disk and the network as one-time startup costs, and imagine RAM as a uniformly fast source of data, regardless of how it is laid out in memory or which CPU core is accessing it. Data is either “in memory” or not, and we don’t need to worry about anything beyond that.

In reality, computer architectures are much more varied and complex. We have servers with huge numbers of CPU cores with non-uniform access to memory, potentially multiple levels of non-volatile memory with different speed and latency characteristics, accelerators (like GPUs) with their own memory systems and interconnect strategies, and extremely fast and low-latency networking technologies. The inside of a computer is looking more like a cluster, with resources connected by a dizzying array of bus technologies. At the other extreme, high-performance networking and cluster orchestration tools are making the boundaries between servers in a rack or a data center fuzzier. As a result, it will be critical to have Python tools (like Dask, to name one example) that allow users to easily partition computations across many compute domains, manage data locality, and move compute to data whenever possible.

Back in the early days of Anaconda (when we were still called Continuum Analytics), one of the common points of discussion I had with enterprise customers was whether and when they would be migrating some of their workloads to cloud computing services. Those timelines were often in the hazy future, but that time of transition is now mostly past. Cloud computing is part of every organization’s IT strategy (if only as part of a hybrid approach), and cloud technologies have trickled down to become part of many individual developers’ and research scientists’ toolboxes (see Pangeo, for example).

This trend does not bring many new challenges to Python itself, as most cloud APIs are already paired with excellent Python libraries. Instead, cloud computing has lowered the capital costs of using all the new technologies coming from the two hardware trends described above. In minutes, anyone with a credit card can have access to a server or a whole cluster with cutting-edge CPU, memory, and networking technologies. The only thing stopping such a user from benefiting from this amazing hardware is the flexibility of their software stack and the ease with which they can port their code to actually take advantage of it (or the lack thereof).

When the Anaconda Distribution was first created, the hardest problem to solve was getting the most popular Python data science packages built and working on Windows, Mac, and Linux. Although incompatible in various ways, these three desktop operating systems are much more similar than they are different. Python is well-adapted to a UNIX-like world of command lines, files, sockets, processes, and threads, which are present in some form on every popular desktop operating system. With some caveats, a Python user can feel very at home developing on any one of these platforms.

However, the past 10 years have seen a massive shift in the computing landscape. More people than ever before have access to a “personal computer,” but they are not the traditional “PCs” that Python grew up with. The most common personal computing systems are now phones, tablets, and Chromebooks. Even on desktop systems, most users spend a huge amount of time in a web browser. As a result, the most popular software platforms in 2022 are (1) the web and (2) mobile operating systems like iOS and Android. What distinguishes these new platforms is how radically they differ from the desktop operating systems of old. They impose significant security restrictions on file and network access, they have no built-in command-line interfaces by default, and their software distribution mechanisms are entirely unfamiliar to someone used to writing Python scripts. Nevertheless, web browsers and mobile devices are where the next generation of Python users need to learn how to code and share their work. How do we give them the tools to do so?

We can’t tackle these challenges with one magic solution, but rather we need several overlapping, complementary strategic directions that will support each other. We’ve identified five areas here at Anaconda where we believe we are well positioned to have an impact. Each of these areas is bigger than any single open-source project, but I’ll mention a few example projects along the way. The list will not be comprehensive, nor should it imply that we think other projects don’t also address these issues. We will need a variety of approaches from many different groups to tackle the challenges before us.

The key tool in the mapping of human-readable programs to hardware-executable machine code is the compiler. We might imagine that Python, as an interpreted language in its most common implementation, doesn’t need to worry about compilers, but we’ve seen evidence that compiler techniques are essential for enabling Python to improve performance. But Python doesn’t need a single, universal compiler; it needs a whole range of compiler tools to address different needs.

At the broadest scale, we want to see Python interpreters that can incorporate just-in-time compilation methods to remove some of the overhead of executing a dynamically typed language. As many have pointed out, Python’s design makes this very challenging, but there is a lot more we can do for the majority of Python users out there. Efforts like the Faster CPython work happening at Microsoft, as well as projects like Pyston and PyPy, are all using various compiler techniques to try to speed up every Python program. More work can be done here, and not only on single-threaded performance. We also think the nogil fork of CPython is very promising and hope to see ideas from it incorporated into the Python interpreter.

Complementary to the broad approach of improving the Python interpreter, there is also space for more focused compilers to tackle specific domains or unique hardware targets. As one example, Anaconda has been developing Numba with the PyData community for a decade at this point, and it has been adopted by a wide range of users doing scientific and numerical computing. Numba’s focus on numerical computing, primarily with NumPy arrays, has allowed it to achieve extremely high speedups (anywhere from 2x to 200x) on user code. We’ve seen various ways that Numba’s limited scope and extensibility have allowed it to be ported to non-NumPy data structures as well as to new hardware targets. We think this idea could be pushed further to inspire the creation of a set of modular components for ahead-of-time, just-in-time, and hybrid compiler use cases. This would further expand adoption of compiler methods in the ecosystem, and make Numba just one example of a broad class of compilers used in Python projects.
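
To make this concrete, here is a minimal sketch of the kind of code Numba accelerates: a toy Monte Carlo loop of our own choosing (not taken from any particular project) that the @njit decorator compiles to machine code the first time it is called.

```python
import random
from numba import njit

# A toy Monte Carlo estimate of pi: @njit compiles this pure-Python
# loop to machine code on the first call.
@njit
def monte_carlo_pi(n_samples):
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(monte_carlo_pi(10_000_000))  # subsequent calls skip compilation
```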

Adjacent, and very relevant, to the compiler discussion is the problem of distributing compiled software. This is where we continue to believe conda will be extremely important, as we need to be able to deliver a wide range of compiled libraries and applications (not just Python) to an ever-increasing number of platforms. Some of these platforms will be very unusual, like web browsers and embedded systems, so conda’s capabilities will need to continue to grow.

One underappreciated aspect of the success of projects like Numba and Dask is that their creation depended on the widespread prior adoption of projects like NumPy and pandas. These projects gave Python users a common vocabulary and mental model for working with data in bulk as multi-dimensional arrays (NumPy) or dataframes (pandas). In the case of Dask, these high-level APIs were generally quite amenable to parallel computation and could be ported directly to Dask’s container APIs. This meant that a NumPy or pandas user could switch to Dask, parallelize their computation over dozens of nodes, and not actually have to learn many new concepts. In the case of Numba, the benefits of NumPy usage are lower level. Because NumPy arrays have a well-understood and simple memory layout, it is possible for tools like Numba to generate efficient machine code on the fly to work with NumPy arrays, in a way that is not possible for data stored as collections of Python objects. Without broad use of NumPy arrays as the standard way to manipulate numerical data, Numba would have had a much harder time gaining adoption in the ecosystem.
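
As an illustration of that shared vocabulary, here is a brief sketch of how a pandas user might scale out with Dask; the file glob and column names are hypothetical.

```python
import dask.dataframe as dd

# Read a directory of CSV files as one logical dataframe, partitioned
# automatically (the path is hypothetical).
df = dd.read_csv("data/2022-*.csv")

# The familiar pandas API, evaluated lazily and in parallel across
# partitions; compute() triggers the actual work.
result = df.groupby("category")["value"].mean().compute()
```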

Arrays and dataframes have been much of the focus of the last decade, but there are other kinds of data out there. We’ve recently become more involved in projects to promote other data models in the Python ecosystem, such as the Awkward Array project. Awkward Array comes from the high-energy physics community, which needs to work with large semi-structured datasets that have nested lists of variable size and missing data. Data of this variety is very awkward (hence the name) to work with in the form of NumPy arrays or pandas dataframes, but is seen in a variety of applications. Similarly, we have also been working on improving adoption of sparse matrices, especially the expanded definition of them that comes from GraphBLAS. GraphBLAS is an attempt to create the standard building blocks of graph algorithms using the language of linear algebra. We think there is much potential for tools like GraphBLAS in the Python ecosystem, and their usage will make future work on parallelizing and scaling computations on sparse and irregularly-shaped data much easier.
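
For a flavor of what Awkward Array makes natural, here is a small sketch (with made-up “event” data) of vectorized operations over ragged, nested records:

```python
import awkward as ak

# Variable-length, nested records: painful to express as rectangular
# NumPy arrays, natural as an Awkward Array.
events = ak.Array([
    [{"pt": 54.2}, {"pt": 23.1}],
    [],                                  # an event with no entries
    [{"pt": 101.0}, {"pt": 7.4}, {"pt": 12.9}],
])

pts = events["pt"]             # still ragged: [[54.2, 23.1], [], [101.0, 7.4, 12.9]]
leading = ak.max(pts, axis=1)  # per-event maximum; the empty event yields None
```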

Thanks to projects like Dask (and several others), Python has a robust set of tools for working with large data sets on clusters of computers. The essence of these tools is to partition datasets into independent pieces, minimize data motion, and bring the computation to the data whenever possible. This enables efficient scaling in a world where communication latency and bandwidth are significant constraints. At the same time, these tools provide a very useful abstraction layer over the cluster, allowing a user to program it as if it were a single system, leaving a scheduler to decide how and where to execute code. Making the distributed computing paradigm as easy as possible will continue to be an ongoing goal here at Anaconda.

It is equally valuable to think about programming a single computer as if it were a cluster. At a hardware level, this is already somewhat true, as systems can be composed of multiple CPU cores with “preferred” memory regions where access is faster. Additionally, computation within a server could be happening on accelerator devices (like GPUs) with memory distinct from the CPU’s. The biggest problem, however, is somewhat unique to Python: the global interpreter lock (GIL) limits the ability of even compiled extensions to fully utilize a high-core-count system from a single Python process. Long term, we need to tackle the GIL (but not only the GIL!) and the other issues that hold Python back in multithreaded workloads. In the meantime, tools like Dask provide a good interim solution: starting a set of Python processes on a single system and partitioning work between them. Dask’s lightweight architecture is perfect for use on a single computer, but there are additional tricks we can play to make the “single system cluster” use case more efficient. Given the availability of very large servers on the cloud, we think there is a lot of potential upside to improving the efficiency of Python on high-core-count servers.
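
As a sketch of the “single system cluster” idea, Dask’s distributed scheduler can be pointed at one machine; the worker and thread counts below are illustrative, not a recommendation.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Treat one large server as a small cluster: several worker processes
# (sidestepping GIL contention in pure-Python code), each with a few threads.
cluster = LocalCluster(n_workers=8, threads_per_worker=2)
client = Client(cluster)

# A chunked array larger than we would want to handle in one piece;
# the scheduler spreads the chunks across the worker processes.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
print(x.mean().compute())
```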

Much of the previous discussion has focused on computation: how to describe computation, how to speed it up, and how to partition it across many computers. Equally important, but less talked about, is the I/O side of the equation. Where is your data and how easily and efficiently can you load it?

The most obvious answer is to put all your data in one place, in a small number of file formats. This sort of standardization is an attractive option, and some organizations can spend the resources needed to create and enforce consistent usage of a data lake. However, many organizations (and most individuals) do not have the power to fully enforce one, leaving data workers to deal with a proliferation of data sources across many different systems and file formats. In fact, the ability to quickly integrate a new data source into an analysis, without having to wait for a proper data lake integration, can be a competitive advantage for some groups.

In much the same way we are betting on computational heterogeneity, we are also betting on data heterogeneity. Data will live where it lives, and while centralizing and organizing it is a worthy goal for many groups, the task will never be complete and tools need to acknowledge that reality. Toward this end, we work on a number of projects designed to give Python users easy access to the widest possible range of data storage systems and file formats. Fsspec has become a popular filesystem-like abstraction layer for many projects (including Dask) that allows files stored locally, in object storage on the cloud (like S3), or in various server APIs (like Google Drive), to all be treated the same way. On top of these abstractions, we created Intake, which offers a flexible plugin architecture and a lightweight YAML-based specification for catalogs of data sources. Intake is designed to work equally well with tools like NumPy and pandas, as well as Dask for larger distributed workloads.
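
A brief sketch of what that abstraction looks like in practice (the bucket and file names here are hypothetical):

```python
import fsspec

# The same open() call works for local paths, S3 objects, HTTP URLs,
# Google Drive, and more.
with fsspec.open("s3://hypothetical-bucket/measurements.csv", "rt", anon=True) as f:
    header = f.readline()

# Remote object stores can also be treated like filesystems.
fs = fsspec.filesystem("s3", anon=True)
parquet_files = fs.glob("hypothetical-bucket/2022/*.parquet")
```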

The most recent of our additions to the data access space is a project called Kerchunk, which allows indexing of archival data formats that are usually not designed for efficient cloud computing. Kerchunk can scan a potentially large set of files in a variety of formats (such as NetCDF, TIFF, etc.) and build a metadata index of the data to enable modern tools, like Xarray, to access this data much more efficiently, and without the need to transcode the data first. Being able to make all data, even older data, easily usable in a cloud environment is a key part of our Python strategy.
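
Roughly, a Kerchunk workflow looks like the following sketch. It reflects our reading of the current APIs and uses a hypothetical S3 path; real pipelines typically scan many files and combine the resulting references.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

# Scan one archival NetCDF4/HDF5 file and build a Zarr-style reference
# index (a dict of metadata and byte ranges).
url = "s3://hypothetical-bucket/archive/temperature.nc"
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()

# Open the original, untranscoded file through the reference index.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="s3",
                       remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})
```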

Ten years ago, there was a tendency among many PyData developers to ignore platforms they personally did not use, usually Windows. Resources were scarce, and many maintainers were volunteers, so why should they pay attention to a platform that they were unfamiliar with and that was hard to test? This was not an unreasonable attitude, but it would have excluded a huge number of users from the Python ecosystem if not for the efforts of a handful of enthusiasts (shout out to Christoph Gohlke, among others) who kept Windows support in PyData alive. Windows users of Anaconda were some of the most grateful users I encountered at conferences when I was working at the Anaconda booth. Eventually, the emergence of free continuous integration services with Windows, macOS, and Linux support made all three major desktop operating systems nearly co-equal citizens in the PyData ecosystem.

I believe we are on the verge of making a Windows-scale mistake again by treating the web browser and mobile operating systems as “too strange” to be first-class Python platforms. For sure, the limitations of these platforms are much more significant than the differences between Windows and Linux. Nevertheless, the sheer number of users for whom these are the primary computing platforms is enormous. We need to jump in and figure out what Python should be on these platforms, or risk having a whole generation of computer users miss out on the great things that the Python ecosystem has to offer.

Now, we don’t have to start from scratch here. Python has been visible in the browser for many years thanks to “notebook” projects like Jupyter (and influential older projects, like Sage). Jupyter provides a browser-based front end to running code in a kernel (usually Python) that executes outside the browser. The Python kernel runs as a normal application somewhere, either on the user’s computer, or a remote server, and communicates with the web-based frontend via a websocket. This architecture cleverly sidesteps all the constraints of code execution inside a web browser, but means that a Jupyter notebook is not truly self-contained. I can view a notebook, but cannot interact with it, unless I create a Python environment somewhere and start up Jupyter to load the notebook.

Our long history of work connecting Python to the browser also includes the stack of data visualization projects we helped to create and popularize, such as Bokeh, Datashader, HoloViz, and Panel. These projects are all designed to empower Python developers to create live, interactive data visualizations in the web browser, but without having to become experts in an entirely new stack of front-end technologies. We’ve seen many users create amazing things when given the ability to target the browser from Python.

But what if we could go one step further and push the Python runtime itself into the browser? In fact, JupyterLite is already doing this for notebooks! Thinking more broadly, there are a lot of great things we can build if we treat Python as a first-class language inside the browser and leverage the extensive library of Python packages we already have. Along these lines, we are very interested in the potential for WebAssembly to bring many language communities (not just Python) to the browser. Python is already a universal glue between many languages and libraries on the desktop. Putting it into the browser (with projects like Pyodide and PyScript) has the potential to glue the best packages from the Python world together with JavaScript. The work will be challenging, and many things will simply not be possible within the confines of browser security models, but even support for some or most of what Python can do will enable many great applications to be built.
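
For instance, inside a Pyodide-based environment (including PyScript), pure-Python wheels can be installed straight from PyPI at runtime; this sketch follows a pattern similar to the micropip examples.

```python
# Inside a Pyodide-based REPL or a PyScript block, Python runs directly
# in the browser, and top-level await is available.
import micropip
await micropip.install("snowballstemmer")  # pure-Python wheel fetched from PyPI

import snowballstemmer
stemmer = snowballstemmer.stemmer("english")
print(stemmer.stemWords("running jumped easily".split()))
```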

For mobile operating systems, the situation is somewhat simpler. Both iOS and Android are much closer to macOS/Linux than web browsers are, and Python can run on them with far fewer modifications. Most of the challenges here relate to providing Python bindings for native platform APIs (both GUI and other platform services) and automating the unfamiliar process of packaging applications for the various app stores. This is where a project like BeeWare is very exciting, with subprojects tackling each of these issues: Toga (GUI), Rubicon (native API bridges), and Briefcase (app packaging). The more platform knowledge that can be baked into BeeWare, the easier it will be for a Python developer of any skill level to turn their prototype into an app and share it with a much wider audience.
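
To give a sense of the developer experience, here is a minimal Toga app, loosely following the BeeWare tutorials; Briefcase can then package the same code for desktop and mobile app stores.

```python
import toga

# Build the app's UI: a single box containing one button.
def build(app):
    box = toga.Box()
    button = toga.Button("Say hello", on_press=lambda widget: print("Hello!"))
    box.add(button)
    return box

def main():
    # The app name and identifier here are placeholders.
    return toga.App("Hello World", "org.example.helloworld", startup=build)

if __name__ == "__main__":
    main().main_loop()
```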

So, can we do it? In short, yes! The explosive growth of the PyData ecosystem over the past decade has given today’s Python users capabilities and features that were difficult to imagine in 2012. Interactive data visualizations in the browser, GPU-accelerated computing in Python, and effortless distributed computing in a Jupyter notebook all seemed like distant goals back then, but they are a reality now. Imagine what we can do in another decade with an even bigger community than we had back then.

But, we will have to do it together. The ideas described in this post will be much of the focus of Anaconda’s OSS efforts, but these ideas will not reach their potential if our small team is the only one working on them. We hope others will also be inspired to tackle these areas, either by joining us on key projects, or going off and trying out their own ideas. As long as we stay focused on our individual strengths, continue to expand the community of contributors, and learn from each other, great things will happen!

Note: This post was recently adapted for and published by The New Stack.
