5 Steps to Propel Python into the Next Decade

Since Anaconda’s inception a decade ago, we’ve seen Python rise to become the most popular programming language, dominating a wide range of programming use cases. Among many positives, the language is easily learnable for non-traditional programmers, versatile for experts to create some of the most complex software to date, and readily available for many platforms. We want to usher Python forward to become a computational tool that anyone can use anywhere — no matter their expertise.

In looking ahead to the next decade, it’s important to consider the emerging challenges for Python and where Anaconda’s open source development work is headed. For so much of Python’s existence, many developers only worried about a small number of chip vendors and operating systems, where new platforms were slow to emerge and more exotic system architectures were generally only available to a handful of experts. However, many technology trends have permanently broken this status quo. Python developers have worked hard to adapt Python to this new reality, but much work still needs to be done. And that’s where Anaconda hopes to contribute.

How can we help Python meet these challenges? At Anaconda, we’ve identified five areas to help Python remain dominant for the next decade.

A compiler is critical for mapping human-readable programs to hardware-executable machine code. At the broadest scale, we want to see Python interpreters that can incorporate just-in-time compilation methods to remove some of the overhead from executing a dynamically typed language like Python. Python’s design makes this very challenging, but there is a lot more we can do for the majority of Python users out there. Efforts like Faster CPython at Microsoft and projects like Pyston and PyPy are all using various compiler techniques to speed up every Python program — and it’s promising to see.

Complementary to the broad approach of improving the Python interpreter, there is also space for more focused compilers that tackle specific domains or unique hardware targets. For example, Anaconda has been developing Numba with the PyData community for a decade, and it has been adopted by many users doing scientific and numerical computing. Numba's focus on numerical computing, primarily with NumPy arrays, has allowed it to achieve extremely high speedups (anywhere from 2x to 200x) on user code. Numba's deliberately narrow scope, combined with its extensibility, has allowed it to be ported to non-NumPy data structures and new hardware targets, and we think we could push this idea further to inspire the creation of a set of modular compiler components for use in even more projects.
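
To make this concrete, here is a minimal sketch of the kind of numerical kernel Numba compiles; the function and the array size are illustrative, not taken from any particular benchmark:

```python
import numpy as np
from numba import njit

@njit  # compile this function to machine code on its first call
def moving_average(x, window):
    out = np.empty(x.shape[0] - window + 1)
    for i in range(out.shape[0]):
        out[i] = x[i:i + window].mean()
    return out

data = np.random.random(1_000_000)
smoothed = moving_average(data, 50)  # later calls reuse the compiled code
```

The loop looks like ordinary Python, but because the data is a NumPy array with a known dtype and memory layout, Numba can lower it to tight machine code.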

One underappreciated aspect of the success of projects like Numba and Dask is that their creation depended on the widespread prior adoption of projects like NumPy and pandas. These projects gave Python users a common vocabulary and mental model for working with data in bulk as multidimensional arrays.

Consider NumPy and pandas, for example, where the high-level APIs were generally amenable to parallel computation and could be reimplemented directly in Dask. This meant that a NumPy or pandas user could switch to Dask, parallelize their computation over dozens of nodes, and not have to learn many new concepts. Similarly, because NumPy arrays have a well-defined and simple memory layout, tools like Numba can generate efficient machine code in a way that is not possible for data stored as collections of Python objects. Without the broad use of NumPy arrays as the standard way to manipulate numerical data, Numba would have had a more challenging time gaining adoption in the ecosystem.
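
As an illustration of that switch (the shapes and chunk sizes here are arbitrary), a NumPy-style computation can be written against Dask arrays almost verbatim:

```python
import dask.array as da

# A chunked, lazily evaluated array with a NumPy-like API.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x - x.mean(axis=0)).std(axis=0)  # builds a task graph; nothing runs yet
print(result.compute())                    # executes the graph in parallel
```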

But there is more to data than arrays and data frames. We've recently become more involved in projects that promote other data models in the Python ecosystem, such as the Awkward Array project. We've also been working to improve the adoption of sparse matrices, especially their expanded definition from GraphBLAS. These projects hold real potential for the Python ecosystem: wider adoption would let Python support a broader range of data analysis workloads.
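
For a sense of what these data models cover, here is a small, hypothetical example of the ragged data Awkward Array handles natively; the field name is made up for illustration:

```python
import awkward as ak

# Variable-length lists of records: awkward for NumPy, natural for Awkward Array.
events = ak.Array([
    [{"energy": 51.2}, {"energy": 23.4}],
    [],
    [{"energy": 17.8}, {"energy": 9.1}, {"energy": 4.2}],
])
print(ak.num(events))                    # records per event: [2, 0, 3]
print(ak.sum(events["energy"], axis=1))  # per-event totals, 0 for empty events
```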

Thanks to projects like Dask, Python has a robust set of tools for working with large data sets on clusters of computers. The essence of these tools is to partition datasets into independent pieces, minimize data motion, and bring the computation to the data whenever possible. This enables efficient scaling in a world where communication latency and bandwidth are significant constraints. At the same time, these tools provide an abstraction layer over the cluster, allowing users to program it like a single system, leaving a scheduler to decide how and where to execute code. Making the distributed computing paradigm as easy as possible will continue to be an ongoing goal here at Anaconda.
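
A small sketch of that partitioning model with Dask DataFrame (the toy table below is invented for illustration):

```python
import pandas as pd
import dask.dataframe as dd

# Split one logical table into independent partitions; the scheduler brings
# the computation (the groupby) to each partition rather than moving the data.
pdf = pd.DataFrame({"key": ["a", "b", "c"] * 300_000, "value": range(900_000)})
ddf = dd.from_pandas(pdf, npartitions=8)
print(ddf.groupby("key")["value"].mean().compute())
```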

It is equally valuable to consider programming a single computer like a cluster. Here Python faces a problem largely of its own: the global interpreter lock (GIL) limits how fully a single Python process can use a high-core-count system. Given the availability of massive servers in the cloud, we think there is a lot of potential upside to improving Python's efficiency on such machines. Long term, we need to tackle the GIL itself, which holds Python back in multithreaded workloads. For now, tools like Dask provide an excellent interim solution: starting a set of Python processes on a single system and partitioning work between them.
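
A minimal sketch of that interim approach, using Dask's local scheduler to run worker processes (each with its own interpreter and GIL) on one machine; the worker count and the toy workload are illustrative:

```python
from dask.distributed import Client, LocalCluster

def cpu_bound(n):
    # Pure-Python, CPU-bound work that threads in one process could not
    # parallelize because of the GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=8, threads_per_worker=1, processes=True)
    client = Client(cluster)
    futures = client.map(cpu_bound, range(1_000_000, 1_000_008))
    print(client.gather(futures))
    client.close()
    cluster.close()
```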

Where is your data, and how easily and efficiently can you load it?

Many organizations want to put all their data in one place in a common format, but it is difficult to create a data lake and enforce consistent usage. The reality is that most organizations (and small teams) don’t have the power to fully implement a data lake, leaving analysts dealing with multiple data sources across different systems and formats.

Similar to how we are betting on computational heterogeneity, we are also betting on data heterogeneity. Data will live where it lives, and while centralizing and organizing it is a worthy goal for many groups, the task will never be complete and tools need to acknowledge that reality. Toward this end, we are continuing our work on a portfolio of projects, such as Intake, fsspec, and Kerchunk, all designed to give Python users easy access to the broadest possible range of data storage systems and file formats.
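
As a rough illustration of that goal, fsspec exposes one file-like interface across storage backends; the paths and the bucket name below are hypothetical placeholders:

```python
import fsspec

# The same open() call works for local files, HTTP URLs, object stores, and
# more; only the URL scheme changes. (Paths here are placeholders.)
with fsspec.open("data/measurements.csv", mode="rt") as f:
    print(f.readline())

with fsspec.open("s3://example-bucket/measurements.csv", mode="rt", anon=True) as f:
    print(f.readline())
```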

When the Anaconda Distribution was created, the hardest problem to solve was getting the most popular Python data science packages built and working on Windows, Mac and Linux. However, the past 10 years have seen a massive shift in the computing landscape. More people than ever before have access to a “personal computer,” but they are not the traditional “PCs” that Python grew up with. The most popular software platforms in 2022 are (1) the web and (2) mobile operating systems like iOS and Android. The limitations of these platforms are much more significant than the differences between Windows, Mac and Linux. Nevertheless, the sheer number of users for these primary computing platforms is enormous. We need to figure out what Python should become on these platforms, to avoid having a whole generation of computer users miss out on the great things the Python ecosystem offers.

Python has been visible in the browser for years thanks to “notebook” projects like Jupyter. We’ve seen many users create amazing things when given the ability to compute in the browser, with help from an external Python process. But what if we could go one step further and push the Python runtime itself into the browser? We can build many great things if we treat Python as a first-class language inside the browser and leverage the extensive library of Python packages we already have. Along these lines, we are very interested in the potential for WebAssembly to bring many language communities (not just Python) to the browser. Python is already a universal glue between many languages and libraries on the desktop. Putting Python into the browser with projects like Pyodide and PyScript has the potential to glue the best packages with JavaScript. The work will be challenging, and some things will not be possible within the confines of browser security models, but great promise and potential are on the horizon.
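
To give a flavor of what Python as a first-class language inside the browser can look like, here is a small sketch of Python running under Pyodide, where the js module exposes the browser's JavaScript globals to Python (this only runs inside a Pyodide or PyScript environment, not a desktop interpreter):

```python
# Runs inside the browser via Pyodide/PyScript.
from js import document

heading = document.createElement("h2")
heading.textContent = "Hello from Python running in the browser"
document.body.appendChild(heading)
```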

Over the past decade, the explosive growth of the PyData ecosystem has given today’s Python users capabilities and features that were difficult to imagine in 2012. Interactive data visualizations in the browser, GPU-accelerated computing in Python, and effortless distributed computing in a Jupyter notebook all seemed like distant goals — but they are a reality now. Imagine what we can do in another decade with an even bigger community than we had back then.

But we will have to do it together. The ideas described in this post will be the focus of Anaconda’s OSS efforts, but these ideas will not reach their potential without help from the community. We hope we can inspire others to tackle these areas by joining us on key projects or going off and trying out new ideas. As long as we stay focused on Python’s strengths, grow the community of contributors, and learn from each other, great things will happen.
