Today, we’ll share the discussion centered around pandas, another bridge between users and their data, and how Ibis and pandas are related.
First, why was there so much discussion about pandas? The pandas project is a sprawling behemoth of a DataFrame and analytics package for Python. It has close to three million daily downloads and is one of the most popular DataFrame APIs for Python analytics workflows (if not the most popular).
So, what does Ibis have to do with pandas? For starters, they have one thing in common: both were created by Wes McKinney, Co-Founder and CTO of Voltron Data, to facilitate and streamline data analytics in Python.
The difference between the two is how they go about accomplishing this.
Modern datasets can contain millions – sometimes billions – of rows of data, and local execution requires much more memory than it did 10 years ago. We recommend reading McKinney’s article, “10 things I hate about pandas,” where he noted, “pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset.” Managing large datasets locally can get prohibitively expensive for growing teams of data scientists, data engineers, and students.
of varying SQL dialects, reference Python objects directly in their pre-processing, and push all of the hard work onto the backend.
Once all of the memory-intensive transforms are complete,
and their corresponding functionality).
Conceptually, Ibis and pandas are very similar: they both manipulate tabular data. We can take a DataFrame, and transform it by operating on columns or by creating new ones.
For users interested in how Ibis compares to pandas for their particular workflow, we can look at the similarities between the two APIs to see how easy it is to learn Ibis coming from pandas.
First up is the return type. The default return type from Ibis’s execute is a pandas DataFrame. With this, all pandas DataFrame methods and operations are available.
Next is column referencing. Referencing a single column in an Ibis expression works exactly the same way as referencing a column as a Series in pandas: simply use single brackets with a string column name or a ColumnExpression.
There is a slight difference between Ibis and pandas, though. In pandas, if you want to select multiple columns (to return a DataFrame containing a subset of columns), you use single brackets enclosing a list of string column names. In Ibis, you can use single or double brackets with comma-delimited string column names or ColumnExpressions:
The more Ibis-y way of selecting columns is to use the select method on a TableExpression, though, so our recommendation is to just use that instead (the select method also accepts string names or ColumnExpressions):
The last similarity that we’ll discuss is groupby. Ibis does support groupby and aggregations, just like pandas. You can group a TableExpression just as you would a pandas DataFrame and then aggregate:
Next in this series, we will discuss what we find interesting and exciting about switching certain workflows from pandas to Ibis, particularly code portability and performance gains.
In the meantime, download and try Ibis today. You might find performance boosts by switching some pandas loads and transforms for Ibis selects, filters, and mutates.
Earlier this year, Voltron Data added Ibis support to our Enterprise Subscription services. If your company is interested in developing tools and workflows built on top of Ibis, please take a look at our subscription tiers and get in touch.
If you want to learn more or stay up to date with the Ibis project, tune into these channels: