Power Query: Excel's gateway to reproducible analysis
Mar 29, 2018
10 min read Insights
Intro
In this blog post, I’ll try to highlight some of Excel’s functionality which have been around for a while, but remains largely unknown to the broad public.
Now, I’ll be the first one to throw rocks at the Excel camp. I’ve got receipts:
— Taras Kaduk (@taraskaduk) February 10, 2018
My analysis is in Excel. #Loseyourjobin5words
— Taras Kaduk (@taraskaduk) March 5, 2018
Reproducible analysis, case in point. A colleague from another department is out. People come to me w/ requests to recreate some analysis that person did. I have no idea, but OK, let's see.
Colleague's analysis is in Excel. How can I reproduce it?
No analysis for you! pic.twitter.com/SoXZO10ekG
— Taras Kaduk (@taraskaduk) November 27, 2017
However, I’ll also be the first to jump in Excel’s defense whenever an opportunity presents itself:
Generally a very good article on advantages of R, but as always - completely overlooking Excel enhancements that came with Power Query and Power Pivot (e.g. scripting, hundreds of data connectors. I.e. all the Power BI functionality) #rstats #PowerBI https://t.co/k3qj8px81b
— Taras Kaduk (@taraskaduk) December 27, 2017
What gives? I generally agree that Excel is a bad way to conduct an analysis. Having said that, there are many enhancements to the product (Windows version mainly) that came out over the last 10 years which are completely overlooked by both Excel users and Excel bashers. One of such enhancements, Power Query (or Query Editor, or M language), allows for a reproducible data import and transformation, and is quite easy to learn. That will be this post’s topic.
What this post isn’t about
Excel cell formulas
Power Pivot and data modeling
DAX and Excel / Power BI measures
Power Query’s and M’s history
full functionality of Power Query
Excel on a Mac
The audience
This post isn’t a guide to using Excel’s and Power BI’s Power Query. This post is not for people who want to learn new cool tricks of Excel.
This post is for data scientists and analysts who put reproducible analysis (achieved via programming one’s analysis in R or Python) at the center of what they do. This post is for people who like to use Excel as a punching bag (and again, I am one of those people).
Lastly, this post is for those analysts stuggling to switch the workflow at their place of work from Excel to R or Python. I think that Power Query could act as a gateway drug to reproducible analysis. Meet them where they’re at (in Excel), and guide them by hand out into the world of “programming your analysis”.
Power Query
Power Query, a.k.a. Data Explorer, a.k.a. Query Editor is Microsoft’s module within Excel and Power BI which allows users to perform data import and transformations before loading the ready tables into a workbook. It’s been around since 2013. Google for more information.
Components
In the simplest terms possible, I could break the Power Query down into two parts: the language and the GUI.
The M language
Here is the data transformation sample from David Robinson’s DataCamp course on the tidyverse
library(gapminder) library(dplyr) gapminder %>% filter(year == '2007') %>% mutate(lifeExpMonths = 12 * lifeExp) %>% arrange(desc(lifeExpMonths))
I hope I don’t need to read to you what it does (if you can’t read the code, try running it. If you have no idea what’s going on here - I suggest taking the above-mentioned David’s class on the tidyverse)
Here is how I’d solve the same simple task in Power Query. First, Power Query in Excel, unlike Power Query in Power BI, can’t run R scripts, therefore I can’t just load a package. But Power Query can read .RData files. It can also load stuff from the web. We’ll do just that
let Source = Web.Page( Web.Contents("https://github.com/jennybc/gapminder/blob/master/inst/extdata/gapminder.tsv")), Data = Source{0}[Data], col_types = Table.TransformColumnTypes(Data,{ {"", type text}, {"country", type text}, {"continent", type text}, {"year", Int64.Type}, {"lifeExp", type number}, {"pop", Int64.Type}, {"gdpPercap", type number} }), filter = Table.SelectRows(col_types, each ([year] = 2007)), mutate = Table.AddColumn(filter, "lifeExpMonths", each [lifeExp] * 12, type number), arrange = Table.Sort(mutate,{{"lifeExpMonths", Order.Descending}}) in arrange
Now, again, I changed the names of the steps and indented the code for readability purposes. The rest was generated by Power Query and I was just clicking on things. I want to stress it out again: I didn’t have to know any of the functions, any of the syntax. All I did was:
Pass a web link into GUI.
From there, Power Query figured out that it needed a combo of Web.Page(Web.Contents()) to get to the data. It saw a table and guessed column types for me.
From here, I clicked on the “year” column header to filter it, clicked a button to create a new column out of the old one, and the clicked on its header to sort in descending order.
In other words, I came to this with no pre-existing knowledge of coding, and got myself a reproducible piece of code. The data refreshes upon each load: Excel will be checking Jenny’s GitHub page every time we refresh the data, and will be applying the steps as documented.
The GUI
As you may have guessed from my previous paragraph, the Query Editor GUI is bread and butter of this whole scheme: the M language itself is hard to type by hand, the functions are long, it is case sensitive, and there is no good source code editor (Notepad++ and other text editors do a better job than the Power Query itself). But I feel like Power Query wasn’t built to program in: that’s not the main customer base of Excel and Power BI. What Power Query is good at is its GUI that allows users to click around and apply data transformation steps, all the while generating a script behind the scenes.
First, there is a ribbon with several tabs and plenty of buttons to click on. Some represent very simple existing functions, while others are pretty complicated and generate a solid chunk of code on just one click.
Next, you are allowed to interact with your data to some extend. You can’t edit any cells, but you can filter columns, move them around, fill them down, sort, and so on, within the table itself. Power Query will pick up on your actions and will save your transformations in a script.
You can have more than one query, coming from different sources, and you can make them interact with each other: merge (join), append (union), reference, split, nest and so on.
You can also re-arrange the query steps in the GUI via simple drag-n-drop, and your script will be re-written to reflect the new order.
What I like about Power Query
There are quite a few things I like about Power Query:
Learning curve. It’s easy to get started with Power Query and create functional reproducible scripts out of the gate. The powerful GUI allows for that. No setup necessary: no installation, fine tuning, no ODBC drivers and connections. It just works out of the box. Almost like it wasn’t Microsoft creating it.
Accessing data. Getting the data is the most seamless experience I’ve had. It has pre-built functions for many data sources (I’ve got receipts: here is the list of functions ), and can recognize a large amount of data formats. It doesn’t require any ODBC setup. The Power Query flavor that runs on Power BI can also use R scripts as a data source or a data step, yet this functionality isn’t a part of Excel yet.
Tidy-like data storage and display. Surprised? Yeah, with a few exceptions, Power Query treats everything as a data frame. What if it’s not a data frame? Then it tries to fit the data into a rectangular shape. I’d like to show a few examples.
Here is an example of how Power Query treats JSON files. I used the well-known (thanks to Jenny Bryan’s tutorials!) API of Ice and Fire . Here is what I’ve got after a few clicks. Note that I only plugged in the API call as a URL - Power Query did the rest.
It is a data frame, but it has a nested list column for titles. Neat!
Not only JSON records get nested. You could have a nested table (can happen upon a join or after a group_by-like call), or a nested list. Regardless, Power Query will always try to make your data rectangular, which is pleasing to any tidyverse adept. Here is another example. I took the same Gapminder dataset, and nested it, grouping by country. The table above is now how Power Query sees the table. The data frame below is a sneak peek into one of the nested cells for the United Kingdom. The function up on top is the step I applied to nest the data frame:
One more example. Here is how Power Query sees a folder full of files:
Likewise, if you told Power Query to access a database, and didn’t specify SQL statement, it would return a data frame of all tables and views in that database, and you can take it from there.
Robust GUI that keeps the code. Another cool feature is that the GUI can handle 90% of one’s needs, and it scripts all transformations behind the scenes. I think of it as a gateway drug to “programming one’s analysis”
What I don’t like about Power Query
Speed. It is unbelievably slow. I guess the convenience comes at a cost of performance. It does well on small datasets and simpler operations, but fails to do a decent job the moment you scale.
Syntax and flexibility. The language is rather rigid, and the syntax is annoying at times.
Limits. It can only do so much. You can import and transform the data, but you can’t do anything else here.
Conclusion
What we have these days is several generations of knowledge workers trained on doing their analysis in Excel. Just bashing the tool is not productive. We think we provide better alternatives with R or Python, but we frequently forget about the learning curve associated not only with learning a new language, but also with learning a language for the first time. Excel’s Power Query could serve as an important stepping stone in taking the analysis out of the Wild West world of Excel VLOOKUPs into the world of reproducible code, git repos and other warm and fuzzy things.