The R programming language can read all sorts of data, and XML is no exception. There are many ways to read, parse, and manipulate these markup language files in R, and today we’ll explore two. By the end of the article, you’ll know how to use two R packages to work with XML.
We’ll kick things off with an R XML introduction: you’ll get a sense of what XML is, and we’ll also write an XML dataset from scratch. Then, you’ll learn how to access individual elements, convert XML files to an R tibble and a data frame, and much more.
First, let’s answer one important question: What is XML? The acronym stands for Extensible Markup Language. It’s similar to HTML since they’re both markup languages, but XML is used for storing and transmitting data over the internet. As you would assume, XML files have an .xml file extension.
When you first start working with XML files, you’ll immediately appreciate the structure. It’s human-readable, and there aren’t a gazillion brackets as with JSON. Unlike HTML, XML has no predefined tags. You can name your tags however you want, but it’s best to name them around the business logic.
Most XML documents start with the XML prolog, a declaration of the XML version and character encoding such as <?xml version="1.0" encoding="UTF-8"?>.
Each XML file must also have exactly one root element, which can have one or many child nodes. Child nodes may in turn have child nodes of their own.
Let’s see this in action with an XML dataset containing employees. It has one root element, and each child element has child elements of its own.
Copy this file and save it locally with an .xml extension. You’ll need it in the following section, where we work with XML in R.
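If you’d rather create such a file straight from R, here’s a minimal sketch using only base R. The tag names (records, employee, name, department, salary, hire_date), the values, and the file name are illustrative assumptions, not the article’s actual dataset.

```r
# Hypothetical employees dataset: one root element (<records>) with
# <employee> children, each holding a few child elements of its own.
employees_xml <- '<?xml version="1.0" encoding="UTF-8"?>
<records>
  <employee>
    <name>Ana</name>
    <department>IT</department>
    <salary>5000</salary>
    <hire_date>2019-03-01</hire_date>
  </employee>
  <employee>
    <name>Ben</name>
    <department>IT</department>
    <salary>7000</salary>
    <hire_date>2020-07-15</hire_date>
  </employee>
  <employee>
    <name>Cy</name>
    <department>Sales</department>
    <salary>4000</salary>
    <hire_date>2021-01-10</hire_date>
  </employee>
</records>'

# Write the document to a temporary location (hypothetical file name)
xml_path <- file.path(tempdir(), "employees.xml")
writeLines(employees_xml, xml_path)

# Confirm the file was written
file.exists(xml_path)
```

A temporary directory keeps the example self-contained. In a real project you would likely save the file next to your R script instead.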
But before we can do that, you’ll have to install two R packages:
Both are used to work with XML, and you can pretty much get by with only the first. The second one has a couple of convenient functions for converting XML files, which we’ll cover later.
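The package names aren’t shown above, so the sketch below assumes the two packages are xml2 and XML, both available on CRAN. It installs only the ones that are missing, and the mirror URL is just one common choice.

```r
# Assumed package names: xml2 and XML (both exist on CRAN)
pkgs <- c("xml2", "XML")

# Check which of them are already installed
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Install only what is missing
if (any(!installed)) {
  install.packages(pkgs[!installed], repos = "https://cloud.r-project.org")
}
```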
First things first, let’s see how you can read and parse XML files in R.
By now you should have the dataset downloaded and R packages installed. Create a new R script and use the following code to load in the packages and read the XML file:
Here’s what it looks like:
The data is all there, but it’s unusable. You can make it usable by parsing the entire document or reading individual elements.
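As a rough sketch of the reading step, assuming xml2 is one of the two packages, read_xml() loads a file into an xml_document object. The tiny dataset and the variable name employee_data are made up for illustration.

```r
library(xml2)

# Write a small hypothetical dataset to a temporary file so the
# example runs on its own
xml_path <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<records>",
  "  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>",
  "  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>",
  "  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>",
  "</records>"
), xml_path)

# Read the file; the result is an xml_document, not a data frame
employee_data <- read_xml(xml_path)
print(employee_data)
class(employee_data)
```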
Let’s explore the parsing option first. Call the parsing function and pass in the XML you just read.
The contents now look like our source file:
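One way to do the parsing, assuming the XML package is the one in use, is xmlParse(). The sketch below parses a literal string with asText = TRUE rather than the object read earlier, and the dataset is hypothetical.

```r
library(XML)

# Hypothetical dataset as a literal string
xml_src <- "<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>
</records>"

# asText = TRUE tells xmlParse() the input is XML content, not a file path
employee_xml <- xmlParse(xml_src, asText = TRUE)

# Printing the parsed document shows the full markup
print(employee_xml)

# The root element and its employee nodes are now accessible
xmlName(xmlRoot(employee_xml))
length(getNodeSet(employee_xml, "//employee"))
```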
Pro tip: if you don’t care about the data, you can print the structure only. That’s done with the xml_structure() function from the xml2 package.
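Here’s a small sketch of structure printing with xml2’s xml_structure(), run on a made-up document. It prints the tag hierarchy without the values.

```r
library(xml2)

# Hypothetical dataset as a literal string
doc <- read_xml("<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
</records>")

# Prints the nested tag names only; text values are shown as placeholders
xml_structure(doc)
```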
If you want to access all elements with the same tag, you can use the xml_find_all() function. It returns the matching nodes, including the opening and closing tags and any content that’s between them.
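As an illustration, xml2’s xml_find_all() takes an XPath expression and returns a node set. The department tag and the data below are assumptions made for the example.

```r
library(xml2)

# Hypothetical dataset as a literal string
doc <- read_xml("<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>
</records>")

# ".//department" matches every <department> element at any depth
departments <- xml_find_all(doc, ".//department")

# Printing the node set shows the tags together with their content
print(departments)
length(departments)
```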
If you only want the content, use the xml_text(), xml_integer(), or xml_double() function, depending on the underlying data type. The first one makes the most sense here.
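A minimal sketch of content extraction with xml2’s xml_text(), again on hypothetical data. It drops the tags and returns a plain character vector.

```r
library(xml2)

# Hypothetical dataset as a literal string
doc <- read_xml("<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>
</records>")

# Find the nodes first, then keep only the text between the tags
department_names <- xml_text(xml_find_all(doc, ".//department"))
department_names
```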
You now know how to do some basic R XML operations, but most of the time you want to convert these files to either a tibble or a data frame for easier access and manipulation. Let’s see how to do that next.
Most of the time, you’ll want to extract all or some of the fields and turn them into a more readable format. We’ve already shown you how to use xml_text() to extract text from a specific element, and now we’ll do a similar thing with integers. Then, we’ll combine the two fields into a tibble.
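The idea can be sketched as follows, assuming the xml2 and tibble packages are installed. The department and salary tags and the values are hypothetical.

```r
library(xml2)
library(tibble)

# Hypothetical dataset as a literal string
doc <- read_xml("<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>
</records>")

# xml_text() returns character values, xml_integer() returns integers
employees <- tibble(
  department = xml_text(xml_find_all(doc, ".//department")),
  salary = xml_integer(xml_find_all(doc, ".//salary"))
)
employees
```

This approach relies on every employee having both elements. If one were missing, the two vectors would differ in length and tibble() would raise an error.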
Now we have the department names and salaries for all employees. From here, it’s easy to calculate the average salary per department (note that only the IT department occurs twice):
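Here’s one possible way to compute the per-department average. This sketch uses base R’s aggregate() rather than any particular data-manipulation package, and the data is made up.

```r
library(xml2)

# Hypothetical dataset in which only the IT department occurs twice
doc <- read_xml("<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>
</records>")

employees <- data.frame(
  department = xml_text(xml_find_all(doc, ".//department")),
  salary = xml_integer(xml_find_all(doc, ".//salary"))
)

# Mean salary per department: one row per distinct department
avg_salary <- aggregate(salary ~ department, data = employees, FUN = mean)
avg_salary
```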
If you want to convert the entire XML document to an R data.frame, look no further than the XML package. Its xmlToDataFrame() function does the job in a single call.
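A short sketch of the whole-document conversion, assuming the XML package and its xmlToDataFrame() function, with hypothetical data. Every column comes back as text, so numeric fields need an explicit conversion.

```r
library(XML)

# Hypothetical dataset as a literal string
xml_src <- "<records>
  <employee><name>Ana</name><department>IT</department><salary>5000</salary></employee>
  <employee><name>Ben</name><department>IT</department><salary>7000</salary></employee>
  <employee><name>Cy</name><department>Sales</department><salary>4000</salary></employee>
</records>"

# Each child of the root becomes a row, each of its children a column
employee_df <- xmlToDataFrame(
  xmlParse(xml_src, asText = TRUE),
  stringsAsFactors = FALSE
)

# Values are read as text; convert salary to integer before doing math on it
employee_df$salary <- as.integer(employee_df$salary)
str(employee_df)
```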
That’s all the loading and preprocessing needed before you can start analyzing and visualizing datasets. It’s also the most common pipeline you’ll have for loading XML files, so we’ll end today’s article here.
XML files are common in 2022, and as a data professional, you need to know how to work with them. Almost all R XML-related work you’ll do boils down to loading and parsing XML documents and converting them to an analysis-friendly format. Today you’ve learned how to do that with two excellent R packages.
For a homework assignment, try to read only the date element, and make sure to parse it as a date. Is there a built-in function, or do you need to take an extra step? Make sure to let us know in the comment section below.
The post R XML: How to Work With XML Files in R appeared first on Appsilon | Enterprise R Shiny Dashboards.