Efficiency comparison of dplyr and tidyr functions vs base R
Posted on October 19, 2022 by R with White Dwarf
A couple of years ago I became interested in how efficient R is in terms of processing time and memory management, and I read a few blog posts on the topic, particularly ones pointing out that R was not designed to be a very efficient language, especially for big data processing, and that this could be its doom at some point in the future. Around that time I also read a great article or blog post about the complexity of using the tidyverse family of packages, especially for the task of teaching R to beginners. The text made excellent points about how the syntax of the tidyverse packages differs so much from base R functions that it might confuse people trying to learn R from scratch. From there, the discussion moved towards using the package data.table instead, which keeps a syntax closer to that of base R, and the author also took the opportunity to compare the efficiency of both packages. I apologize for the lack of sources, but I could not find the link to the post(s) I'm referring to; if any of you know the post I'm talking about, please share the link with me, I'd be greatly thankful.
Regardless of that line of thinking, I believe we can all feel lucky to have packages such as the tidyverse and data.table, which, among other advantages, make processing big data easier, and these are only the beginning of the list of packages. Although I was interested in the topic myself, I never ran any "formal tests" to compare the efficiency of these or other packages (although I did compare a few languages, including Julia, Common Lisp and, of course, Python, similarly to niklas-heer in his speed-comparison repo, whom I also thank for my header image). Nevertheless, in the last couple of weeks I had to run such tests due to the nature of my current job.
I recently joined a project where the team has been developing a mapper and wrapper of data in R, where the input data can vary from 2 rows to a few million. The whole process runs through a couple of servers to import the data into R, process it accordingly and send it out to a database, from where it is served to some other software. The process per se is quite complex, and because of the servers and Internet connections involved it can become quite slow. It is therefore critical that the processing time in R is kept to a minimum.
As mentioned before, a team has been working on this project for a while, and they have been using the tidyverse family of packages a lot. I myself prefer to stick to base R functions when it comes to development. I think it keeps the work neat, simple and easy to follow, keeps the dependencies to a minimum and, since I have known R for more than 10 years, it is easier for me to write code in base R. Please don't misunderstand me: I like the tidyverse functions, but I would rather use them for data analysis, where they are great for cleaning data, preparing it to fit models, exploring it and, of course, making visualizations with the wonderful ggplot preceded by the %>% sequence that provides exactly what is needed. But for me, developing software in base R is just more straightforward.
However, as I said, efficiency is critical in this project, and so I have already been tasked with testing it for a few functions. The most recent was a function that imports JSON files line by line using dplyr functions, whose run time I could cut in half using data.table functions, but that's a topic for another time. One of the first tasks I was given as a new member was to map a process very similar to an existing one, but with different input parameters. I could have simply copied the code from the previous mapping process into my own script and just changed the parameters, since the mapping logic is exactly the same. However, I decided to write my own code in base R, trusting that it would be more straightforward and efficient, and at the same time taking the opportunity to show my skills to my new team. I therefore ended up comparing the efficiency of the two versions using Monte Carlo simulations, which led to the present post. I hope it can be useful for some of you.
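Before going into the details, the general idea of such a comparison is simple: run each implementation many times on the same input and look at the distribution of the elapsed times. A minimal sketch, where map_base(), map_tidy() and input are hypothetical names standing in for the two implementations and the data they receive:

# Hypothetical sketch: time two implementations of the same mapping many times
n_sim <- 100
times_base <- replicate(n_sim, system.time(map_base(input))["elapsed"])
times_tidy <- replicate(n_sim, system.time(map_tidy(input))["elapsed"])
summary(times_base)  # distribution of elapsed seconds, base R version
summary(times_tidy)  # distribution of elapsed seconds, tidyverse version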
Image 1. Credits – https://github.com/niklas-heer/speed-comparison
The task
The general idea is to map a RESPONSE based on the contents of one column, in this case CODE1: every row gets the response "BATCH", and rows where CODE1 is empty additionally get the response "GETTING". For rows with response "BATCH", the columns NAME, DAY and TIME are renamed to TEAM, RESPONSETD and RESPONSESTT respectively, while rows with response "GETTING" only carry one more column: NAME, renamed to TEAM.
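Just to make the rule explicit, here is a tiny illustration; responses_for() is a hypothetical helper, not part of the actual mapping code:

# Illustrative only: which response(s) a given CODE1 value produces
responses_for <- function(code1) if (code1 == "") c("BATCH", "GETTING") else "BATCH"
responses_for("Code")  # "BATCH"
responses_for("")      # "BATCH" "GETTING"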
test.df
>    NAME      DAY     TIME CODE1
> 1     1 20-10-22 18:37:23  Code
> 2     2 20-10-22 18:37:23
> 3     3 20-10-22 18:37:23  Code
> 4     4 20-10-22 18:37:23
> 5     5 20-10-22 18:37:23  Code
> 6     6 20-10-22 18:37:23
> 7     7 20-10-22 18:37:23  Code
> 8     8 20-10-22 18:37:23
> 9     9 20-10-22 18:37:23  Code
> 10   10 20-10-22 18:37:23
The whole general idea is to create a new table with the response values, a step that is preceded and followed by a series of adjustments to the data. For this post I have created a test data frame with simple values, in case somebody would like to reproduce the code execution.
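A minimal sketch that produces a data frame matching the one printed above (the date, time and values are chosen only for illustration; what matters for the mapping is the alternating empty CODE1 entries):

# Test data frame: 10 rows, CODE1 alternates between "Code" and empty
test.df <- data.frame(NAME  = 1:10,
                      DAY   = "20-10-22",
                      TIME  = "18:37:23",
                      CODE1 = rep(c("Code", ""), 5))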
rename_nCols