Logo

The Data Daily

R function to fill in merged cells | R-bloggers

R function to fill in merged cells | R-bloggers

R function to fill in merged cells
Posted on December 4, 2021 by R with White Dwarf in R bloggers | 0 Comments
[This article was first published on R with White Dwarf , and kindly contributed to R-bloggers ]. (You can report issue about the content on this page here )
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Share Tweet
This post is part of our series on functions in R . You can see our previous post if you want to understand the basics but it is not strictly necessary. Here we will go into detail about for loops and if statements in R, two key elements of any function. We are going to define a process, map it in a step-by-step approach and wrape it in a function that can repeat it automatically. Even if you have a very basic understanding of R you should be able to follow this tutorial without problem.
Our outcome will be an R function that by calling it, is able to fill in empty rows generated from imported sheets (like excel) when it contains merged cells. If you are only interested in the function itself you can go to the end of the post and find it in the section Final remarks .
Description of the problem
In our previous post we saw the basics for creating functions, yet using silly examples with not much of practical usage. Now we are ready to write a function that can have more practical use.
There are different ways how to import data sheets (i.e. from excel) to R. Regardless of its limitations, these sheets are widely used in data analysis today. If you are used to do data analysis with a different software you should be familiar with the complications of sorting your data imported from sheets when there are merged cells in the rows. Usually, a file like below
results in a table like this
Specie
NA
11
when the amount of rows to be filled in is small, there’s no big problem in copying and pasting the values. But as the DRY principle says, if we know how to create functions there is no need to do that, we can make a function that will do it automatically. This will specially pay off when you will have a table with hundreds or even thousands of cells merged. You might be thinking that nobody will merge cells for thousands of rows every 3 or 4 lines, but believe me, I have seen such things.
Keep in mind that this is mainly a tutorial for writting functions in R. It does not intend to deal with all the issues that migh appear with the importing of data such as merged columns or a mixture of both, among others. But if you have problems with that or are interested in the topic, leave us a comment and we can cover some points in a future post.
R function to fill in merged cells from excel
With today’s technology there are many ways to solve this problem. However sometimes the easiest way to import data to R as data analyst or statistician is by simply taking the working sheet containing the data and exporting it in csv format.
Regardless of the source (excel, libre office, google sheets, etc.) this method produces empty rows by default. When some rows have been merged because they belong to the same group or factor, the csv file will capture the value only on the first row and leave the rest empty until the next factor appear, where it again, will capture the value on the first row and leave the rest empty until the next factor appears. The process continues like this, iteratively until the end of the table. We basically need to copy the value stored on that first row and paste it to the empty rows, until a new factor appears.
Let’s map the process in terms of R steps to complete our task.
Maping the process
We will start by calling the table. If you have an excel, libre office calc or google sheet file with merged rows as our example above feel free to use it. Otherwise you can quickly simulate one similar to the image above. Start by exporting the sheet of interest to csv, then we call it using read.csv
my.table Specie Dup Treat Rep Value > 1 A. cap A 0 1 34 > 2 AA NA NA 26 > 3 A 25 NA 18 > 4 AA NA NA 24 > 5 A 50 NA 11 > 6 AA NA NA 12 > 7 A 100 NA 15 > 8 AA NA NA 11 > 9 F. rub F 0 NA 25 > 10 FF NA NA 26
Here we can see the first 10 rows containing NA for numeric columns and empty string for character columns. Now let’s go step by step to fill empty values.
1. Identify and capture the factor
Let’s start with the first column Specie (an experiment was run for 3 different species of plants). We want to check if the first row contains a categorical value
no.row [1] "A. cap" 2. Copy it into the empty rows
Now we want to paste the value stored in category to all empty rows. So we first need to check if the next row is empty
no.row [1] ""
And when it is, we place the value contained in the previous row to our current row 2
my.table[['Specie']][no.row] Specie Dup Treat Rep Value > 1 A. cap A 0 1 34 > 2 A. cap AA NA NA 26 > 3 A 25 NA 18 > 4 AA NA NA 24 > 5 A 50 NA 11 > 6 AA NA NA 12
Now row 2 contains it’s categorical value, and when we move to row 3 (which is also empty), it can be copied from the previous row 2
no.row [1] "" my.table[['Specie']][no.row] Specie Dup Treat Rep Value > 1 A. cap A 0 1 34 > 2 A. cap AA NA NA 26 > 3 A. cap A 25 NA 18 > 4 AA NA NA 24 > 5 A 50 NA 11 > 6 AA NA NA 12 3. When a new factor appears, repeat the process
We can repeat this process until a new factor appears, as it is the case of row 9. Therefore we should not paste anything in row 9, and continue the process on row 10 which is also empty
no.row [1] "" my.table[['Specie']][no.row] Specie Dup Treat Rep Value > 1 A. cap A 0 1 34 > 2 A. cap AA NA NA 26 > 3 A. cap A 25 NA 18 > 4 AA NA NA 24 > 5 A 50 NA 11 > 6 AA NA NA 12 > 7 A 100 NA 15 > 8 AA NA NA 11 > 9 F. rub F 0 NA 25 > 10 F. rub FF NA NA 26 4. Repeat steps 1-3 for each column that needs it
The process moves forward in this way until the whole column Specie is filled in. Then we can move to the next column with empty values, in my case this is Treat.
If you look at the process, we basically need to write an R command for step 2. The rest is just a process of verification and repetition. We are going to automate verification using the function if() and the repetition using for().
Writing my first for loop
Since the present post is directed to R beginners with not much experience with programming or coding I will avoid all the technicalities of for loops and if statements and instead dive deeply into them by applying our logic above. Then we are going to use them and explain carefully to obtain a pragmatic understanding of the process.
We will start with a for loop to go row by row in one column and check what is inside, as described in the step 1 of our process. Let’s start with only 20 rows as an example
for(no.row in 1:20){ print(my.table[["Specie"]][no.row]) } > [1] "A. cap" > [1] "A. cap" > [1] "A. cap" > [1] "" > [1] "" > [1] "" > [1] "" > [1] "" > [1] "F. rub" > [1] "F. rub" > [1] "" > [1] "" > [1] "" > [1] "" > [1] "" > [1] "" > [1] "A. ela" > [1] "" > [1] "" > [1] ""
The function for()uses the first argument that you define (here no.row) and goes one by one in the series specified as the second argument, in this case a series of numbers from 1 to 20 (1:20). It means that in the first iteration no.row will take the value 1, in the second iteration the value 2, and so on, until the last iteration where it will have the value 20. Each iteration will execute the code inside the curly braces {} which, in this case, is simply to print the value of each row.
Because the first argument in for() is defined by us, we could as well use any arbitrary argument we want, for example
for(n in 1:20){ print(my.table[["Specie"]][n]) }
would do exactly the same but now n is taking the values from 1 to 20.
We can then initialize a function that takes the name of our data frame, the name of the column, and do exactly the same as our code above, but for all the rows contained in the table, no matter how many they are
fill_merged 2 A. cap AA NA NA 26 > 3 A. cap A 25 NA 18 > 4 A. cap AA NA NA 24 > 5 A. cap A 50 NA 11 > 6 A. cap AA NA NA 12 > 7 A. cap A 100 NA 15 > 8 A. cap AA NA NA 11 > 9 F. rub F 0 NA 25 > 10 F. rub FF NA NA 26 > 11 F. rub F 25 NA 17 > 12 F. rub FF NA NA 11 > 13 F. rub F 50 NA 13 > 14 F. rub FF NA NA 11 > 15 F. rub F 100 NA 11
Note that so far it works only for columns with character values, not numeric
fill_merged(my.table, 'Treat') > Error in if (dat[[column]][n] == "") {: missing value where TRUE/FALSE needed
The reason is that only character columns produce empty strings. Numeric columns will produce NA values. Therefore, we need to add a condition to our if expression to test also if the value is NA. To do that we use double | which in R means OR
fill_merged [1] TRUE
Now we can use our function for columns with numeric values also
my.filled.table Specie Dup Treat Rep Value > 1 A. cap A 0 1 34 > 2 A. cap AA 0 NA 26 > 3 A. cap A 25 NA 18 > 4 AA 25 NA 24 > 5 A 50 NA 11 > 6 AA 50 NA 12 > 7 A 100 NA 15 > 8 AA 100 NA 11 > 9 F. rub F 0 NA 25 > 10 F. rub FF 0 NA 26
Check, confirm and repeat (for and if together)
Now we could use our function for each column that presents this issue, but we are actually trying to apply the DRY principle. Instead we could use for() to go through all the columns where we want to apply it. There are other functions that can help with this as well such as map() and its derived functions from the package purrr .We could also write a new function that calls our first function to repeat it into each column. Feel free to experiment, for this tutorial we are going to take a different approach.
We are going to improve the same old function and add yet one more for loop that iterates from each column of interest and repeats the same process. This will cover the step 4 of our mapped process.
The implementation is actually easy, we just need to wrap the whole previous process of if’s and for’s inside a for loop that goes column by column
fill_merged 3 A. cap A 25 1 18 > 4 A. cap AA 25 1 24 > 5 A. cap A 50 1 11 > 6 A. cap AA 50 1 12 > 7 A. cap A 100 1 15 > 8 A. cap AA 100 1 11 > 9 F. rub F 0 1 25 > 10 F. rub FF 0 1 26 > 11 F. rub F 25 1 17 > 12 F. rub FF 25 1 11 > 13 F. rub F 50 1 13 > 14 F. rub FF 50 1 11 > 15 F. rub F 100 1 11
Also notice the trick in the first line: names(my.table) returns a vector containing all the column names of the data frame and names(my.table)[c(1,3,4)] is taking only the columns 1, 3 and 4, which are the ones that need to be fixed. This is extremely useful when you have many columns that need to be fixed
Marking the errors
You might have noticed that I’m adding text preceded by ## within the function. If you are not familiar with it, this are comments, it means that anything that is written in the same line after one # will not be evaluated by R (I use double for technical reasons of my text editor). Although this function is quite small and simple, and we know what exactly we are doing on each line thanks to the explanations, it is a good practice to add comments to your code because after a while, when you will look back at the code you might had forgotten the logic and structure. Adding comments help us to know what each piece is about, making it easier to apply changes in the future.
In the same way, it is a good practice to add errors when the function is expecting something in particular and we can foresee potential problems. Often we don’t foresee all the mistakes and problems that the user or we ourselves can have when using our own functions and thus, errors are usually added along the way based on the experience gathered by using the function.
For example, our function is expecting that at least the first row will not have empty values, otherwise it cannot go one row before to find the value to paste on it. Although it is not expected, our table can still present this situation due to human errors, for example, somebody by accident pressed Delete button somewhere on the first row in the source file. In such case R will mark some error that will be difficult to understand and track back. We might wonder for hours what we did wrong in our function only to find out that the problem comes from the data table itself. Instead we can mark our own error in advance by sending a message when the value on the first row is missing.
An easy way of implementing this is using the function stop(). Let’s implement our error into our function right before it copies the value from the row n - 1.
fill_merged

Images Powered by Shutterstock