CorneR A small R Blog by Hao Zhu

"Build" a reproducible report, instead of "writing" one

(This blog post grew largely out of the inspiration I got from reading the 2nd chapter of Code Complete 2, which is a very practical book even for someone like me, who came into this field without a computer science background. I would highly recommend it to any data analysts or statisticians who do a lot of programming.)

To some extent, the development of Rmarkdown (and knitr) is really revolutionary for scientific research. For hundreds of years, scientists spent way too much of their valuable time fighting formatting issues in their reports. In the old days, a last-minute change to the study data could be a real disaster that ruined a researcher's life, at least for a few hours. Now, with Rmarkdown, it's as if every scientist can hire his/her own copy editor to make sure the reports are presented in good shape. One of the key features of Rmarkdown is that the reports it generates can update themselves, once re-knitted, to reflect changes made to the underlying data. However, the question is: for these "smart" reports, is it still appropriate to say we are "writing" them?

When we say we are "writing reports", it suggests that we write them the way we would write a letter or a novel. The practice of "writing" is relatively linear: you start from the beginning and finish at the end. If you want to add tables and figures, you have to stop your "writing" process, go make them, add them to your report, and then resume "writing". If you map this process onto an Rmarkdown file, you would expect to see something like this:

I have two dogs.
{r}
code.to.generate.a.picture.code.to.generate.a.picture.
code.to.generate.a.picture.code.to.generate.a.picture.
...
...
code.to.generate.a.picture.code.to.generate.a.picture.
code.to.generate.a.picture.code.to.generate.a.picture.
The bigger one is a Border Collie.

This format might still be okay if you just want to write a short set of instructions or a blog post. However, imagine that you are writing a 50-page paper with 20+ tables and 10+ figures. What is your life going to be like when you want to re-read your source .rmd file after throwing everything into one basket? In such a case, the readability of the original .rmd file is so low that it is very difficult to maintain. And that is before your analytical code throws an error while you are trying to generate the report: it's going to be a nightmare to find out where the bug is and to fix it. After all, this goes against the design philosophy of the markdown language, which is to maximize the readability of the source text. Text mixed with page-long analytical code is not really readable.

Now, let's take a lesson from the workflow of architects. When an architect sets out to build a house, first of all, (s)he draws up the blueprint of the project so (s)he knows what is expected to be done. If (s)he wants cabinets in the kitchen, (s)he either hires someone to build them or buys them directly from a furniture store. Then, some day, workers move those cabinets into the house and install them. If we adopt the architect's philosophy in our case, we will have a "blueprint" in mind, make the tables (cabinets) somewhere else, import them into our .rmd file (house), and only call those tables when we need them (final installation). Like this:

I have two dogs.
imagepath: public/IMG0074.png
The bigger one is a Border Collie.

See, a lot better, right?
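
To make the "cabinets built elsewhere" idea a bit more concrete, here is a minimal sketch of how the split might look. The file names analysis.R and report.Rmd, the object names mpg_table and plot_mpg, and the use of R's built-in mtcars data are all assumptions made up for illustration; they are just one possible way to organize things, not a prescribed layout.

```r
# --- analysis.R (hypothetical file name) ------------------------------
# Build the "cabinets" here, away from the report text.

# Summary table: mean miles per gallon by number of cylinders,
# using the built-in mtcars data as a stand-in for real study data.
mpg_table <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Wrap the figure in a function so the report can draw it with one call.
plot_mpg <- function() {
  plot(mtcars$wt, mtcars$mpg,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
}

# --- report.Rmd (hypothetical file name) ------------------------------
# A setup chunk, hidden with include=FALSE, would only need:
#   source("analysis.R")
# After that, each chunk in the body holds a single short call:
#   knitr::kable(mpg_table)
#   plot_mpg()
```

With this kind of split, the .rmd file stays mostly prose, and a bug in the analysis can be chased down by running analysis.R on its own. If you would rather keep the code in an external script but still have it appear as chunks, knitr's read_chunk() is another option worth looking at.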