Organizing R development using srcpkgs

R
srcpkgs
dev
Author

Karl Forner

Published

May 27, 2024

Overview

This is an introduction on organizing R projects using source packages (powered by my R package srcpkgs). It is based on notes for a talk I have on 2024-05-27 for the Swiss Institute of Bioinformatics Vital-IT group Analysts meeting.

The obecjtive is to organize R projects in order to:

  • reuse code
  • share code
  • increase robustness
  • enable analysis (code) reproducibility

The context is mostly for analysis oriented R projects.

R packages

All R users use R packages, the core ones such as base, stats, tools, and some from CRAN or BioConductor.

Why would you want to use R packages for your own code???

a R package is:

  • self-contained
    • it bundles together all related code, the documentation, the relevant data and tests
  • the dependencies are explicitly stated, and are themselves R packages

On the natural evolution of code projects…

My view on the general evolution of analysis projects:

  • you start with a single script, sequential, with no functions

  • at one point (after writing hundreds or thousands of lines) you realize that you need some functions

  • then you start reusing those functions across projects by copy/paste. This raises a number of problems

    • versioning: at one point you will fix or improve such a function
      • it may be difficult to remember which project contains the latest version
      • what of the projects that contain the incorrect versions?
  • then you may want, if you work in a team, to share this code with colleagues, or to use theirs

    • –> it requires some documentation, even terse.
    • there’s a increased responsibility. What if your code is wrong and impact the projects of your colleagues? One remedy is to write tests for those functions.
    • those functions are seldom independent, so that you can not just pick one
    • all those functions are exposed (i.e public or exported).
      • if you start to use a low-level function in your project, and that in the next version it has been refactored and that this function has been changed, or removed, updating the shared code will break the project.
  • for all those reasons you start packaging your reusable code as a R package

    • you can add documentation, tests, group code logically. It brings a namespace so that you can decide what you expose.
  • But… it does NOT really solve the versioning problem

    • in R, packages have to be installed (e.g. using install.packages()) before you can use them with library(mypkg)
    • packages have a version number (N.B: this is not the same as code versioning)
    • if you use version v1 in your project A, and version v2 in project B, you have to juggle with versions (install/uninstall) Of course there are some tools to deal with that (renv…) but they work with external packages (or you need some private custom repositories)
    • and it’s very cumbersome. Suppose that in your project A you find a bug in the (installed package). In order to fix it, you need to
      • fetch the source code of the package
      • try to reproduce your problem. Chances are that you need your project data, you have to reproduce your session
      • finally, if you manage to fix it. You have to publish it, install it.
  • my approach is to use what I call R source packages

    • they are normal R packages, but instead of installing them on your R system, you load them directly from source in your R session.
    • it was made possible by the infamous Hadley Wickham, and his devtools::load_all() function, that mimics the loading of an installed package
    • this greatly helps with all those problems:
      • you embed your source packages inside your project (as git submodules, we’ll that see later) this solves the versioning/reproducibility at your reusable code level: all your projects may use a different version
    • if you need to fix a bug, or improve and augment your reusable code, it’s a simple as editing the code for your project. And using srcpkgs, you can even easily reload the code inside your existing R sessions, without losing any computed data.
  • so far so good. Then for ease of maintenance/modularity, you start splitting your reusable code by category, and develop several R packages, e.g. one for some misc utilities, one for loading data from your database, one for some specific analysis…

    • this is where srcpkgs become usefuls, since devtools was designed to manage a single R source package, not a collection/library of possibly inter-dependent packages.
      • additionally has a useful little hack that enables you to use the standard library() function to load your source packages. So that when you analysis is finalized, or deployed in production, with your packages installed in the standard way, your script will continue to worl without any change.
  • But this does not solve the reproducibility for the external packages

    • your code and source library most certainly use external packages, and also depend on your R version (and thus on the bioconductor version)
    • it may also depend on your OS architecture (CPU…)
    • this is out of scope for that talk, but one solution for that is to use a virtualized development environment: a docker container (cf https://rocker-project.org/) that contains a fixed version of R, and of all the needed external packages.
    • now the challenge is to synchronize that docker container version with your source library version…
    • also cf devcontainers

Summary

script --> script+functions --> script + source files --> R package --> R source package --> R source library [ + R docker env]

Resources:

I (Karl Forner) am currently working as a consultant, contact me if you want me to help you on using R, organizing development, developing R packages or more generally support your software development efforts.