kforner

an elegant way to fix user IDs in docker containers using docker_userid_fixer

Karl Forner — Tue, 13 Aug 2024 22:00:00 GMT

what is it about?

It’s about a rather technical issue that arises when docker containers interact with the docker host computer, generally through using the host filesystem inside the container. That happens in particular in a reproducible research context. I developed an open source utility that helps tackle that issue.

docker containers as execution environments

The initial and main use case of a docker container is a self-contained application that interacts with the host system only through some network ports. Think of a web application: the docker container typically contains a web server and a web application, running for example on port 80 (inside the container). The container is then run on the host, by binding the container internal port 80 to a host port (e.g. 8000). Then the only interaction between the containerized app and the host system is via this bound network port.

Containers as execution environments are completely different:

instead of containerizing an application, it’s the application build system that is containerized.
- it could be a compiler, an IDE, a notebook engine, a Quarto publishing system…
the goals are:
- to have a standard, easy to install and share environment
  - imagine a complex build environment, with fixed versions of R, python and zillions of external packages. Installing everything with the right versions can be a very difficult and time-consuming task. Sharing a docker image containing everything already installed and pre-configured is a real time-saver.
- to have a reproducible environment
  - by using it, you are able to reproduce some analysis results, since you are using the very same controlled environment
  - you can also easily reproduce bugs, which is the first step to fixing them

But, in order to use those execution environments, those containers must have access to the host system, in particular to the host user filesystem.

docker containers and the host filesystem

Suppose you have containerized an IDE, e.g. Rstudio. Your Rstudio is installed and running inside the docker container, but it needs to read and edit files in your project folder.

For that you bind mount your project folder (in your host filesystem) using the docker run --volume option. Then your files are accessible from within the docker container.
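As a minimal sketch of such a bind mount: the helper names `run` and `run_with_project`, the container path `/project` and the `alpine` image in the usage comment are illustrative assumptions, not part of the setup described here.

```shell
# Dry-run wrapper: with DRY_RUN set, print the command instead of executing it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: run IMAGE [COMMAND...] with the current directory
# bind mounted at /project inside the container (path chosen for illustration).
run_with_project() {
  run docker run --rm --volume "$PWD:/project" --workdir /project "$@"
}

# Example: list the project files as seen from inside a container
# run_with_project alpine ls -l /project
```

Setting `DRY_RUN=1` before calling the helper prints the docker command line without starting a container, which is handy for checking the mount paths first.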

The challenge now is the file permissions. Suppose your host user has userid 1001, and suppose that the user owning the Rstudio process in the container has userid either 0 (root) or 1002.

If the container user is root, then it will have no issue in reading your files. But as soon as you edit some existing files, or produce new ones (e.g. pdf, html), these files will belong to root also on the host filesystem! Meaning that your local host user will not be able to use them, or delete them, since they belong to root.

Now if the container userid is 1002, Rstudio may not be able to read your files, edit them or produce new files. Even if it can, by setting some very permissive permissions, your local host user may not be able to use them.
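To see which numeric userid owns a file, and whether it matches your own, a small sketch such as the following can help. The function names `owner_uid` and `check_owner` are made up for this example.

```shell
# Print the numeric owner uid of a file (third column of `ls -ln`).
owner_uid() {
  ls -lnd -- "$1" | awk '{print $3}'
}

# Report whether the current user owns the given file.
check_owner() {
  if [ "$(owner_uid "$1")" = "$(id -u)" ]; then
    echo "owned by me"
  else
    echo "owned by uid $(owner_uid "$1")"
  fi
}

# Example: check a file produced from inside a container
# check_owner report.pdf
```

Running `check_owner` on the host against a file written by a container makes the mismatch visible: a file created by a root container process is reported as owned by uid 0.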

Of course one brute-force way of solving that issue is to run as root both on the host computer and within the docker container. This is not always possible and raises some obvious critical security concerns.

solving the file owner issue part 1: the docker run `--user` option

Because we cannot know in advance what the host userid will be (here 1001), we cannot pre-configure the userid of the docker container user.

docker run now provides a --user option that lets you create a pseudo user with some supplied userid at runtime. For example, docker run --user 1001 ... will create a docker container running with processes belonging to a user with userid 1001.
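Rather than hard-coding 1001, the host userid can be looked up with `id -u`. Here is a minimal sketch, where the helper names `run` and `run_as_host_user` and the `alpine` image are assumptions for illustration. It also passes the group id, since `--user` accepts a `uid:gid` pair.

```shell
# Dry-run wrapper: with DRY_RUN set, print the command instead of executing it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: run IMAGE [COMMAND...] with the uid and gid
# of the calling host user.
run_as_host_user() {
  run docker run --rm --user "$(id -u):$(id -g)" "$@"
}

# Example: show the ids the container process runs with
# run_as_host_user alpine id
```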

So why are we still discussing this issue? Isn’t it solved?

Here are some quirks about that dynamically created user:

it is a pseudo user
it does not have a home directory (/home/xxx)
it does not appear in /etc/passwd
it can not be preconfigured, e.g. with a bash profile, some env vars, application defaults etc…

We can work around these problems, but it can be tedious and frustrating. What we’d really like is to pre-configure a docker container user, and be able to dynamically change its userid at runtime…

solving the file owner issue part 2: enter `docker_userid_fixer`

docker_userid_fixer is an open source utility intended to be used as a docker entrypoint to fix the userid issue I just raised.

Let’s see how to use it: you set it as your docker ENTRYPOINT, specifying which user should be used and have its userid dynamically modified:

ENTRYPOINT ["/usr/local/bin/docker_userid_fixer","user1"]

Let’s be precise in our terms:

the target user is the user passed to docker_userid_fixer, here user1
the requested user is the user provisioned by docker run, i.e. the user that (initially) owns the first process (PID 1)

Then, when the container is created at runtime, there are two options:

either the requested userid (already) matches the target userid, then nothing has to be changed
or it does not. For example the requested userid is 1001, and the target userid is 1000. Then, docker_userid_fixer will fix the userid of the target user user1 from 1000 to 1001, directly in the container main process.

So in practice this solves our issue:

if you do not need to fix your container userid, just use docker run the usual way (without the --user option)
or you use the --user option: then, in addition to running your main process with the userid you requested, it will modify your pre-configured user to your requested userid, so that your container is running with your intended user and intended userid.
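The two cases can be summarized with a toy model. This is only a sketch of the decision described above, not the actual implementation of docker_userid_fixer: the function name `fixer_plan` is invented, and it only prints what would happen.

```shell
# Toy model of the two cases: prints what should happen, changes nothing.
# Arguments: requested uid (from `docker run --user`), target user's current uid.
fixer_plan() {
  requested_uid=$1
  target_uid=$2
  if [ "$requested_uid" -eq "$target_uid" ]; then
    echo "keep target uid $target_uid"
  else
    echo "change target uid from $target_uid to $requested_uid"
  fi
}

fixer_plan 1000 1000   # prints: keep target uid 1000
fixer_plan 1001 1000   # prints: change target uid from 1000 to 1001
```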

docker_userid_fixer setup

You can find instructions about the setup here.

But it boils down to:

build or download the tiny executable (17k)
copy it into your docker image
make it executable as setuid root
configure it as your entrypoint
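The copy and setuid steps could look like the sketch below. The helper name `install_setuid` and the mode 4755 (a conventional mode for setuid executables) are assumptions, so refer to the project's own instructions for the exact commands.

```shell
# Hypothetical helper: copy a binary and mark it setuid (mode 4755).
# For "setuid root" the file must also be owned by root, which is the default
# owner for files added with COPY during a docker build.
install_setuid() {
  cp "$1" "$2" && chmod 4755 "$2"
}

# Inside an image build this could be used as:
# install_setuid docker_userid_fixer /usr/local/bin/docker_userid_fixer
```

The setuid bit only grants root privileges when root owns the file, so the ownership matters as much as the mode.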

the gory details

I have put some short notes at https://github.com/kforner/docker_userid_fixer#how-it-works but I’ll try to rephrase.

The crux of the implementation is the setuid root of the docker_userid_fixer executable in the container. We need root permissions to change the userid, and this setuid enables that privileged execution only for the docker_userid_fixer program, and only for a very short time.

As soon as the userid has been modified if needed, docker_userid_fixer will switch the main process to the requested user (and userid!).

Organizing R development using srcpkgs

Karl Forner — Sun, 26 May 2024 22:00:00 GMT

Overview

This is an introduction to organizing R projects using source packages (powered by my R package srcpkgs). It is based on notes for a talk on 2024-05-27 at the Swiss Institute of Bioinformatics Vital-IT group Analysts meeting.

The objective is to organize R projects in order to:

reuse code
share code
increase robustness
enable analysis (code) reproducibility

The context is mostly for analysis oriented R projects.

R packages

All R users use R packages, the core ones such as base, stats, tools, and some from CRAN or BioConductor.

Why would you want to use R packages for your own code???

an R package is:

self-contained
- it bundles together all related code, the documentation, the relevant data and tests
the dependencies are explicitly stated, and are themselves R packages

On the natural evolution of code projects…

My view on the general evolution of analysis projects:

you start with a single script, sequential, with no functions
at one point (after writing hundreds or thousands of lines) you realize that you need some functions
then you start reusing those functions across projects by copy/paste. This raises a number of problems
- versioning: at one point you will fix or improve such a function
  - it may be difficult to remember which project contains the latest version
  - what of the projects that contain the incorrect versions?
then you may want, if you work in a team, to share this code with colleagues, or to use theirs
- it requires some documentation, even terse.
- there’s an increased responsibility. What if your code is wrong and impacts the projects of your colleagues? One remedy is to write tests for those functions.
- those functions are seldom independent, so that you can not just pick one
- all those functions are exposed (i.e public or exported).
  - if you start to use a low-level function in your project, and the next version refactors the shared code so that this function is changed or removed, updating the shared code will break the project.
for all those reasons you start packaging your reusable code as an R package
- you can add documentation, tests, group code logically. It brings a namespace so that you can decide what you expose.
But… it does NOT really solve the versioning problem
- in R, packages have to be installed (e.g. using install.packages()) before you can use them with library(mypkg)
- packages have a version number (N.B: this is not the same as code versioning)
- if you use version v1 in your project A, and version v2 in project B, you have to juggle with versions (install/uninstall). Of course there are some tools to deal with that (renv…), but they work with external packages (or you need some private custom repositories)
- and it’s very cumbersome. Suppose that in your project A you find a bug in the (installed) package. In order to fix it, you need to
  - fetch the source code of the package
  - try to reproduce your problem. Chances are that you need your project data, you have to reproduce your session
  - finally, if you manage to fix it, you have to publish it and install it.
my approach is to use what I call R source packages
- they are normal R packages, but instead of installing them on your R system, you load them directly from source in your R session.
- it was made possible by the infamous Hadley Wickham, and his devtools::load_all() function, that mimics the loading of an installed package
- this greatly helps with all those problems:
  - you embed your source packages inside your project (as git submodules, we’ll see that later). This solves the versioning/reproducibility problem at the level of your reusable code: each of your projects may use a different version
  - if you need to fix a bug, or improve and augment your reusable code, it’s as simple as editing the code of your project. And using srcpkgs, you can even easily reload the code inside your existing R sessions, without losing any computed data.
so far so good. Then for ease of maintenance/modularity, you start splitting your reusable code by category, and develop several R packages, e.g. one for some misc utilities, one for loading data from your database, one for some specific analysis…
- this is where srcpkgs becomes useful, since devtools was designed to manage a single R source package, not a collection/library of possibly inter-dependent packages.
  - additionally, srcpkgs has a useful little hack that enables you to use the standard library() function to load your source packages. So when your analysis is finalized, or deployed in production, with your packages installed in the standard way, your script will continue to work without any change.
But this does not solve the reproducibility for the external packages
- your code and source library most certainly use external packages, and also depend on your R version (and thus on the bioconductor version)
- it may also depend on your OS architecture (CPU…)
- this is out of scope for this talk, but one solution is to use a virtualized development environment: a docker container (cf https://rocker-project.org/) that contains a fixed version of R, and of all the needed external packages.
- now the challenge is to synchronize that docker container version with your source library version…
- also cf devcontainers

Summary

script --> script+functions --> script + source files --> R package --> R source package --> R source library [ + R docker env]

My recommended project setup

the source library of R packages
- should be a single dedicated git repository
  - recommended since it’s easier to have consistent versions of interdependent packages
  - but each package could be in its own git repository if needed
- each package should contain tests. This is very important, even if it’s counterintuitive: there is usually more value in the test suite than in the code itself (don’t get me started on that…)
- for internal packages, especially for an audience of developers, I personally think that the documentation is less important than, for example, for a publicly released package.
- you should use CI (Continuous Integration, like github actions or gitlab CI) to automatically run the automated tests each time you push to the repository.
- also, reporting the test coverage is important
the project code
- MUST be versioned in a git repository (in github/gitlab…): one repository per project
- should itself be an R (source) package
  - easier to add tests, documentation, vignettes
- but can be a single script or a set of source files
- should contain a given version (commit/tag/branch) of the source library as a git submodule
- should contain a vscode devcontainer to execute the project’s code (automatically usable via github codespaces)
the project R code will then use the srcpkgs package, that will automatically discover the R packages contained in the project folder, and transparently load them using the hacked library() function as if they were installed packages.
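Embedding a given version of the source library as a git submodule could be sketched as follows. The helper names `run` and `add_source_library`, and the URL, path and tag in the usage comment, are placeholders for illustration.

```shell
# Dry-run wrapper: with DRY_RUN set, print the command instead of executing it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: embed the source library at PATH and pin it to REF
# (a commit, tag or branch). Must be called from the project repository.
add_source_library() {
  url=$1
  path=$2
  ref=$3
  run git submodule add "$url" "$path" &&
  run git -C "$path" checkout "$ref" &&
  run git add "$path"
}

# Example (placeholder URL, folder and tag):
# add_source_library https://example.com/my-r-library.git srclib v1.2.0
```

After a final `git commit`, the project records exactly which commit of the source library it uses, so each project can pin its own version.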

Resources:

the github repository of srcpkgs
the online documentation
- notably this demo vignette: why would you need srcpkgs?

I (Karl Forner) am currently working as a consultant; contact me if you want help with using R, organizing development, developing R packages, or more generally with supporting your software development efforts.