Dockerfile strategies for Git


Solution 1

From Ryan Baumann's blog post “Git strategies for Docker”

There are different strategies for getting your Git source code into a Docker build. Many of them interact differently with Docker’s caching mechanism, and each may be more or less suited to your project and how you intend to use Docker.

RUN git clone

If you’re like me, this is the approach that first springs to mind when you see the commands available to you in a Dockerfile. The trouble is that it can interact in several unintuitive ways with Docker’s build caching mechanism. For example, if you update your git repository and then re-run a docker build that contains a RUN git clone command, you may or may not get the new commit(s), depending on whether the preceding Dockerfile commands have invalidated the cache.

One way to get around this is to use docker build --no-cache, but then if there are any time-intensive commands preceding the clone they’ll have to run again too.

Another issue is that you (or someone you’ve distributed your Dockerfile to) may unexpectedly come back to a broken build later on when the upstream git repository updates.

A two-birds-one-stone approach that still uses RUN git clone is to put the clone and a specific revision checkout on one line, e.g.:

RUN git clone https://github.com/example/example.git && cd example && git checkout 0123abcdef

Then updating the revision to check out in the Dockerfile will invalidate the cache at that line and cause the clone/checkout to run.

One possible drawback to this approach in general is that you have to have git installed in your container.

RUN curl or ADD a tag/commit tarball URL

This avoids having to have git installed in your container environment, and can benefit from being explicit about when the cache will break (i.e. if the tag/revision is part of the URL, that URL change will bust the cache). Note that if you use the Dockerfile ADD command to copy from a remote URL, the file will be downloaded every time you run the build, and the HTTP Last-Modified header will also be used to invalidate the cache.
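As a rough sketch of the curl variant (the repository URL and commit hash here are placeholders):

# curl and tar must be available in the image
RUN curl -fsSL https://github.com/example/example/archive/0123abcdef.tar.gz \
      | tar -xz -C /usr/src \
 && mv /usr/src/example-* /usr/src/example

Changing the hash in the URL changes the RUN line, which busts the cache at exactly that step.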

You can see this approach used in the golang Dockerfile.

Git submodules inside Dockerfile repository

If you keep your Dockerfile and Docker build in a separate repository from your source code, or your Docker build requires multiple source repositories, using git submodules (or git subtrees) in this repository may be a valid way to get your source repos into your build context. This avoids some concerns with Docker caching and upstream updating, as you lock the upstream revision in your submodule/subtree specification. Updating them will break your Docker cache as it changes the build context.

Note that this only gets the files into your Docker build context; you still need to use ADD commands in your Dockerfile to copy those paths to where you expect them in the container.

You can see this approach used here.
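As a rough sketch (URLs and paths here are placeholders), the setup might look like:

# One-time setup on the host, in the Dockerfile repository:
git submodule add https://github.com/example/example.git example
git commit -m "Pin example submodule at the current revision"

# Then in the Dockerfile, copy the submodule out of the build context:
ADD example /usr/src/example

Bumping the submodule to a new revision changes the build context, which busts the cache at the ADD line.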

Dockerfile inside git repository

Here, your Dockerfile lives in the same git repository as the code you want to build/test/deploy, so the code automatically gets sent as part of the build context and you can e.g. ADD . /project to copy the context into the container. The advantage is that you can test changes without having to commit/push them just to get them into a test docker build; the disadvantage is that every modification to any file in your working directory invalidates the cache at the ADD command. Sending the build context for a large source/data directory can also be time-consuming. So if you use this approach, make judicious use of the .dockerignore file, e.g. ignoring everything in your .gitignore and possibly the .git directory itself.
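A minimal .dockerignore along those lines (the entries are illustrative):

# .dockerignore
.git
*.log
tmp/

There is no built-in way to reuse .gitignore, so entries you want excluded from the build context have to be mirrored here by hand.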

Volume mapping

If you’re using Docker to set up a dev/test environment that you want to share among a wide variety of source repos on your host machine, mounting a host directory as a data volume may be a viable strategy. This gives you the ability to specify which directories you want to include at docker run-time, and avoids concerns about docker build caching, but none of this will be shared among other users of your Dockerfile or container image.
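For example, with a hypothetical dev image called dev-env:

# Mount the current source checkout into the container at run time
docker run --rm -v "$PWD":/project -w /project dev-env make test

Because the source is mounted rather than baked in, edits on the host are visible in the container immediately, with no rebuild.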

Solution 2

You generally have two approaches:

  • referencing a vault from which you fetch the secret data needed at build time (here, your ssh keys to access your private repo)

Update 2018: see "How to keep your container secrets secure", which includes:

  • Use volume mounts to pass secrets to a container at runtime (see the sketch after this list)
  • Have a plan for rotating secrets
  • Make sure your secrets are encrypted

  • or a squashing technique (not recommended, see comment)
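For the runtime volume-mount option above, a minimal sketch (paths and image name are placeholders):

# The key is mounted read-only at run time and never enters an image layer
docker run --rm -v "$HOME/.ssh/id_rsa":/root/.ssh/id_rsa:ro my-image \
    git clone git@host:repo/path.git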

For the second approach, see "Pulling Git into a Docker image without leaving SSH keys behind"

  • Add the private key to the Dockerfile
  • Add it to the ssh-agent
  • Run the commands that require SSH authentication
  • Remove the private key

Dockerfile:

# The key must live inside the build context (ADD cannot read ~/.ssh on the host)
ADD mykey /tmp/mykey
# Agent state does not survive across RUN steps, so do it all in one step
RUN chmod 600 /tmp/mykey && eval "$(ssh-agent -s)" && ssh-add /tmp/mykey \
 && bundle install   # or a similar command that requires SSH
RUN rm /tmp/mykey

Let’s build the image now:

$ docker build -t original .
  • Squash the layers:

    docker save original | sudo docker-squash -t squashed | docker load
    

Solution 3

There are several strategies I can think of:

Option A: Single stage inside the Dockerfile:

ADD ssh-private-key /root/.ssh/id_rsa
RUN git clone git@host:repo/path.git

This has several significant downsides:

  • Your private key is inside the docker image.
  • The step will be reused from a previous build’s cache on later builds, even when your repo changes, unless you break the cache on an earlier step. That’s because the RUN line itself is unchanged.

Option B: Multi-stage inside the Dockerfile:

# Stage 1: clone with credentials; never push these layers anywhere
FROM base-image as clone
ADD ssh-private-key /root/.ssh/id_rsa
# With the default WORKDIR of /, this clones into /path
RUN git clone git@host:repo/path.git
RUN rm -rf /path/.git

# Stage 2: copy only the source; the key and git internals stay behind
FROM base-image as build
COPY --from=clone /path /path
...

By using a multi-stage build, your ssh credentials now exist only on the build host, as long as you never push your "clone" stage layers anywhere. This is slightly better, but still has caching issues (see the tip at the end). Thanks to the rm step, the later COPY --from will not copy the git internals. Since the build stage and anything after it should be all you ship, being inefficient with layers in the clone stage is less of a concern.

Option C: From your CI server:

Typically, the Dockerfile lives in the code repo, and people tend to clone that repo first, before running the build (though it is possible to skip this by using a git repo as the build context). So you’ll often see CI servers perform the clone and update, rather than the Dockerfile itself. The Dockerfile then reduces to just:

COPY path /path

This has several advantages:

  • The credentials never get added to the docker image layers.
  • Updating the repo doesn’t require re-running the clone from scratch; the previous clone is already there, and you can run a git pull instead, which is much faster.
  • When copying files into the image, you can list .git in your .dockerignore to exclude all of the git internals. That way you only add the final state of the repo to your docker image, resulting in a much smaller image.

Admittedly, this option is saying "don't do that" to your question, but it's also the most popular option I've seen from people facing this challenge, for good reason.
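A rough sketch of the CI-side steps (repo URL, paths, and image name are placeholders):

# On the CI host, where the ssh credentials already exist:
if [ -d path/.git ]; then
    git -C path pull --ff-only       # reuse the previous clone
else
    git clone git@host:repo/path.git path
fi
docker build -t img:tag .            # the context now contains ./path

The Dockerfile’s COPY path /path then picks the files up from the build context without any credentials being involved.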

Option D: With BuildKit:

BuildKit has several experimental features that may be useful here. They require newer versions of Docker that may not be on every build host, and the syntax to enable them is not backwards compatible. The two main options are secret/ssh credential injection and cache directories. Both can inject a file or directory into a build step without it being saved into the resulting image layers. Here's what that could look like (this is untested):

# syntax=docker/dockerfile:experimental
FROM base-image
ARG CACHE_BUST
RUN --mount=type=cache,target=/git-cache,id=git-cache,sharing=locked \
    --mount=type=secret,id=ssh,target=/root/.ssh/id_rsa \
    if [ ! -d /git-cache/path/.git ]; then \
      git clone git@host:repo/path.git /git-cache/path; \
    else \
      (cd /git-cache/path && git pull --force); \
    fi; \
    mkdir -p /path && \
    tar -cC /git-cache/path --exclude .git . | tar -xC /path

And then the build would look like:

DOCKER_BUILDKIT=1 docker build \
  --secret id=ssh,src=$HOME/.ssh/id_rsa \
  --build-arg "CACHE_BUST=$(date +%s)" \
  -t img:tag \
  .

This is fairly convoluted, but has a few advantages:

  • The cache directory keeps the git repo from the previous build, avoiding a full clone on every build; only the changes are pulled.
  • The tar command is effectively a copy that excludes the .git directory from the final image, making your image smaller. This copy is needed because the cache directory is not saved into the resulting image layers.
  • The ssh credentials are injected as a secret, which behaves like a read-only single-file volume mount for that specific RUN step; the contents of the secret are not saved to the resulting image layer.

To read more about BuildKit's experimental features, see: https://github.com/moby/buildkit/blob/master/frontend/dockerfile/docs/experimental.md

Tip: Cache busting a specific line:

To bust the docker build cache on a specific line, you can inject a build arg that changes on every build, right before the RUN line that you want to re-run. In the BuildKit example above, there was the line:

ARG CACHE_BUST

before the RUN line that I did not want to cache, and the build included:

--build-arg "CACHE_BUST=$(date +%s)"

to inject a unique value for each build. This ensures the build always runs that step, even though the command is otherwise unchanged. The build arg is injected as an environment variable for the RUN, so docker sees that the command has changed and cannot reuse the cached layer.

Ideally, you would clone a specific tag or commit id, which allows you to cache builds that use that same git clone from previous builds. However, if you are cloning master, this cache busting technique will be needed.
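As a minimal standalone sketch of the same technique for classic (non-BuildKit) builds, with placeholder image and repo names:

FROM alpine
RUN apk add --no-cache git
# A new value for this arg invalidates the cache from this point on
ARG CACHE_BUST=unset
RUN git clone https://github.com/example/example.git /src

built with:

docker build --build-arg "CACHE_BUST=$(date +%s)" -t img:tag .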


Comments

  • Hemerson Varela, almost 2 years ago:

    What is the best strategy to clone a private Git repository into a Docker container using a Dockerfile? Pros/Cons?

    I know that I can add commands to the Dockerfile to clone my private repository into a docker container. But I would like to know which different approaches people have used in this case.

    It’s not covered in the Dockerfile best practices guide.

  • alexvicegrab, over 6 years ago:
    Adding the private key will add a layer to the image, meaning it is stored there and can be retrieved by someone smart enough, even if "removed" afterwards
  • VonC, over 6 years ago:
    @alexvicegrab Good point. 2+ years later, I have edited the answer to make clearer the first approach is preferable.
  • SunnyPro, over 2 years ago:
    The question is specifically about cloning a private repo. Your suggestions will not work for private repos.
  • mikeLundquist, about 2 years ago:
    If you use submodules, BE AWARE that their .git data is stored inside the main repo's .git directory! You have to copy over the main repo's .git folder if you want to run git commands for the submodule.