Using cache in GitLab CI with Docker-in-Docker

Published on 10/2/20 at 10:32 PM. Updated on 10/4/20 at 3:16 PM.

An important part of pipelines: build time! And using cache in CI helps a lot.

In the previous post we went through the details of building Docker images in GitLab CI. But we didn't talk about how to optimize the build time using caches.

Docker best practices

Before going into optimizations at the GitLab CI level, we need to optimize our images. You can find these tips in Docker's documentation, so I'll just summarize a few points.

Instruction order in a Dockerfile

When you build an image, it is made of multiple layers: one layer is added per instruction.
If we build the same image again without modifying any file, Docker reuses the existing layers rather than re-executing the instructions. If a file has changed for a given instruction, the cache is invalidated for that instruction and all the following ones.

A little illustration to explain: the 2nd instruction has changed, so the cache is not used for instructions 2 and 3.

By changing the instruction order, I optimize cache usage by putting the instructions most likely to change at the end:

That's the most important point: an image is made of multiple layers, and we can accelerate its build by reusing the layer cache from the previous image version.

Other tips

The 1st tip is the most important one for this post, but while we're here... Summarizing Docker's documentation, you can also:

- use a .dockerignore file to keep the build context small;
- start from minimal base images (alpine or slim variants);
- use multi-stage builds to ship only what the final image needs;
- combine related commands into a single RUN instruction to limit the number of layers.

Using Docker image layer caching in CI

The issue with building images in CI is that with Docker-in-Docker, each job gets a fresh Docker instance whose local image store is empty. And if we don't have an image to base our build on: no cache.

No panic, we can fix that easily!

The --cache-from option

Since Docker 1.13, we can use the --cache-from option with the build command to specify which image to reuse cache layers from:

docker build --cache-from image:old -t image:new -f ./Dockerfile .

We just need to log in to our GitLab project registry, pull the most recent image, and base our build on it:

build:
    before_script:
        - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin $CI_REGISTRY
        - docker pull "$CI_REGISTRY_IMAGE:latest" || true # don't fail if the image doesn't exist yet (e.g. very first pipeline)
    script:
        - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t "$CI_REGISTRY_IMAGE:new-tag" -f ./Dockerfile .

What about Git tags?

If like me you have a dedicated Dockerfile for production, you won't gain anything by using cache from a development image: the cache will always be invalidated at some point.
But if you maintain a CHANGELOG in the Keep a Changelog format, and/or your Git tags are also your Docker tags, you can get the previous version and use the cache from that image version.

Example with a previous version fetched from the CHANGELOG.md:

release:
    before_script:
        - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin $CI_REGISTRY
    script:
        - export PREVIOUS_VERSION=$(perl -lne 'print "v${1}" if /^##\s\[(\d\.\d\.\d)\]\s-\s\d{4}(?:-\d{2}){2}\s*$/' CHANGELOG.md | sed -n '2 p')
        - docker build --cache-from "$CI_REGISTRY_IMAGE:$PREVIOUS_VERSION" -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_TAG" -f ./prod.Dockerfile .

« Why from the changelog and not just using Git tags? »
In the changelog, I can specify that a version is unstable so it won't be used. That said, this is not part of the changelog specification.

Caching dependencies

For most projects, « Docker layer caching » is enough to optimize the build time. But we can try to go further.

Caching in CI/CD

Cache in CI/CD is about saving directories or files across pipelines. Most of the time, we give cache entries a key so they can be shared between jobs in various ways.
This cache avoids reinstalling dependencies for each build.

The issue here? We're building a Docker image, so dependencies are installed inside a container.
We can't cache a dependency directory if it doesn't exist in the job workspace.

Caching during image build

To start, we must remove the directories to cache from our .dockerignore files. Dependencies will still be installed inside a container, but the GitLab Runner will extract them into the job workspace. Our goal is to send the cached version into the build context.

We define the directories to cache in the job settings, with a key that shares the cache per branch and stage.
Then we create a container without starting it using the docker create command, and copy the dependency directories from the container to the host with docker cp:

build:
    stage: build
    cache:
        key: "$CI_JOB_STAGE-$CI_COMMIT_REF_SLUG"
        paths:
            - vendor/
            - node_modules/
    before_script:
        - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin $CI_REGISTRY
        - docker pull "$CI_REGISTRY_IMAGE:latest" || true # don't fail if the image doesn't exist yet
    script:
        - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t "$CI_REGISTRY_IMAGE:new-tag" -f ./Dockerfile .
    after_script:
        # Create a container from the built image, without starting it
        - docker create --name app "$CI_REGISTRY_IMAGE:new-tag"
        # Replace the old dependencies with the ones installed in the image
        - rm -rf ./vendor
        - docker cp app:/var/www/html/vendor/ ./vendor
        - rm -rf ./node_modules
        - docker cp app:/var/www/html/node_modules/ ./node_modules

« And why delete the dependency directories before copying? »
To avoid mixing old dependencies with the new ones; otherwise we risk keeping unused dependencies in the cache, which would make both the cache and the images heavier.

If you need to cache directories in testing jobs, it's easier: use volumes!

Honey, I enlarged the cache...

While testing all of this before writing, I ended up with images growing in size at every build. A syntax error caused the dependency directories to be saved inside their previous version.

If like me you tend to break everything, adopt good habits as soon as possible: version your cache keys!



In my case, I realized the error after a few builds; the cache was already corrupted, as it contained the « dependency-ception » I had caused. Whatever I changed in the code, the cache would still be reused. It needed to be invalidated! How? By changing its key!

All your cache keys could have a -v1 suffix, then -v2 if you have an issue.

Bonus: sharing a Docker image between jobs

Ideally, you shouldn't have to rebuild your Docker image for each job in your pipeline. So we build an image with a unique tag (e.g. using the commit hash).

I always set a variable for the pipeline image name:

variables:
    DOCKER_CI_IMAGE: "$CI_REGISTRY_IMAGE:ci-$CI_COMMIT_SHORT_SHA"

The 1st strategy is to use GitLab's container registry: we docker push after the build, then docker pull when needed in the next jobs.

But a 2nd method can be interesting, as it avoids filling your registry with useless CI tags.

The save / load alternative

As an alternative to the « push / pull » method, we can save a Docker image as a .tar archive and share it with other jobs by making the archive an artifact.

In every job, we automatically get artifacts from previous stages.

During the build, we create the archive using docker save and gzip:

build:
    stage: build
    artifacts:
        paths:
            - app.tar.gz
    cache: ...
    before_script: ...
    script:
        - docker build --cache-from "$CI_REGISTRY_IMAGE:latest" -t "$DOCKER_CI_IMAGE" -f ./Dockerfile .
    after_script:
        - ...
        - docker save $DOCKER_CI_IMAGE | gzip > app.tar.gz

In the other jobs, you don't need to log in to the registry; just load the image from the archive retrieved as an artifact:

test:
    stage: test
    before_script:
        - docker load --input app.tar.gz
    script:
        - docker run -d --name app $DOCKER_CI_IMAGE
        - ...

Conclusion

I personally use the « push / pull » technique; the other one didn't fully convince me, but it probably depends on the project. These techniques saved me at least a minute per pipeline.
I hope you discovered some GitLab CI or Docker features.
I think we'll talk soon about GitLab CI's artifacts with PHP tests and static analysis tools.

Comments: 1

kordeviant 01/06/2022 - 09:31
I did the save/load alternative but my pipeline time didn't improve much; using a local cache, like Docker on the host, has a much bigger effect.
