In an earlier post, we looked at how to create a Docker image for SciDB. The image built in that post followed the SciDB Community Edition Installation Guide very closely. That image is functional and a good learning resource, but not very efficient: it uses around 6GB of space and cannot be built automatically on Docker Hub because of its long build time. In this post, we revisit the topic and try to build a more efficient Docker image for SciDB. The source files for the image are available on GitHub here. The image is available on Docker Hub here.

Note: The Docker image described in this post is for SciDB 15.12 and for a single-node installation. The GitHub and Docker Hub repositories also contain images for SciDB 15.7.

Docker Image Considerations

To get a smaller, more space-efficient Docker image, we do the following:

  • Start from a small base image. In our case, we replace Ubuntu with Debian;
  • Chain related commands under one Docker statement;
  • Install the minimum required packages and clean up after the package manager.

We also want to be able to build this image automatically on Docker Hub. Docker Hub limits a build to one CPU core and two hours of running time. Since installing all the dependencies and building SciDB takes more than two hours on a single core, we have to split the image into two images. The dockerfiles for the two images are Dockerfile.pre and Dockerfile. In the first image, we install all the dependencies, download the SciDB source code, and build a few of the SciDB components. The second image is based on the first. In it, we finish building SciDB, install and set up SciDB, install Shim, and set up the image entry point.

In the following, we review the two dockerfiles. We start with the dockerfile for the first image, Dockerfile.pre, and we continue with the dockerfile for the second image, Dockerfile.

Pre-Installation Tasks

We take care of the pre-installation tasks as well as part of the build in the Dockerfile.pre file. The dockerfile starts by setting up the base image. We use Debian Linux, specifically the Jessie (8) release. We set TERM and DEBIAN_FRONTEND as build arguments (ARG), which act as environment variables during the build, in order to avoid warnings when installing packages (see here and here). The dockerfile starts like this:

FROM debian:8

ARG TERM=linux
ARG DEBIAN_FRONTEND=noninteractive

Next, we set a few environment variables for SciDB, like SCIDB_VER, SCIDB_SOURCE_PATH, SCIDB_INSTALL_PATH, etc. These will be used later by the build and install scripts of SciDB:

ENV SCIDB_VER=15.12 \
    SCIDB_VER_MINOR=1.4cadab5 \
    SCIDB_SOURCE_URL="https://docs.google.com/uc?id=0B7yt0n33Us0raWtCYmNlZWRxWG8&export=download"

ENV SCIDB_SOURCE_PATH=/usr/local/src/scidb-$SCIDB_VER.$SCIDB_VER_MINOR \
    SCIDB_INSTALL_PATH=/opt/scidb/$SCIDB_VER \
    SCIDB_BUILD_TYPE=Release

ENV PATH=$PATH:$SCIDB_INSTALL_PATH/bin
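
The variables are spread over several ENV statements because Docker substitutes variables in an instruction using the values defined before that instruction, so SCIDB_SOURCE_PATH could not refer to a SCIDB_VER set in the same statement. As a rough illustration only, the sketch below uses plain shell assignments as stand-ins for the ENV statements and prints the expanded values; the show_scidb_paths helper is a hypothetical name, not part of the dockerfile.

```shell
# Plain shell assignments standing in for the ENV statements above.
SCIDB_VER=15.12
SCIDB_VER_MINOR=1.4cadab5

# These refer to values that were already defined by the earlier assignments.
SCIDB_SOURCE_PATH=/usr/local/src/scidb-$SCIDB_VER.$SCIDB_VER_MINOR
SCIDB_INSTALL_PATH=/opt/scidb/$SCIDB_VER

# Hypothetical helper: print the paths the later build steps would see.
show_scidb_paths() {
    echo "source:  $SCIDB_SOURCE_PATH"
    echo "install: $SCIDB_INSTALL_PATH"
    echo "bin:     $SCIDB_INSTALL_PATH/bin"
}

show_scidb_paths
```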

We install the dependencies required to build and run SciDB next. Most of the dependencies can be installed using a single apt-get install command. Below is a snippet of the RUN statement. The full statement can be found here.

## Install dependencies
RUN apt-get update && apt-get install --assume-yes --no-install-recommends \
        apt-transport-https \
        bison \
        [...]
   && rm -rf /var/lib/apt/lists/*

Notice how we instruct apt-get not to install any recommended packages using --no-install-recommends, and how we clean up after apt-get by removing its package lists. One special dependency is the Java Development Kit (JDK). Jessie comes with JDK version 7, while SciDB requires JDK version 8 (openjdk-8-jdk). To address this, we use a special Jessie repository, the jessie-backports repository:

## Install openjdk-8-jdk from jessie-backports
## Install dependencies requiring default-jre-headless
RUN echo "deb http://http.debian.net/debian jessie-backports main" > \
        /etc/apt/sources.list.d/jessie-backports.list && \
    apt-get update && apt-get install --assume-yes --no-install-recommends \
        ant \
        ant-contrib \
        junit \
        libprotobuf-java \
        openjdk-8-jdk \
        openjdk-8-jre-headless \
    && rm -rf /var/lib/apt/lists/*

Another special case is libpqxx, the C++ library for communicating with PostgreSQL. Jessie comes with version 4, while SciDB requires version 3. To address this, we build version 3 of the library from source, using the source repository of the previous Debian release, wheezy. In a single RUN statement, we install the build dependencies, build the library from source and generate a package, uninstall the build dependencies, install the generated package, and, finally, clean up the intermediate files. Below is a snippet of the RUN statement. The full statement can be found here.

## Build and install libpqxx3 from wheezy
RUN echo "deb-src http://http.debian.net/debian wheezy main" > \
        /etc/apt/sources.list.d/wheezy.list && \
    apt-get update && apt-get build-dep --assume-yes --no-install-recommends \
        libpqxx3 \
    && mkdir /usr/local/src/libpqxx3 && cd /usr/local/src/libpqxx3 && \
    apt-get source --build \
        libpqxx3 \
    && apt-get purge --assume-yes \
        autotools-dev \
        bsdmainutils \
        [...]
    && dpkg --install \
        libpqxx-3.1_3.1-1.1_amd64.deb \
        libpqxx3-dev_3.1-1.1_amd64.deb \
    && rm -rf \
        /etc/apt/sources.list.d/wheezy.list \
        /usr/local/src/libpqxx3 \
        /var/lib/apt/lists/*

The last set of dependencies consists of the packages provided by Paradigm4. Paradigm4 does not provide packages for Debian; instead, they provide packages for Ubuntu. Since Debian and Ubuntu use the same package management system, we can use the packages provided for Ubuntu as-is. We add the Paradigm4 repository to our list of repositories and install the required packages:

## Install Paradigm4 packages
RUN wget --no-verbose --output-document - https://downloads.paradigm4.com/key | \
        apt-key add - && \
    echo "deb https://downloads.paradigm4.com/ ubuntu14.04/3rdparty/" > \
        /etc/apt/sources.list.d/scidb.list && \
    apt-get update && apt-get install --assume-yes --no-install-recommends \
        scidb-$SCIDB_VER-ant \
        scidb-$SCIDB_VER-cityhash \
        scidb-$SCIDB_VER-libboost1.54-all-dev \
        scidb-$SCIDB_VER-libcsv \
        scidb-$SCIDB_VER-libmpich2-dev \
        scidb-$SCIDB_VER-mpich2 \
    && rm -rf /var/lib/apt/lists/*

Normally, these packages would be installed using the deploy.sh script provided with SciDB. For full control, we skip using this script and install the dependencies manually.

Building SciDB

In order to build SciDB, we first download the source code. The official SciDB source code location is on Google Drive. In order to download a file from Google Drive, we have to make two requests. The first request is to obtain some cookies and a confirmation code which are used in the second request. We extract the source code directly and skip saving the archive:

## Get SciDB source code
RUN wget --no-verbose --output-document - --load-cookies cookies.txt \
        "$SCIDB_SOURCE_URL&`wget --no-verbose --output-document - \
            --save-cookies cookies.txt "$SCIDB_SOURCE_URL" | \
                grep --only-matching 'confirm=[^&]*'`" | \
       tar --extract --gzip --directory=/usr/local/src
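
To make the two-request flow easier to follow, here is a minimal sketch of the token extraction on its own, run against a made-up page fragment instead of a live Google Drive response. The sample_page content and the extract_confirm_token name are assumptions for illustration; the added head -n 1 (not in the dockerfile) only guards against the token appearing more than once.

```shell
# Extract the first "confirm=<code>" token from an HTML page read on stdin.
# Uses the same grep pattern as the RUN statement above (GNU grep assumed).
extract_confirm_token() {
    grep --only-matching 'confirm=[^&]*' | head -n 1
}

# Hypothetical fragment of the page returned by the first request.
sample_page='<a href="/uc?export=download&amp;confirm=AbC1&amp;id=FILE_ID">Download anyway</a>'

token=$(printf '%s\n' "$sample_page" | extract_confirm_token)

# The second request appends this token to the original URL.
echo "second request URL suffix: &$token"
```

The pattern stops at the first & character, so it works whether the page separates parameters with & or with the HTML-escaped form.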

Next, we apply a set of patches provided by Paradigm4, also located on Google Drive:

## Apply SciDB patches
ADD https://docs.google.com/uc?id=0B8eyzr2ndWOTSFRXWHhOc1ZYTGM&export=download \
    $SCIDB_SOURCE_PATH/src/query/ops/input/ChunkLoader.h
ADD https://docs.google.com/uc?id=0B8eyzr2ndWOTakhoVjloS2l1aVE&export=download \
    $SCIDB_SOURCE_PATH/src/query/ops/input/ChunkLoader.cpp

Since SciDB is not intended to be built on Debian, we patch a few of the build scripts so that they run successfully under Debian. The full set of patches can be inspected here.

## Apply Debian 8 patches
COPY patch $SCIDB_SOURCE_PATH-patch/
RUN cd $SCIDB_SOURCE_PATH && \
    cat $SCIDB_SOURCE_PATH-patch/* | patch --strip=1

Finally, we can start building SciDB. In Dockerfile.pre we only build some of the libraries:

## Build SciDB libraries (first few libs only)
RUN cd $SCIDB_SOURCE_PATH && \
    env PATH=$PATH:/opt/scidb/$SCIDB_VER/3rdparty/mpich2/bin \
        ./run.py setup --force && \
    cd stage/build && make -j2 \
        json_lib \
        MurmurHash_lib \
        util_lib \
        scidb_msg_lib \
        genmeta \
        catalog_lib \
        array_lib \
        system_lib \
        compression_lib \
        ops_lib \
        scalar_proc_lib \
        qproc_lib \
        usr_namespace_lib \
        io_lib \
        network_lib

This concludes the first dockerfile. We continue our review with the second dockerfile, Dockerfile. We base this image on the image built using Dockerfile.pre and the first command is to continue and finish building SciDB:

FROM rvernica/scidb:15.12-pre


## Build SciDB (leftover)
RUN $SCIDB_SOURCE_PATH/run.py make -j2

Next, we set some build arguments and corresponding environment variables for the SciDB installation running in this container. The build arguments specified with ARG (see Docker documentation) can be modified at build time, if required. We will also install Shim in this image and pin down a Shim version using the SHA-1 of a GitHub commit. This avoids the surprise of picking up a newer and possibly incompatible Shim version at a later time:

ARG SCIDB_INSTANCE_NUM=2
ARG SCIDB_NAME=scidb
ARG SCIDB_LOG_LEVEL=WARN

ENV SCIDB_INSTANCE_NUM=$SCIDB_INSTANCE_NUM \
    SCIDB_NAME=$SCIDB_NAME \
    SCIDB_DATA_PATH=$SCIDB_INSTALL_PATH/DB-$SCIDB_NAME

ENV SHIM_SHA1=854a4fb6c8f14e39010138ea045f0d3b431c607d \
    SHIM_VERSION=v$SCIDB_VER-20-g854a

We now set up a password-less SSH server in the container. This might look redundant, but it is required by the installation script. The script assumes the installation is made on multiple hosts at a time and logs in to each of them, even if the installation is only done on the current host. Moreover, we need to modify a setting in the Linux Pluggable Authentication Modules (PAM) configuration for the SSH server so that it allows connections inside the container (see here):

## Setup SSH
RUN sed --in-place \
        's/session\s*required\s*pam_loginuid.so/session optional pam_loginuid.so/g' \
        /etc/pam.d/sshd && \
    echo 'StrictHostKeyChecking no' >> /etc/ssh/ssh_config && \
    ssh-keygen -f /root/.ssh/id_rsa -q -N "" && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
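
As a quick way to see what the sed expression does, the sketch below applies the same substitution to a sample line on standard input rather than to /etc/pam.d/sshd. The relax_pam_loginuid name and the sample line are illustrative assumptions; the \s escape relies on GNU sed, which is what Debian ships.

```shell
# Apply the same substitution as the RUN statement above, but to stdin,
# so that no file under /etc is touched.
relax_pam_loginuid() {
    sed 's/session\s*required\s*pam_loginuid.so/session optional pam_loginuid.so/g'
}

# Hypothetical line in the style of /etc/pam.d/sshd.
sample='session    required     pam_loginuid.so'

# The module stays in the session stack but is no longer mandatory.
printf '%s\n' "$sample" | relax_pam_loginuid
```

Lines that mention other PAM modules pass through unchanged, so only the pam_loginuid.so rule is downgraded from required to optional.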

Before installing SciDB, we also need to set up PostgreSQL. We generate a random password for the PostgreSQL user used by SciDB ($SCIDB_NAME) and save it in the .pgpass file of the root user. Once we start the SSH and PostgreSQL services, we are ready to install SciDB inside the container. We use the run.py script provided with SciDB to install SciDB. As a final step, we set the log level to the one configured in the environment:

## Setup PostgreSQL and SciDB
RUN echo "127.0.0.1:5432:$SCIDB_NAME:$SCIDB_NAME:`date +%s | sha256sum | base64 | head -c 32`" \
        > /root/.pgpass && \
    chmod go-r /root/.pgpass && \
    service ssh start && \
    service postgresql start && \
    echo n | $SCIDB_SOURCE_PATH/run.py install && \
    sed --in-place \
        s/log4j.rootLogger=DEBUG/log4j.rootLogger=$SCIDB_LOG_LEVEL/ \
        $SCIDB_INSTALL_PATH/share/scidb/log1.properties
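
The .pgpass step can be tried outside the image with a small sketch like the one below. It writes to a temporary file instead of /root/.pgpass, and the make_pgpass_line helper is a hypothetical name. The line layout follows the PostgreSQL password file format, hostname:port:database:username:password.

```shell
# Build one .pgpass line: hostname:port:database:username:password
make_pgpass_line() {
    name=$1
    # Same recipe as the RUN statement above: hash the current time,
    # encode it, and keep the first 32 characters.
    password=$(date +%s | sha256sum | base64 | head -c 32)
    echo "127.0.0.1:5432:$name:$name:$password"
}

# Write to a scratch file rather than /root/.pgpass.
pgpass_file=$(mktemp)
make_pgpass_line scidb > "$pgpass_file"
chmod go-r "$pgpass_file"

# Show the line with the password masked.
sed 's/:[^:]*$/:<32-char password>/' "$pgpass_file"
```

Because the password is derived from the clock, it is unpredictable only to someone who does not know the build time; reading bytes from /dev/urandom would be one possible alternative if that matters for a deployment.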

Next, we install Shim. We download and unpack the source code of the pinned version from GitHub. We set the Shim version in the Makefile and run the make service command, which compiles Shim and installs it as a service:

## Install Shim
RUN wget --no-verbose --output-document - \
        https://github.com/Paradigm4/shim/archive/$SHIM_SHA1.tar.gz | \
        tar --extract --gzip --directory=/usr/local/src && \
    cd /usr/local/src/shim-$SHIM_SHA1 && \
    sed --in-place "s/^GIT_VERSION := .*$/GIT_VERSION := $SHIM_VERSION/" src/Makefile && \
    make service
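
The SHIM_VERSION value follows the tag-count-gabbrev shape that git describe produces, and its g854a suffix matches the start of SHIM_SHA1. The sketch below checks that relationship before rewriting GIT_VERSION in a scratch file. The version_matches_sha helper and the original content of the scratch Makefile line are assumptions for illustration; the actual Shim Makefile may define GIT_VERSION differently.

```shell
# Values copied from the ENV statement earlier in the dockerfile.
SHIM_SHA1=854a4fb6c8f14e39010138ea045f0d3b431c607d
SHIM_VERSION=v15.12-20-g854a

# Succeed when the "g<abbrev>" suffix of the version is a prefix of the SHA-1.
version_matches_sha() {
    abbrev=${1##*-g}
    case $2 in
        "$abbrev"*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Scratch Makefile with a hypothetical GIT_VERSION line.
scratch=$(mktemp)
echo 'GIT_VERSION := $(shell git describe --tags)' > "$scratch"

# Pin the version only if it is consistent with the pinned commit.
version_matches_sha "$SHIM_VERSION" "$SHIM_SHA1" &&
    sed --in-place "s/^GIT_VERSION := .*$/GIT_VERSION := $SHIM_VERSION/" "$scratch"
cat "$scratch"
```

A check like this is optional, but it can catch the case where the SHA-1 is updated and the version string is forgotten.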

We finalize the image by setting an ENTRYPOINT script (see Docker documentation) and exposing the SciDB and Shim ports. The entry point script is discussed in the next section.

COPY docker-entrypoint.sh /
ENTRYPOINT ["/docker-entrypoint.sh"]


## Port | App
## -----+-----
## 1239 | SciDB iquery
## 8080 | SciDB Shim (HTTP)
## 8083 | SciDB Shim (HTTPS)
EXPOSE 1239 8080 8083
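
Note that EXPOSE only documents the ports; to reach them from the host they still have to be published when the container is started. The sketch below assembles such a docker run command and prints it instead of running it, so it can be tried without Docker installed. The print_run_command helper is a hypothetical name, and mapping each port to the same host port is just one possible choice.

```shell
# Print (rather than run) a docker command that publishes each listed
# container port on the same host port.
print_run_command() {
    image=$1
    shift
    flags=""
    for port in "$@"; do
        flags="$flags --publish $port:$port"
    done
    echo "docker run --detach$flags $image"
}

# The three ports exposed by the image: iquery, Shim HTTP, Shim HTTPS.
print_run_command rvernica/scidb:15.12 1239 8080 8083
```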

Entry Point Script

An entry point script in a Docker image is executed every time the container starts. It is normally used for starting and initializing various services in the container. In our case, we use this script to start SSH, PostgreSQL, Shim, and SciDB. The docker-entrypoint.sh script looks like this:

#!/bin/bash
set -o errexit


service ssh        start
service postgresql start
service shimsvc    start


$SCIDB_INSTALL_PATH/bin/scidb.py startall $SCIDB_NAME


trap "$SCIDB_INSTALL_PATH/bin/scidb.py stopall $SCIDB_NAME; \
      service postgresql stop" \
     EXIT HUP INT QUIT TERM


if [ "$1" = '' ]
then
    tail -f $SCIDB_DATA_PATH/0/0/scidb.log
else
    exec "$@"
fi

Once we start all the services, we trap any exit or interrupt signals and stop the SciDB and PostgreSQL services when such signals are generated. This allows us to do a clean shutdown of the databases when the container is stopped. If a command is provided when the container is started ($1), for example bash, we execute that command. Otherwise, we tail the SciDB logs.
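
The trap-then-dispatch pattern can be tried in isolation with the sketch below, where echo statements stand in for the real services. It is not the entry point script itself: entrypoint_sketch is a hypothetical name, and the sketch runs the supplied command as a child process rather than through exec, because exec replaces the shell and a replaced shell can no longer run its traps.

```shell
# The "( ... )" function body runs in a subshell, so the EXIT trap fires
# each time the function finishes.
entrypoint_sketch() (
    # Stand-in for stopping SciDB and PostgreSQL.
    trap 'echo "cleanup: stopping services"' EXIT

    if [ "$#" -eq 0 ]; then
        # Stand-in for tailing the SciDB log.
        echo "default: tailing log"
    else
        # Run the command as a child (no exec) so the trap above still runs.
        "$@"
    fi
)

entrypoint_sketch
entrypoint_sketch echo "custom command"
```

In both calls the cleanup message is printed last, which mirrors the clean shutdown order described above.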

Using the Image

Using the two dockerfiles, Dockerfile.pre and Dockerfile, we can build the two images locally like this:

$ docker build --tag rvernica/scidb:15.12-pre --file Dockerfile.pre .
Sending build context to Docker daemon 34.82 kB
Step 1/16 : FROM debian:8
 ---> 1b088884749b
...
Step 16/16 : RUN cd $SCIDB_SOURCE_PATH && ...
...
 ---> f8e9e0a1fe8a
Removing intermediate container 754d00074d53
Successfully built f8e9e0a1fe8a
$ docker build --tag rvernica/scidb:15.12 .
Sending build context to Docker daemon 34.82 kB
Step 1/14 : FROM rvernica/scidb:15.12-pre
 ---> f8e9e0a1fe8a
...
Step 14/14 : EXPOSE 1239 8080 8083
 ---> Running in 0b895e473a16
 ---> 4536cbbc3a0e
Removing intermediate container 0b895e473a16
Successfully built 4536cbbc3a0e

As an alternative, we can download the already built images from Docker Hub:

$ docker pull rvernica/scidb:15.12-pre
15.12-pre: Pulling from rvernica/scidb
386a066cd84a: Pull complete
3364855bee9a: Pull complete
1d5a83062528: Pull complete
58b5c175470a: Pull complete
725863ff1c79: Pull complete
8d3cadf8ac47: Pull complete
066e3f9e305c: Pull complete
24cf7b021165: Pull complete
fa8345d54686: Pull complete
0a77a3de8243: Pull complete
b7a2f2bab106: Pull complete
Digest: sha256:439c80c3232465236c97ba0aa4880188b7c36117a377c568ab31823174d80597
Status: Downloaded newer image for rvernica/scidb:15.12-pre
$ docker pull rvernica/scidb:15.12
15.12: Pulling from rvernica/scidb
386a066cd84a: Already exists
3364855bee9a: Already exists
1d5a83062528: Already exists
58b5c175470a: Already exists
725863ff1c79: Already exists
8d3cadf8ac47: Already exists
066e3f9e305c: Already exists
24cf7b021165: Already exists
fa8345d54686: Already exists
0a77a3de8243: Already exists
b7a2f2bab106: Already exists
e1adf4ff3e84: Pull complete
0bdcfbd8cc1e: Pull complete
83e1b10fc795: Pull complete
ca9967ab1660: Pull complete
e10871d83d82: Pull complete
Digest: sha256:e90e7e6da3b912939e47269cacdbd5fcbff275a4755dd19dfe601cff95fcac50
Status: Downloaded newer image for rvernica/scidb:15.12

We can download scidb:15.12 directly, without downloading scidb:15.12-pre first. Docker automatically downloads the necessary layers. Once we have the images, we can take a look at their size with docker images:

$ docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
rvernica/scidb      15.12               e259f6d496ba        7 days ago          1.89 GB
rvernica/scidb      15.12-pre           913b76a3a60c        7 days ago          1.53 GB

Note that the total space occupied on disk is not the sum of the two sizes but the larger of the two, since the second image reuses the layers of the first. Now, we can start a Docker container using:

$ docker run --tty --interactive rvernica/scidb:15.12 bash
[ ok ] Starting OpenBSD Secure Shell server: sshd.
[ ok ] Starting PostgreSQL 9.4 database server: main.
Starting shim
shim: SciDB HTTP service started on port(s) 8080,8083, with web root [/var/lib/shim/wwwroot], talking to SciDB on port 1239
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start((server 0 (127.0.0.1) local instance 0))
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start((server 0 (127.0.0.1) local instance 1))
scidb.py: INFO: Starting SciDB server.
root@3cb209a92b40:/# iquery --afl --query "list('libraries')"
{inst,n} name,major,minor,patch,build,build_type
{0,0} 'SciDB',15,12,1,80403125,'Release'
{1,0} 'SciDB',15,12,1,80403125,'Release'
root@3cb209a92b40:/# exit
exit

Notice how we specify bash as the command to be passed to the entry point script. The command is executed by the entry point script once it completes starting SciDB. In the Bash terminal, we can then connect to SciDB using iquery (see documentation). We can also directly start the iquery client without using a Bash terminal:

$ docker run --tty --interactive rvernica/scidb:15.12 iquery --afl
[ ok ] Starting OpenBSD Secure Shell server: sshd.
[ ok ] Starting PostgreSQL 9.4 database server: main.
Starting shim
shim: SciDB HTTP service started on port(s) 8080,8083, with web root [/var/lib/shim/wwwroot], talking to SciDB on port 1239
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start((server 0 (127.0.0.1) local instance 0))
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start((server 0 (127.0.0.1) local instance 1))
scidb.py: INFO: Starting SciDB server.
AFL%

If we don’t specify any command, the container will tail the SciDB logs and shut down gracefully when asked to terminate:

$ docker run --tty --interactive rvernica/scidb:15.12
[ ok ] Starting OpenBSD Secure Shell server: sshd.
[ ok ] Starting PostgreSQL 9.4 database server: main.
Starting shim
shim: SciDB HTTP service started on port(s) 8080,8083, with web root [/var/lib/shim/wwwroot], talking to SciDB on port 1239
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start((server 0 (127.0.0.1) local instance 0))
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start((server 0 (127.0.0.1) local instance 1))
scidb.py: INFO: Starting SciDB server.
load = fn(output_array,input_file,instance_id,format,max_errors,shadow_array,isStrict){store(input(output_array,input_file,instance_id,format,max_errors,shadow_array,isStrict),output_array)};
sys_create_array_aux = fn(_A_,_E_,_C_){join(aggregate(apply(_A_,_t_,_E_),approxdc(_t_)),build(<values_per_chunk:uint64 null>[i=0:0,1,0],_C_))};
sys_create_array_att = fn(_L_,_S_,_D_){redimension(join(build(<n:int64 null,lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null>[No=0:0,1,0],_S_,true),cast(aggregate(_L_,min(_D_),max(_D_),approxdc(_D_)),<min:int64 null,max:int64 null,count:int64 null>[No=0:0,1,0])),<lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null,min:int64 null,max:int64 null,count:int64 null>[n=0:*,?,0])};
sys_create_array_dim = fn(_L_,_S_,_D_){redimension(join(build(<n:int64 null,lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null>[No=0:0,1,0],_S_,true),cast(aggregate(apply(aggregate(_L_,count(*),_D_),_t_,_D_),min(_t_),max(_t_),count(*)),<min:int64 null,max:int64 null,count:int64 null>[No=0:0,1,0])),<lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null,min:int64 null,max:int64 null,count:int64 null>[n=0:*,?,0])}
2016-11-23 21:48:37,126 [0x7f5d36edf7c0] [DEBUG]: Network manager is intialized
2016-11-23 21:48:37,126 [0x7f5d36edf7c0] [DEBUG]: NetworkManager::run()
2016-11-23 21:48:37,126 [0x7f5d36edf7c0] [DEBUG]: server-id = 0
2016-11-23 21:48:37,126 [0x7f5d36edf7c0] [DEBUG]: server-instance-id = 0
2016-11-23 21:48:37,136 [0x7f5d36edf7c0] [DEBUG]: Registered instance # 0
2016-11-23 21:48:37,136 [0x7f5d36edf7c0] [INFO ]: SciDB instance. SciDB Version: 15.12.1. Build Type: Release. Commit: 4cadab5. Copyright (C) 2008-2015 SciDB, Inc. is exiting.
^C
scidb.py: INFO: stop(server 0 (127.0.0.1))
scidb.py: INFO: checking (server 0 (127.0.0.1)) 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: checking (server 0 (127.0.0.1)) 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: checking (server 0 (127.0.0.1)) 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: checking (server 0 (127.0.0.1)) 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: Found 0 scidb processes
[ ok ] Stopping PostgreSQL 9.4 database server: main.
scidb.py: INFO: stop(server 0 (127.0.0.1))
scidb.py: INFO: Found 0 scidb processes
[ ok ] Stopping PostgreSQL 9.4 database server: main.

The dockerfiles discussed here, as well as other dockerfiles, including those for SciDB 15.7, are available here. The Docker images built using these dockerfiles are available here.