<h1 id="machine-learning-in-scidb">Machine Learning in SciDB</h1>
<p>Rares Vernica, 2017-10-01, <em>Thinking in SciDB: Adventures in the land of Multi-Dimensional Arrays</em></p>
<p>Popular data processing platforms offer users the ability to inject an
external process into the data processing pipeline. The data flowing
through the data pipeline is fed as input to the external process,
while the output produced by the process is fed back into the
pipeline. The external process runs an executable or a script. This
pattern resembles the popular
<a href="https://en.wikipedia.org/wiki/Pipeline_(Unix)">Unix pipelines</a> (or
<em>pipes</em>). This feature is usually found under the name <em>Streaming</em>.</p>
<p>In <a href="http://hadoop.apache.org/">Apache Hadoop</a>, streaming is achieved
with the
<a href="http://hadoop.apache.org/docs/r2.7.4/hadoop-streaming/HadoopStreaming.html#Hadoop_Streaming">Hadoop Streaming</a>
utility. In <a href="https://spark.apache.org/">Apache Spark</a>, streaming is
achieved with the
<a href="https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.rdd.RDD@pipe(command:Seq[String],env:scala.collection.Map[String,String],printPipeContext:(String=%3EUnit)=%3EUnit,printRDDElement:(T,String=%3EUnit)=%3EUnit,separateWorkingDir:Boolean,bufferSize:Int,encoding:String):org.apache.spark.rdd.RDD[String]">RDD.pipe</a>
function. (Not to be confused with
<a href="https://spark.apache.org/streaming/">Spark Streaming</a> which is used
for a different purpose.)</p>
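<p>In all three systems the contract for the external process is the same: records arrive on <code class="language-plaintext highlighter-rouge">stdin</code>, one per line, and results leave on <code class="language-plaintext highlighter-rouge">stdout</code>. A minimal, self-contained sketch of that contract in Python (the filter and sample data are illustrative, not part of any of the libraries above):</p>

```python
def stream_filter(line_iter, transform):
    # Apply `transform` to each record read from the input stream and
    # yield the results: records in, records out, one per line. This is
    # the contract shared by Hadoop Streaming, RDD.pipe, and the SciDB
    # stream plug-in. In a real pipeline, `line_iter` would be
    # sys.stdin and each result would be printed to sys.stdout.
    for line in line_iter:
        yield transform(line.rstrip('\n'))

# Simulate a two-record input stream.
upper_cased = list(stream_filter(iter(['foo\n', 'bar\n']), str.upper))
```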
<p>SciDB provides this ability through the
<a href="https://github.com/Paradigm4/stream/">stream</a> plug-in. Besides the
usual pattern of injecting an external process into the data pipeline,
the SciDB plug-in offers user-friendly and efficient interfaces for
the Python and R languages. Specifically:</p>
<ul>
<li>The external process can be a Python or R script;</li>
<li>The code for the external process does not have to be available on
the SciDB server a priori; instead, it can be sent from the client;</li>
<li>The input and output data can be in the form of Pandas DataFrames in
Python and data frames in R. Internally, the data is transferred
using <a href="https://arrow.apache.org/">Apache Arrow</a> for Python and
a native binary format for R.</li>
</ul>
<p>The following diagram illustrates how streaming works in
SciDB. Notice the green octagons marked with <strong>R</strong>. They represent the
external processes injected into the data pipeline. Data gets
transferred to and from them using <code class="language-plaintext highlighter-rouge">stdin</code> and <code class="language-plaintext highlighter-rouge">stdout</code>
respectively. There are as many instances of the external process as
there are SciDB instances.</p>
<p><img src="https://cloud.githubusercontent.com/assets/2708498/16286948/b4b649d2-38ad-11e6-903f-489fdc532212.png" alt="SciDB Streaming" /></p>
<p>In this post, we explore a few useful patterns of interacting with the
SciDB Stream plug-in from Python. As an example, we use the Python
<a href="http://scikit-learn.org/stable/">scikit-learn</a> machine-learning
library and build a model for the
<a href="https://www.kaggle.com/c/digit-recognizer">Digit Recognizer</a>
data-science competition offered by
<a href="https://www.kaggle.com/">Kaggle</a>. The dataset used in this
competition is the Modified National Institute of Standards and
Technology (MNIST) handwritten image dataset.</p>
<h1 id="setting-up">Setting Up</h1>
<p>First, we need to install the
<a href="https://github.com/Paradigm4/stream/tree/python">stream</a> plug-in for
SciDB. Installation
<a href="https://github.com/Paradigm4/stream/blob/python/README.md#installation">instructions</a>
are available as part of the plug-in
<a href="https://github.com/Paradigm4/stream/blob/python/README.md">README.md</a>
file. The <a href="https://arrow.apache.org/">Apache Arrow</a> library is also
installed at this point. We also use the
<a href="https://github.com/Paradigm4/accelerated_io_tools">accelerated_io_tools</a>
plug-in for loading the data into SciDB. See
<a href="https://github.com/Paradigm4/accelerated_io_tools/blob/master/README.md#installation">here</a>
for the installation instructions. (The <code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> plug-in
was discussed in more detail in an <a href="/2016/05/load-data">earlier post</a>.) Finally, we install the
<a href="https://github.com/Paradigm4/limit">limit</a> plug-in for its ability
to limit the output results. A <a href="https://www.docker.com/">Docker</a> image
file of SciDB and all these plug-ins is available
<a href="https://github.com/rvernica/scidb-examples/tree/master/stream-machine-learning">here</a>.</p>
<p>Since our entire solution is in Python, we need to install a number of
Python packages. Some are only necessary on the machines which run
SciDB (<em>server side</em>) while others are only necessary on the machine
from which we connect to SciDB (<em>client side</em>):</p>
<ul>
<li><a href="https://github.com/Paradigm4/SciDB-Py">SciDB-Py</a>: Python interface
to SciDB (only necessary on the client side);</li>
<li><a href="https://github.com/Paradigm4/stream/tree/python/py_pkg">SciDBStrm</a>:
Python helper package for the SciDB <code class="language-plaintext highlighter-rouge">stream</code> plug-in (necessary on
both client and server side);</li>
<li><a href="http://scikit-learn.org/stable/">scikit-learn</a>: Machine learning in
Python package (only necessary on the server side);</li>
<li><a href="https://www.scipy.org/">SciPy</a>: Fundamental library for scientific
computing in Python (only necessary on the server side).</li>
</ul>
<p>The <a href="http://www.numpy.org/">NumPy</a> and
<a href="http://pandas.pydata.org/">Pandas</a> packages are also installed as
dependencies. All of these packages can be installed using
<a href="https://packaging.python.org/tutorials/installing-packages/">pip</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Client side
</span><span class="n">pip</span> <span class="n">install</span> <span class="n">scidb</span><span class="o">-</span><span class="n">py</span> <span class="n">sklearn</span> <span class="n">scipy</span> <span class="n">dill</span> <span class="n">feather</span><span class="o">-</span><span class="nb">format</span> <span class="n">pandas</span>
<span class="n">pip</span> <span class="n">install</span> <span class="n">git</span><span class="o">+</span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">paradigm4</span><span class="o">/</span><span class="n">stream</span><span class="p">.</span><span class="n">git</span><span class="o">@</span><span class="n">python</span><span class="c1">#subdirectory=py_pkg
</span>
<span class="c1"># Server side
</span><span class="n">pip</span> <span class="n">install</span> <span class="n">dill</span> <span class="n">feather</span><span class="o">-</span><span class="nb">format</span> <span class="n">pandas</span>
<span class="n">pip</span> <span class="n">install</span> <span class="n">git</span><span class="o">+</span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">github</span><span class="p">.</span><span class="n">com</span><span class="o">/</span><span class="n">paradigm4</span><span class="o">/</span><span class="n">stream</span><span class="p">.</span><span class="n">git</span><span class="o">@</span><span class="n">python</span><span class="c1">#subdirectory=py_pkg
</span></code></pre></div></div>
<p>Next, we need to download the train and test
<a href="https://www.kaggle.com/c/digit-recognizer/data">datasets</a> offered by
Kaggle as part of the
<a href="https://www.kaggle.com/c/digit-recognizer">Digit Recognizer</a>
data-science competition. Downloading the datasets might require
creating an account on the Kaggle website. The datasets need to be
copied onto SciDB instance <code class="language-plaintext highlighter-rouge">0</code>.</p>
<h1 id="preparing-the-training-data">Preparing the Training Data</h1>
<p>We start by loading and preprocessing the training data. We load the
<code class="language-plaintext highlighter-rouge">train.csv</code> data file using the <code class="language-plaintext highlighter-rouge">aio_input</code> operator from the
<code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> plug-in. In the <code class="language-plaintext highlighter-rouge">CSV</code> file, each record
contains a label for the image and the color intensity of each
pixel in the image. The data format is described in detail on the
competition
<a href="https://www.kaggle.com/c/digit-recognizer/data">data page</a>. We want
to load the label in one SciDB attribute and all the pixels in a
second SciDB attribute. So, we use the <code class="language-plaintext highlighter-rouge">aio_input</code> operator to
separate the label in one attribute and leave the rest in the <code class="language-plaintext highlighter-rouge">error</code>
field of the operator output. We will parse the <code class="language-plaintext highlighter-rouge">error</code> field in the
next step. The AFL query looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl --no-fetch
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span>
<span class="s">'path=/kaggle/train.csv'</span><span class="p">,</span>
<span class="s">'num_attributes=1'</span><span class="p">,</span>
<span class="s">'attribute_delimiter=,'</span><span class="p">,</span>
<span class="s">'header=1'</span><span class="p">),</span>
<span class="n">train_csv</span><span class="p">);</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
</code></pre></div></div>
<p>The query assumes the Kaggle training data file, <code class="language-plaintext highlighter-rouge">train.csv</code>, is in
the <code class="language-plaintext highlighter-rouge">/kaggle</code> directory on the first SciDB server instance (instance
<code class="language-plaintext highlighter-rouge">0</code>). We can have a look at the resulting array using the <code class="language-plaintext highlighter-rouge">limit</code>
operator:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">limit</span><span class="p">(</span><span class="n">train_csv</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'1'</span><span class="p">,</span><span class="s">'long,0,0,0,...'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'0'</span><span class="p">,</span><span class="s">'long,0,0,0,...'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'1'</span><span class="p">,</span><span class="s">'long,0,0,0,...'</span>
</code></pre></div></div>
<p>As intended, the first attribute <code class="language-plaintext highlighter-rouge">a0</code> contains the record label (<code class="language-plaintext highlighter-rouge">0</code>,
<code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, etc.) while the <code class="language-plaintext highlighter-rouge">error</code> attribute contains the pixel color
intensities (e.g., <code class="language-plaintext highlighter-rouge">0,0,0,...</code>).</p>
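<p>The <code class="language-plaintext highlighter-rouge">error</code> field is simply the unparsed remainder of each CSV record: a leading token followed by the comma-separated pixel values. A standalone, pure-Python sketch of the parsing that the conversion step performs (the sample string is illustrative; the real step uses NumPy):</p>

```python
def parse_error_field(error):
    # An aio_input 'error' value looks like 'long,0,128,255': a leading
    # token followed by the comma-separated pixel intensities. Dropping
    # the first token and packing the rest as unsigned bytes mirrors
    # what the mapping function does with numpy.uint8 and tobytes().
    return bytes(int(v) for v in error.split(',')[1:])

pixels = parse_error_field('long,0,128,255')
```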
<p>Next, we convert the pixel color intensities values from text to
binary. This is the first use of the <code class="language-plaintext highlighter-rouge">stream</code> plug-in. We implement
this step entirely in Python, using the SciDB-Py library (see
<a href="http://paradigm4.github.io/SciDB-Py/">docs</a>). The code for this step
is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">scidbpy</span>
<span class="kn">import</span> <span class="nn">scidbstrm</span>
<span class="k">def</span> <span class="nf">map_to_bin</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="n">df</span><span class="p">[</span><span class="s">'a0'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'a0'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'error'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'error'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">numpy</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">','</span><span class="p">)[</span><span class="mi">1</span><span class="p">:])),</span>
<span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="p">.</span><span class="n">uint8</span><span class="p">).</span><span class="n">tobytes</span><span class="p">())</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">db</span> <span class="o">=</span> <span class="n">scidbpy</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>
<span class="n">ar_fun</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="nb">input</span><span class="p">(</span><span class="n">upload_data</span><span class="o">=</span><span class="n">scidbstrm</span><span class="p">.</span><span class="n">pack_func</span><span class="p">(</span><span class="n">map_to_bin</span><span class="p">)).</span><span class="n">store</span><span class="p">()</span>
<span class="n">que</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">stream</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">train_csv</span><span class="p">,</span>
<span class="n">scidbstrm</span><span class="p">.</span><span class="n">python_map</span><span class="p">,</span>
<span class="s">"'format=feather'"</span><span class="p">,</span>
<span class="s">"'types=int64,binary'"</span><span class="p">,</span>
<span class="s">"'names=label,img'"</span><span class="p">,</span>
<span class="s">'_sg({}, 0)'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ar_fun</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">).</span><span class="n">store</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">train_bin</span><span class="p">)</span>
</code></pre></div></div>
<p>The code has three parts:</p>
<ol>
<li>Declare the mapping function <code class="language-plaintext highlighter-rouge">map_to_bin</code>;</li>
<li>Upload the code of the mapping function to a temporary array in SciDB;</li>
<li>Use the <code class="language-plaintext highlighter-rouge">stream</code> operator to apply the mapping function to each
record in the <code class="language-plaintext highlighter-rouge">train_csv</code> array. Store the result in the
<code class="language-plaintext highlighter-rouge">train_bin</code> array.</li>
</ol>
<p>The mapping function takes as input a DataFrame with a chunk of
records from the <code class="language-plaintext highlighter-rouge">train_csv</code> array. The function uses the
<a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html">pandas.Series.map</a>
function, which applies a <em>map</em> function to a DataFrame column. It
first converts the label value, in the <code class="language-plaintext highlighter-rouge">a0</code> column, from string to
integer. It then converts the pixel color intensities, in the <code class="language-plaintext highlighter-rouge">error</code>
column, from string to binary. The lambda function provided parses the
pixel color intensities and stores the result into a NumPy array. The
binary representation of the NumPy array is stored back into the
DataFrame. The mapping function is serialized using the <code class="language-plaintext highlighter-rouge">pack_func</code>
function provided by the SciDBStrm library (see
<a href="https://github.com/Paradigm4/stream/tree/python/py_pkg#scidb-strm-python-api-and-examples">docs</a>)
and uploaded to a temporary array in SciDB. This array will be removed
by the SciDB-Py library when the Python interpreter exits.</p>
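<p>Conceptually, <code class="language-plaintext highlighter-rouge">pack_func</code> turns a Python function into bytes that can be stored in a SciDB array and shipped to the server; it relies on the <code class="language-plaintext highlighter-rouge">dill</code> package installed earlier, which serializes the function body itself so the server never needs the source. A hedged, stdlib-only sketch of the same round trip (using <code class="language-plaintext highlighter-rouge">marshal</code> as a bare-bones stand-in for dill; <code class="language-plaintext highlighter-rouge">map_stub</code> is a hypothetical example function, not part of the post's pipeline):</p>

```python
import base64
import marshal
import types

def map_stub(x):
    # Stand-in for a mapping function like map_to_bin.
    return int(x)

# Serialize the function's code object into transportable bytes, as
# pack_func does (pack_func uses dill, which additionally handles
# closures and module references; marshal here is illustration only).
packed = base64.b64encode(marshal.dumps(map_stub.__code__))

# On the receiving side, rebuild a callable from the bytes.
restored = types.FunctionType(
    marshal.loads(base64.b64decode(packed)), globals())
```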
<p>In the <code class="language-plaintext highlighter-rouge">stream</code> operator, the mapping function is provided using the
<code class="language-plaintext highlighter-rouge">_sg</code> operator. This operator can pass a second array as an argument
to the <code class="language-plaintext highlighter-rouge">stream</code> operator. The <code class="language-plaintext highlighter-rouge">0</code> argument in the <code class="language-plaintext highlighter-rouge">_sg</code> operator
instructs SciDB to copy the array to every instance. The script to be
executed by the <code class="language-plaintext highlighter-rouge">stream</code> operator is provided in the second argument,
and, in this case, it is set to <code class="language-plaintext highlighter-rouge">scidbstrm.python_map</code>. With this
argument, a standard Python invocation is used (see
<a href="https://github.com/Paradigm4/stream/tree/python/py_pkg#scidb-strm-python-api-and-examples">docs</a>). The
code loads the mapping function provided in the second array argument
and applies it to the first array argument. The output array attribute
types and names are provided as part of the <code class="language-plaintext highlighter-rouge">stream</code> arguments as
well. In this case, the output attribute types are <code class="language-plaintext highlighter-rouge">int64</code> and
<code class="language-plaintext highlighter-rouge">binary</code>. The resulting array looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">limit</span><span class="p">(</span><span class="n">train_bin</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="p">{</span><span class="n">instance_id</span><span class="p">,</span><span class="n">chunk_no</span><span class="p">,</span><span class="n">value_no</span><span class="p">}</span> <span class="n">label</span><span class="p">,</span><span class="n">img</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
</code></pre></div></div>
<p>Each record stores the image label as an integer and the pixel color
intensities as the binary representation of a NumPy array. The image of a
record can be displayed in IPython using:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">matplotlib</span> <span class="n">gtk</span> <span class="c1"># replace gtk with a back-end available for your platform
</span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span>
<span class="n">plt</span> <span class="o">=</span> <span class="n">matplotlib</span><span class="p">.</span><span class="n">pyplot</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span>
<span class="n">numpy</span><span class="p">.</span><span class="n">frombuffer</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">limit</span><span class="p">(</span><span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">train_bin</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="s">'img'</span><span class="p">][</span><span class="s">'val'</span><span class="p">],</span>
<span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="p">.</span><span class="n">uint8</span><span class="p">).</span><span class="n">reshape</span><span class="p">((</span><span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">)),</span>
<span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/img/posts/matplotlib-gray.jpg" alt="Matplotlib screenshot" /></p>
<p>Finally, we convert the training data images from
grayscale to black and white. We do this with the help of the <code class="language-plaintext highlighter-rouge">stream</code>
operator using a similar pattern as before:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">map_to_bw</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">numpy</span>
<span class="k">def</span> <span class="nf">bin_to_bw</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
<span class="n">img_ar</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="p">.</span><span class="n">uint8</span><span class="p">).</span><span class="n">copy</span><span class="p">()</span>
<span class="n">img_ar</span><span class="p">[</span><span class="n">img_ar</span> <span class="o">></span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">img_ar</span><span class="p">.</span><span class="n">tobytes</span><span class="p">()</span>
<span class="n">df</span><span class="p">[</span><span class="s">'img'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'img'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="n">bin_to_bw</span><span class="p">)</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">que</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">iquery</span><span class="p">(</span><span class="s">"""
store(
stream(
train_bin,
{script},
'format=feather',
'types=int64,binary',
'names=label,img',
_sg(
input(
{{sch}},
'{{fn}}',
0,
'{{fmt}}'),
0)),
train_bw)"""</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">script</span><span class="o">=</span><span class="n">scidbstrm</span><span class="p">.</span><span class="n">python_map</span><span class="p">),</span>
<span class="n">upload_data</span><span class="o">=</span><span class="n">scidbstrm</span><span class="p">.</span><span class="n">pack_func</span><span class="p">(</span><span class="n">map_to_bw</span><span class="p">))</span>
</code></pre></div></div>
<p>As before, the <code class="language-plaintext highlighter-rouge">stream</code> mapping function leverages the
<code class="language-plaintext highlighter-rouge">pandas.Series.map</code> function. The function provided to
<code class="language-plaintext highlighter-rouge">pandas.Series.map</code> converts the binary value stored in the <code class="language-plaintext highlighter-rouge">img</code>
column to a NumPy array and applies the black and white
thresholding. The binary representation of the modified NumPy array is
stored back into the DataFrame.</p>
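<p>The thresholding itself is the only new logic in this step. A standalone, pure-Python sketch of the same conversion, without pandas or NumPy (the sample bytes are illustrative):</p>

```python
def to_black_and_white(img):
    # Clamp grayscale pixel bytes to {0, 1}: any intensity above 1
    # becomes 1, exactly what bin_to_bw does with
    # img_ar[img_ar > 1] = 1 on the NumPy array.
    return bytes(1 if pixel > 1 else pixel for pixel in img)

bw = to_black_and_white(b'\x00\x01\x80\xff')
```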
<p>For calling the <code class="language-plaintext highlighter-rouge">stream</code> operator, we use the <code class="language-plaintext highlighter-rouge">iquery</code>
function. This pattern gives us more flexibility, but it is not as
user-friendly as the <code class="language-plaintext highlighter-rouge">db.stream</code> approach. Moreover, we avoid creating
a temporary array for the mapping function by combining the <code class="language-plaintext highlighter-rouge">input</code>
operator with the <code class="language-plaintext highlighter-rouge">_sg</code> operator. Avoiding the temporary
array can improve performance. The resulting array is
very similar to the previous one:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">limit</span><span class="p">(</span><span class="n">train_bw</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="p">{</span><span class="n">instance_id</span><span class="p">,</span><span class="n">chunk_no</span><span class="p">,</span><span class="n">value_no</span><span class="p">}</span> <span class="n">label</span><span class="p">,</span><span class="n">img</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
</code></pre></div></div>
<p>The image of a record can be displayed in IPython as before:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span> <span class="o">=</span> <span class="n">matplotlib</span><span class="p">.</span><span class="n">pyplot</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span>
<span class="n">numpy</span><span class="p">.</span><span class="n">frombuffer</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">limit</span><span class="p">(</span><span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">train_bw</span><span class="p">,</span> <span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="s">'img'</span><span class="p">][</span><span class="s">'val'</span><span class="p">],</span>
<span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="p">.</span><span class="n">uint8</span><span class="p">).</span><span class="n">reshape</span><span class="p">((</span><span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">)),</span>
<span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/img/posts/matplotlib-bw.jpg" alt="Matplotlib screenshot" /></p>
<h1 id="training-a-model">Training a Model</h1>
<p>We now train a model on the data we just uploaded. We train the model
in SciDB using the <code class="language-plaintext highlighter-rouge">stream</code> plug-in. Since SciDB is a distributed
database, we first train a separate model in parallel on each SciDB
instance and then merge all the per-instance models into a single global
model. For the instance models, we use the
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html">stochastic gradient descent</a>
classifier available in the scikit-learn library. One reason we choose
this classifier is its ability to train incrementally on partial
data. This is useful because the data is streamed one chunk at a time
and is not available all at once. We invoke the classifier’s <code class="language-plaintext highlighter-rouge">partial_fit</code> (see
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier.partial_fit">docs</a>)
function for each chunk. The Python code for training the instance
models is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Train</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'img'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">numpy</span><span class="p">.</span><span class="n">frombuffer</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="p">.</span><span class="n">uint8</span><span class="p">))</span>
<span class="n">Train</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">partial_fit</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">img</span><span class="p">.</span><span class="n">tolist</span><span class="p">()),</span>
<span class="n">df</span><span class="p">[</span><span class="s">'label'</span><span class="p">],</span>
<span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
<span class="n">Train</span><span class="p">.</span><span class="n">count</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">finalize</span><span class="p">():</span>
<span class="k">if</span> <span class="n">Train</span><span class="p">.</span><span class="n">count</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="n">buf</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span>
<span class="n">sklearn</span><span class="p">.</span><span class="n">externals</span><span class="p">.</span><span class="n">joblib</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">Train</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="n">buf</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pandas</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span>
<span class="s">'count'</span><span class="p">:</span> <span class="p">[</span><span class="n">Train</span><span class="p">.</span><span class="n">count</span><span class="p">],</span>
<span class="s">'model'</span><span class="p">:</span> <span class="p">[</span><span class="n">buf</span><span class="p">.</span><span class="n">getvalue</span><span class="p">()]})</span>
<span class="n">ar_fun</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="nb">input</span><span class="p">(</span><span class="n">upload_data</span><span class="o">=</span><span class="n">scidbstrm</span><span class="p">.</span><span class="n">pack_func</span><span class="p">(</span><span class="n">Train</span><span class="p">)).</span><span class="n">store</span><span class="p">()</span>
<span class="n">python_run</span> <span class="o">=</span> <span class="s">"""'python -uc "
import io
import numpy
import pandas
import scidbstrm
import sklearn.externals
import sklearn.linear_model
Train = scidbstrm.read_func()
Train.model = sklearn.linear_model.SGDClassifier()
scidbstrm.map(Train.map, Train.finalize)
"'"""</span>
<span class="n">que</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">stream</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">train_bin</span><span class="p">,</span>
<span class="n">python_run</span><span class="p">,</span>
<span class="s">"'format=feather'"</span><span class="p">,</span>
<span class="s">"'types=int64,binary'"</span><span class="p">,</span>
<span class="s">"'names=count,model'"</span><span class="p">,</span>
<span class="s">'_sg({}, 0)'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ar_fun</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">).</span><span class="n">store</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">model</span><span class="p">)</span>
</code></pre></div></div>
<p>The code structure is similar to before, except that instead of a
mapping function, we provide a “mapping” class. We do this for two
reasons:</p>
<ol>
<li>We want to provide both a <em>map</em> and a <em>finalize</em> function. The map
function is called for each chunk. The finalize function is called
once all the chunks have been processed.</li>
<li>We want to have two static class variables to keep track of the
model and a count of how many training records were used. (This
could have been global variables in the script we provide to
<code class="language-plaintext highlighter-rouge">stream</code>, but it is more intuitive to have them as class
variables.)</li>
</ol>
<p>The <code class="language-plaintext highlighter-rouge">map</code> function uses the <code class="language-plaintext highlighter-rouge">pandas.Series.map</code> function to convert
the values in the <code class="language-plaintext highlighter-rouge">img</code> column from binary to NumPy arrays. We then use the
NumPy array (reshaped as a matrix) and the <code class="language-plaintext highlighter-rouge">label</code> column to train a
partial model. We also keep track of how many records we used to
train. The map function does not return any data back to SciDB.</p>
<p>The <code class="language-plaintext highlighter-rouge">finalize</code> function first checks whether any training data was fed
into the model. Remember that this code runs independently on each
SciDB instance and some instances might have no training data. If
training data was provided, the model is serialized using the <code class="language-plaintext highlighter-rouge">dump</code>
function (see
<a href="https://pythonhosted.org/joblib/generated/joblib.dump.html#joblib.dump">docs</a>)
and returned to SciDB along with the training records count.</p>
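<p>The serialize-to-buffer round trip that <code class="language-plaintext highlighter-rouge">finalize</code> relies on can be sketched with the standard-library <code class="language-plaintext highlighter-rouge">pickle</code> module. (The post uses <code class="language-plaintext highlighter-rouge">sklearn.externals.joblib</code>, which newer scikit-learn versions expose as the standalone <code class="language-plaintext highlighter-rouge">joblib</code> package; the pattern is the same.)</p>

```python
import io
import pickle

# Stand-in for a trained model object (hypothetical; any picklable
# object serializes the same way).
model = {'coef': [0.1, 0.2], 'classes': list(range(10))}

# Dump into an in-memory buffer; the raw bytes are what gets returned
# to SciDB as a binary attribute.
buf = io.BytesIO()
pickle.dump(model, buf)
blob = buf.getvalue()

# Later, the bytes can be loaded back into an equivalent object.
restored = pickle.load(io.BytesIO(blob))
print(restored == model)   # True
```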
<p>For this step, we provide a customized Python script to the <code class="language-plaintext highlighter-rouge">stream</code>
operator. The script is stored in the <code class="language-plaintext highlighter-rouge">python_run</code> variable; it
invokes the Python interpreter, passing the Python code for this step as a
command-line argument. After importing the required modules, we
deserialize the <code class="language-plaintext highlighter-rouge">Train</code> class (provided to the <code class="language-plaintext highlighter-rouge">stream</code> operator as
the second argument using the <code class="language-plaintext highlighter-rouge">_sg</code> operator) using the
<code class="language-plaintext highlighter-rouge">scidbstrm.read_func</code> function (see
<a href="https://github.com/Paradigm4/stream/tree/python/py_pkg#scidb-strm-python-api-and-examples">docs</a>). We
then initialize our model and apply the <code class="language-plaintext highlighter-rouge">map</code> and <code class="language-plaintext highlighter-rouge">finalize</code> functions
on the streamed data using the <code class="language-plaintext highlighter-rouge">scidbstrm.map</code> function (see
<a href="https://github.com/Paradigm4/stream/tree/python/py_pkg#scidb-strm-python-api-and-examples">docs</a>). The
resulting array looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">scan</span><span class="p">(</span><span class="n">model</span><span class="p">);</span>
<span class="p">{</span><span class="n">instance_id</span><span class="p">,</span><span class="n">chunk_no</span><span class="p">,</span><span class="n">value_no</span><span class="p">}</span> <span class="n">count</span><span class="p">,</span><span class="n">model</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">22949</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">19051</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
</code></pre></div></div>
<p>Our setup consists of two SciDB instances, each holding approximately
<code class="language-plaintext highlighter-rouge">50%</code> of the training data. As a result, two models have been trained
in parallel, one on each instance.</p>
<p>Next, we merge the instance models into a single global model. For the
global model, we choose the
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html">voting</a>
classifier for its ability to combine multiple trained models. The
Python code for this step is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="nf">merge_models</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">pandas</span>
<span class="kn">import</span> <span class="nn">sklearn.ensemble</span>
<span class="kn">import</span> <span class="nn">sklearn.externals</span>
<span class="kn">import</span> <span class="nn">sklearn.preprocessing</span>
<span class="n">estimators</span> <span class="o">=</span> <span class="p">[</span><span class="n">sklearn</span><span class="p">.</span><span class="n">externals</span><span class="p">.</span><span class="n">joblib</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">(</span><span class="n">byt</span><span class="p">))</span>
<span class="k">for</span> <span class="n">byt</span> <span class="ow">in</span> <span class="n">df</span><span class="p">[</span><span class="s">'model'</span><span class="p">]]</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">estimators</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="n">labelencoder</span> <span class="o">=</span> <span class="n">sklearn</span><span class="p">.</span><span class="n">preprocessing</span><span class="p">.</span><span class="n">LabelEncoder</span><span class="p">()</span>
<span class="n">labelencoder</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">sklearn</span><span class="p">.</span><span class="n">ensemble</span><span class="p">.</span><span class="n">VotingClassifier</span><span class="p">(())</span>
<span class="n">model</span><span class="p">.</span><span class="n">estimators_</span> <span class="o">=</span> <span class="n">estimators</span>
<span class="n">model</span><span class="p">.</span><span class="n">le_</span> <span class="o">=</span> <span class="n">labelencoder</span>
<span class="n">buf</span> <span class="o">=</span> <span class="n">io</span><span class="p">.</span><span class="n">BytesIO</span><span class="p">()</span>
<span class="n">sklearn</span><span class="p">.</span><span class="n">externals</span><span class="p">.</span><span class="n">joblib</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">buf</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pandas</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">'count'</span><span class="p">:</span> <span class="n">df</span><span class="p">.</span><span class="nb">sum</span><span class="p">()[</span><span class="s">'count'</span><span class="p">],</span>
<span class="s">'model'</span><span class="p">:</span> <span class="p">[</span><span class="n">buf</span><span class="p">.</span><span class="n">getvalue</span><span class="p">()]})</span>
<span class="n">ar_fun</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="nb">input</span><span class="p">(</span><span class="n">upload_data</span><span class="o">=</span><span class="n">scidbstrm</span><span class="p">.</span><span class="n">pack_func</span><span class="p">(</span><span class="n">merge_models</span><span class="p">)).</span><span class="n">store</span><span class="p">()</span>
<span class="n">que</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">redimension</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
<span class="s">'<count:int64, model:binary> [i]'</span>
<span class="p">).</span><span class="n">stream</span><span class="p">(</span>
<span class="n">scidbstrm</span><span class="p">.</span><span class="n">python_map</span><span class="p">,</span>
<span class="s">"'format=feather'"</span><span class="p">,</span>
<span class="s">"'types=int64,binary'"</span><span class="p">,</span>
<span class="s">"'names=count,model'"</span><span class="p">,</span>
<span class="s">'_sg({}, 0)'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ar_fun</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">).</span><span class="n">store</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">model_final</span><span class="p">)</span>
</code></pre></div></div>
<p>We use the standard pattern of providing a mapping function, called
<code class="language-plaintext highlighter-rouge">merge_models</code>. In this function, we first deserialize the instance
models. We also prepare a label encoder with <code class="language-plaintext highlighter-rouge">10</code> labels, one for
each digit. Once we have this, we can instantiate the voting
classifier and set its estimators and label encoder. The classifier is
then serialized and returned to SciDB along with a total count of the
training records.</p>
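<p>At prediction time, the merged hard-voting classifier simply lets each per-instance model vote and takes the majority label. A toy sketch of that combination step, with three hypothetical stand-in models voting on four samples:</p>

```python
import numpy

# Each row holds one stand-in model's predicted labels for four samples.
votes = numpy.array([[3, 1, 9, 4],   # model 0
                     [3, 2, 9, 4],   # model 1
                     [3, 1, 4, 4]])  # model 2

def majority(column):
    """Return the most frequent label in a column of votes."""
    values, counts = numpy.unique(column, return_counts=True)
    return values[counts.argmax()]

merged = [int(majority(votes[:, j])) for j in range(votes.shape[1])]
print(merged)   # [3, 1, 9, 4]
```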
<p>Before calling the <code class="language-plaintext highlighter-rouge">stream</code> operator, we need to make sure that all
the instance models are at one SciDB instance, in one chunk. This is
done using the <code class="language-plaintext highlighter-rouge">redimension</code> operator (see
<a href="https://paradigm4.atlassian.net/wiki/spaces/ESD169/pages/50856222/redimension">docs</a>). Getting
all the instance models at one instance in one chunk does not present
a scalability challenge since the number of models equals the number
of SciDB instances. The resulting array contains one trained model and
a training record count equal to the size of the training data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">scan</span><span class="p">(</span><span class="n">model_final</span><span class="p">);</span>
<span class="p">{</span><span class="n">instance_id</span><span class="p">,</span><span class="n">chunk_no</span><span class="p">,</span><span class="n">value_no</span><span class="p">}</span> <span class="n">count</span><span class="p">,</span><span class="n">model</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">42000</span><span class="p">,</span><span class="o"><</span><span class="n">binary</span><span class="o">></span>
</code></pre></div></div>
<h1 id="making-predictions">Making Predictions</h1>
<p>Now that we have a trained model, we are ready to make predictions on
the test data. We start by loading the test data in SciDB:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl --no-fetch
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="nb">input</span><span class="p">(</span>
<span class="o"><</span><span class="n">img</span><span class="p">:</span><span class="n">string</span><span class="o">></span><span class="p">[</span><span class="n">ImageID</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="o">*</span><span class="p">],</span>
<span class="s">'/kaggle/test.csv'</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="s">'csv:lt'</span><span class="p">),</span>
<span class="n">test_csv</span><span class="p">);</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
</code></pre></div></div>
<p>The query assumes the Kaggle test data file, <code class="language-plaintext highlighter-rouge">test.csv</code>, is in the
<code class="language-plaintext highlighter-rouge">/kaggle</code> directory on the first SciDB server instance (instance
<code class="language-plaintext highlighter-rouge">0</code>). We use the <code class="language-plaintext highlighter-rouge">input</code> operator (see
<a href="https://paradigm4.atlassian.net/wiki/spaces/ESD169/pages/50856232/input">docs</a>)
as opposed to the <code class="language-plaintext highlighter-rouge">aio_input</code> operator we used earlier because <code class="language-plaintext highlighter-rouge">input</code>
assigns a sequential number (captured by <code class="language-plaintext highlighter-rouge">ImageID</code>) to each line of
the input file. We need this because there are no explicit image IDs
in the data file. Using <code class="language-plaintext highlighter-rouge">t</code> in the <code class="language-plaintext highlighter-rouge">csv:lt</code> format specifier, we
instruct SciDB to look for <em>TAB</em> as the field delimiter. Since our
actual field delimiter is the <em>comma</em>, we force all the pixel color
intensities into a single attribute, <code class="language-plaintext highlighter-rouge">img</code>. The resulting array looks
like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">limit</span><span class="p">(</span><span class="n">test_csv</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="p">{</span><span class="n">ImageID</span><span class="p">}</span> <span class="n">img</span>
<span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="s">'0,0,0,...'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="s">'0,0,0,...'</span>
<span class="p">{</span><span class="mi">3</span><span class="p">}</span> <span class="s">'0,0,0,...'</span>
</code></pre></div></div>
<p>We use the <code class="language-plaintext highlighter-rouge">stream</code> operator and the model already stored in SciDB to
make predictions. The Python code for this step is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">Predict</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="bp">None</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">csv_to_bw</span><span class="p">(</span><span class="n">csv</span><span class="p">):</span>
<span class="n">img_ar</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">csv</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">','</span><span class="p">)),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">numpy</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
<span class="n">img_ar</span><span class="p">[</span><span class="n">img_ar</span> <span class="o">></span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">img_ar</span>
<span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">img</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'img'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="n">Predict</span><span class="p">.</span><span class="n">csv_to_bw</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'img'</span><span class="p">]</span> <span class="o">=</span> <span class="n">Predict</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">numpy</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">img</span><span class="p">.</span><span class="n">tolist</span><span class="p">()))</span>
<span class="k">return</span> <span class="n">df</span>
<span class="n">ar_fun</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="nb">input</span><span class="p">(</span>
<span class="n">upload_data</span><span class="o">=</span><span class="n">scidbstrm</span><span class="p">.</span><span class="n">pack_func</span><span class="p">(</span><span class="n">Predict</span><span class="p">)</span>
<span class="p">).</span><span class="n">cross_join</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">model_final</span>
<span class="p">).</span><span class="n">store</span><span class="p">()</span>
<span class="n">python_run</span> <span class="o">=</span> <span class="s">"""'python -uc "
import dill
import io
import numpy
import scidbstrm
import sklearn.externals
df = scidbstrm.read()
Predict = dill.loads(df.iloc[0, 0])
Predict.model = sklearn.externals.joblib.load(io.BytesIO(df.iloc[0, 2]))
scidbstrm.write()
scidbstrm.map(Predict.map)
"'"""</span>
<span class="n">que</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="nb">apply</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">test_csv</span><span class="p">,</span>
<span class="s">'ImageID'</span><span class="p">,</span>
<span class="s">'ImageID'</span>
<span class="p">).</span><span class="n">stream</span><span class="p">(</span>
<span class="n">python_run</span><span class="p">,</span>
<span class="s">"'format=feather'"</span><span class="p">,</span>
<span class="s">"'types=int64,int64'"</span><span class="p">,</span>
<span class="s">"'names=Label,ImageID'"</span><span class="p">,</span>
<span class="s">'_sg({}, 0)'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">ar_fun</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
<span class="p">).</span><span class="n">store</span><span class="p">(</span>
<span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">predict_test</span><span class="p">)</span>
</code></pre></div></div>
<p>We define a “mapping” class in order to have a static class variable
for storing the trained model (instantiated during streaming). The
class provides a transformation function (<code class="language-plaintext highlighter-rouge">csv_to_bw</code>) to convert a
test record from text to a NumPy array and to apply the black-and-white
thresholding. (These transformations are identical to the ones used
for the training data earlier.) The actual map function first applies
the transformation function to each record of the input
DataFrame. Next, it predicts the labels for each of the images (after
first stacking the converted image column into a
matrix). The labeled data is returned to SciDB.</p>
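<p>A standalone sketch of the <code class="language-plaintext highlighter-rouge">csv_to_bw</code> conversion on a hypothetical four-pixel record. Note that the original code targets Python 2; under Python 3, <code class="language-plaintext highlighter-rouge">map()</code> returns an iterator, so it must be wrapped in <code class="language-plaintext highlighter-rouge">list(...)</code> before being handed to <code class="language-plaintext highlighter-rouge">numpy.array</code>:</p>

```python
import numpy

# One hypothetical test record: comma-separated pixel intensities.
csv = '0,128,255,0'

# Parse to uint8, then threshold to black (0) / white (1), as csv_to_bw
# does. list(...) is required on Python 3, where map() is lazy.
img_ar = numpy.array(list(map(int, csv.split(','))), dtype=numpy.uint8)
img_ar[img_ar > 1] = 1
print(img_ar.tolist())   # [0, 1, 1, 0]
```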
<p>In order to have both the “mapping” class we just defined and the
model trained earlier available in the <code class="language-plaintext highlighter-rouge">stream</code> operator, we store
both of them, as two attributes in a record in a temporary array. This
temporary array is fed into the <code class="language-plaintext highlighter-rouge">stream</code> operator as the second
argument array (using the <code class="language-plaintext highlighter-rouge">_sg</code> operator).</p>
<p>The Python script to be used by the <code class="language-plaintext highlighter-rouge">stream</code> operator is provided
explicitly and stored in the <code class="language-plaintext highlighter-rouge">python_run</code> variable. The script starts
by reading the first available chunk. The <code class="language-plaintext highlighter-rouge">read</code> function is part of
the low-level <code class="language-plaintext highlighter-rouge">SciDBStrm</code> API which allows us to read chunks of data
from SciDB (see
<a href="https://github.com/Paradigm4/stream/tree/python/py_pkg#scidb-strm-python-api-and-examples">docs</a>). This
first chunk is the chunk containing the second argument array we are
providing using the <code class="language-plaintext highlighter-rouge">_sg</code> operator. From this chunk, we extract the
“mapping” class and store it in <code class="language-plaintext highlighter-rouge">Predict</code>. The “mapping” class is the
first attribute of the first record of the chunk, and we use
<code class="language-plaintext highlighter-rouge">iloc[0, 0]</code> to extract it. We also extract the trained model and
store it in <code class="language-plaintext highlighter-rouge">Predict.model</code>. The model is the third attribute, after
the <code class="language-plaintext highlighter-rouge">cross_join</code>, and we use <code class="language-plaintext highlighter-rouge">iloc[0, 2]</code> to extract it. Since we are
using the low-level streaming API we need to make an explicit call to
<code class="language-plaintext highlighter-rouge">write</code> (this tells SciDB that we have no output for this first chunk
of data). The script then proceeds by applying the <code class="language-plaintext highlighter-rouge">Predict.map</code>
function to the streamed data. Once the <code class="language-plaintext highlighter-rouge">stream</code> operator is executed,
the resulting array looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">limit</span><span class="p">(</span><span class="n">predict_test</span><span class="p">,</span> <span class="mi">3</span><span class="p">);</span>
<span class="p">{</span><span class="n">instance_id</span><span class="p">,</span><span class="n">chunk_no</span><span class="p">,</span><span class="n">value_no</span><span class="p">}</span> <span class="n">Label</span><span class="p">,</span><span class="n">ImageID</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">2</span><span class="p">,</span><span class="mi">1</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">2</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">9</span><span class="p">,</span><span class="mi">3</span>
</code></pre></div></div>
<p>In our final step, we fetch the array containing the predictions to the
client and save it in the format expected by Kaggle. This
can be done in Python using:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">db</span><span class="p">.</span><span class="n">arrays</span><span class="p">.</span><span class="n">predict_test</span><span class="p">.</span><span class="n">fetch</span><span class="p">(</span><span class="n">as_dataframe</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'ImageID'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'ImageID'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s">'Label'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Label'</span><span class="p">].</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'results.csv'</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">columns</span><span class="o">=</span><span class="p">(</span><span class="s">'ImageID'</span><span class="p">,</span> <span class="s">'Label'</span><span class="p">))</span>
</code></pre></div></div>
<p>The resulting file, <code class="language-plaintext highlighter-rouge">results.csv</code>, can be directly uploaded to Kaggle
for scoring. We also made predictions for the training data and
plotted the true labels against the predicted labels on a scatter plot,
randomly jittering each point so that identical points do not completely overlap.</p>
<p><img src="/assets/img/posts/predict-result.jpg" alt="True vs. predicted labels" /></p>
<p>As we can see, most of the points are on the main diagonal, meaning
that most of the data is labeled correctly. Moreover, there is a
visible hot spot in the <code class="language-plaintext highlighter-rouge">(9, 4)</code> area, meaning that a lot of images of
<code class="language-plaintext highlighter-rouge">9</code> are labeled as <code class="language-plaintext highlighter-rouge">4</code> by our model. The code for generating this plot
is included in the
<a href="https://github.com/Paradigm4/stream/blob/python/py_pkg/examples/4-machine-learning.py">full script</a>.</p>
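<p>The jitter itself amounts to adding small uniform noise to each integer (true, predicted) label pair so that identical points spread into visible clusters. A sketch of the idea, with hypothetical label arrays (the full script contains the actual plotting code):</p>

```python
import numpy

rng = numpy.random.RandomState(0)          # fixed seed for reproducibility
true = numpy.array([9, 9, 9, 4, 4])        # hypothetical true labels
pred = numpy.array([4, 9, 4, 4, 4])        # hypothetical predicted labels

# Offset each point by at most 0.3 in each direction; points stay well
# within half a unit of their labels, so clusters remain distinct.
x = true + rng.uniform(-0.3, 0.3, size=true.size)
y = pred + rng.uniform(-0.3, 0.3, size=pred.size)
print(bool(numpy.abs(x - true).max() < 0.5))   # True
```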
<h1 id="summary">Summary</h1>
<p>To summarize, in this post we looked at how we can use the <code class="language-plaintext highlighter-rouge">stream</code>
operator to train machine-learning models in parallel in SciDB. As an
example, we used the Python scikit-learn machine-learning library and
trained a model for a data-science competition on Kaggle. The patterns
we discussed were:</p>
<ul>
<li>Use the <code class="language-plaintext highlighter-rouge">stream</code> operator with a mapping function;</li>
<li>Use the <code class="language-plaintext highlighter-rouge">stream</code> operator with a “mapping” class;</li>
<li>Provide a custom script to the <code class="language-plaintext highlighter-rouge">stream</code> operator;</li>
<li>Provide multiple values as second array arguments to the <code class="language-plaintext highlighter-rouge">stream</code> operator;</li>
<li>Store and retrieve NumPy arrays as binary values in SciDB;</li>
<li>Store and retrieve serialized machine-learning models in SciDB.</li>
</ul>
<p>The code discussed here is available as a
<a href="https://github.com/Paradigm4/stream/blob/python/py_pkg/examples/4-machine-learning.py">single script</a>
as part of the SciDBStrm Python package. A
<a href="https://www.docker.com/">Docker</a> image file of SciDB and required
plug-ins for running this script is available
<a href="https://github.com/rvernica/scidb-examples/tree/master/stream-machine-learning">here</a>.</p>Rares VernicaPopular data processing platforms offer users the ability to inject an external process into the data processing pipeline. The data flowing through the data pipeline is fed as input to the external process, while the output produced by the process is fed back into the pipeline. The external process runs an executable or a script. This pattern resembles the popular Unix pipelines (or pipes). This feature is usually found under the name of Streaming.Debian based Docker container for SciDB2016-11-01T00:00:00+00:002016-11-01T00:00:00+00:00http://rvernica.github.io/2016/11/docker-debian<p>In an <a href="/2016/06/docker-image">earlier post</a>, we looked at how to create a Docker image for SciDB. The image built in that post followed the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Community+Edition+Installation+Guide">SciDB Community Edition Installation Guide</a> very closely. The image is functional and a good learning resource, but not very efficient. The image uses around <code class="language-plaintext highlighter-rouge">6GB</code> of space and cannot be built automatically on <a href="https://hub.docker.com/">Docker Hub</a> due to the long build time. In this post, we revisit this topic and try to build a more efficient Docker image for SciDB. The source files for the image are available on GitHub <a href="https://github.com/rvernica/docker-library/tree/master/scidb">here</a>. The image is available on Docker Hub <a href="https://hub.docker.com/r/rvernica/scidb/">here</a>.</p>
<p>Note: The Docker image described in this post is for SciDB <code class="language-plaintext highlighter-rouge">15.12</code> and for a single-node installation. The GitHub and Docker Hub repositories also contain images for SciDB <code class="language-plaintext highlighter-rouge">15.7</code>.</p>
<h1 id="docker-image-considerations">Docker Image Considerations</h1>
<p>To build a smaller, more space-efficient Docker image, we do the following:</p>
<ul>
<li>Start from a small base image. In our case, we replace Ubuntu with Debian;</li>
<li>Chain related commands under one Docker statement;</li>
<li>Install the minimum required packages and clean up after the package manager.</li>
</ul>
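<p>The second point, chaining related commands, matters because each <code class="language-plaintext highlighter-rouge">RUN</code> statement creates a new image layer, and files deleted in a later layer still occupy space in the earlier one. The difference can be sketched with a toy dockerfile fragment (<code class="language-plaintext highlighter-rouge">some-package</code> is a hypothetical package name):</p>

```dockerfile
## Anti-pattern: three RUN statements create three layers; the package
## lists downloaded by the first still take up space in the final image
## even though the third deletes them.
# RUN apt-get update
# RUN apt-get install --assume-yes some-package
# RUN rm -rf /var/lib/apt/lists/*

## Preferred: chain the related commands so the package lists are
## removed in the same layer that created them.
RUN apt-get update && \
    apt-get install --assume-yes --no-install-recommends some-package && \
    rm -rf /var/lib/apt/lists/*
```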
<p>We also want to be able to build this image automatically in <a href="https://hub.docker.com/r/rvernica/scidb/">Docker Hub</a>. Docker Hub limits the build to one CPU core and two hours of running time. Since installing all the dependencies and building SciDB takes more than two hours on a single core, we have to split the build into two images. The <em>dockerfiles</em> for the two images are <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre"><code class="language-plaintext highlighter-rouge">Dockerfile.pre</code></a> and <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>. In the first image, we install all the dependencies, download the SciDB source code, and build a few of the SciDB components. The second image is based on the first. In it we finish building SciDB, install and set up SciDB, install <a href="https://github.com/Paradigm4/shim">Shim</a>, and set up the image entry point.</p>
<p>In the following, we review the two dockerfiles. We start with the dockerfile for the first image, <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre"><code class="language-plaintext highlighter-rouge">Dockerfile.pre</code></a>, and we continue with the dockerfile for the second image, <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>.</p>
<h1 id="pre-installation-tasks">Pre-Installation Tasks</h1>
<p>We take care of the pre-installation tasks, as well as part of the build, in the <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre"><code class="language-plaintext highlighter-rouge">Dockerfile.pre</code></a> file. The dockerfile starts by setting up the base image. We use <a href="https://www.debian.org/">Debian</a> Linux, more precisely the <em>Jessie</em> (<code class="language-plaintext highlighter-rouge">8</code>) release. We set the <code class="language-plaintext highlighter-rouge">TERM</code> and <code class="language-plaintext highlighter-rouge">DEBIAN_FRONTEND</code> environment variables to avoid warnings when installing packages (see <a href="https://github.com/docker/docker/issues/4032">here</a> and <a href="https://github.com/phusion/baseimage-docker/issues/58">here</a>). The current dockerfile looks like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM debian:8
ARG <span class="nv">TERM</span><span class="o">=</span>linux
ARG <span class="nv">DEBIAN_FRONTEND</span><span class="o">=</span>noninteractive
</code></pre></div></div>
<p>Next, we set a few environment variables for SciDB, like <code class="language-plaintext highlighter-rouge">SCIDB_VER</code>, <code class="language-plaintext highlighter-rouge">SCIDB_SOURCE_PATH</code>, <code class="language-plaintext highlighter-rouge">SCIDB_INSTALL_PATH</code>, etc. These will be used later by the build and install scripts of SciDB:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ENV <span class="nv">SCIDB_VER</span><span class="o">=</span>15.12 <span class="se">\</span>
<span class="nv">SCIDB_VER_MINOR</span><span class="o">=</span>1.4cadab5 <span class="se">\</span>
<span class="nv">SCIDB_SOURCE_URL</span><span class="o">=</span><span class="s2">"https://docs.google.com/uc?id=0B7yt0n33Us0raWtCYmNlZWRxWG8&export=download"</span>
ENV <span class="nv">SCIDB_SOURCE_PATH</span><span class="o">=</span>/usr/local/src/scidb-<span class="nv">$SCIDB_VER</span>.<span class="nv">$SCIDB_VER_MINOR</span> <span class="se">\</span>
<span class="nv">SCIDB_INSTALL_PATH</span><span class="o">=</span>/opt/scidb/<span class="nv">$SCIDB_VER</span> <span class="se">\</span>
<span class="nv">SCIDB_BUILD_TYPE</span><span class="o">=</span>Release
ENV <span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH</span>:<span class="nv">$SCIDB_INSTALL_PATH</span>/bin
</code></pre></div></div>
<p>We install the dependencies required to build and run SciDB next. Most of the dependencies can be installed using a single <code class="language-plaintext highlighter-rouge">apt-get install</code> command. Below is a snippet of the <code class="language-plaintext highlighter-rouge">RUN</code> statement. The full statement can be found <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre#L17">here</a>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Install dependencies</span>
RUN apt-get update <span class="o">&&</span> apt-get <span class="nb">install</span> <span class="nt">--assume-yes</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
apt-transport-https <span class="se">\</span>
bison <span class="se">\</span>
<span class="o">[</span>...]
<span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span>
</code></pre></div></div>
<p>Notice how we instruct <code class="language-plaintext highlighter-rouge">apt-get</code> not to install any recommended packages using <code class="language-plaintext highlighter-rouge">--no-install-recommends</code>, and how we clean up after <code class="language-plaintext highlighter-rouge">apt-get</code>. One special dependency is the Java Development Kit (JDK). <em>Jessie</em> comes with JDK version <code class="language-plaintext highlighter-rouge">7</code>, while SciDB requires JDK version <code class="language-plaintext highlighter-rouge">8</code> (<code class="language-plaintext highlighter-rouge">openjdk-8-jdk</code>). To address this, we use a special <em>Jessie</em> repository, the <code class="language-plaintext highlighter-rouge">jessie-backports</code> repository:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Install openjdk-8-jdk from jessie-backports</span>
<span class="c">## Install dependencies requiring default-jre-headless</span>
RUN <span class="nb">echo</span> <span class="s2">"deb http://http.debian.net/debian jessie-backports main"</span> <span class="o">></span> <span class="se">\</span>
/etc/apt/sources.list.d/jessie-backports.list <span class="o">&&</span> <span class="se">\</span>
apt-get update <span class="o">&&</span> apt-get <span class="nb">install</span> <span class="nt">--assume-yes</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
ant <span class="se">\</span>
ant-contrib <span class="se">\</span>
junit <span class="se">\</span>
libprotobuf-java <span class="se">\</span>
openjdk-8-jdk <span class="se">\</span>
openjdk-8-jre-headless <span class="se">\</span>
<span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span>
</code></pre></div></div>
<p>Another special case is the C++ library for communication with PostgreSQL, <code class="language-plaintext highlighter-rouge">libpqxx</code>. <em>Jessie</em> comes with version <code class="language-plaintext highlighter-rouge">4</code>, while SciDB requires version <code class="language-plaintext highlighter-rouge">3</code>. To address this, we build version <code class="language-plaintext highlighter-rouge">3</code> of the library from source. We use the source repository of the previous version of Debian, <code class="language-plaintext highlighter-rouge">wheezy</code>. We install the dependencies required to build the library, build it from source and generate a package, uninstall the build dependencies, install the generated package, and, finally, clean up the intermediate files. Below is a snippet of the <code class="language-plaintext highlighter-rouge">RUN</code> statement. The full statement can be found <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre#L61">here</a>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Build and install libpqxx3 from wheezy</span>
RUN <span class="nb">echo</span> <span class="s2">"deb-src http://http.debian.net/debian wheezy main"</span> <span class="o">></span> <span class="se">\</span>
/etc/apt/sources.list.d/wheezy.list <span class="o">&&</span> <span class="se">\</span>
apt-get update <span class="o">&&</span> apt-get build-dep <span class="nt">--assume-yes</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
libpqxx3 <span class="se">\</span>
<span class="o">&&</span> <span class="nb">mkdir</span> /usr/local/src/libpqxx3 <span class="o">&&</span> <span class="nb">cd</span> /usr/local/src/libpqxx3 <span class="o">&&</span> <span class="se">\</span>
apt-get <span class="nb">source</span> <span class="nt">--build</span> <span class="se">\</span>
libpqxx3 <span class="se">\</span>
<span class="o">&&</span> apt-get purge <span class="nt">--assume-yes</span> <span class="se">\</span>
autotools-dev <span class="se">\</span>
bsdmainutils <span class="se">\</span>
<span class="o">[</span>...]
<span class="o">&&</span> dpkg <span class="nt">--install</span> <span class="se">\</span>
libpqxx-3.1_3.1-1.1_amd64.deb <span class="se">\</span>
libpqxx3-dev_3.1-1.1_amd64.deb <span class="se">\</span>
<span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> <span class="se">\</span>
/etc/apt/sources.list.d/wheezy.list <span class="se">\</span>
/usr/local/src/libpqxx3 <span class="se">\</span>
/var/lib/apt/lists/<span class="k">*</span>
</code></pre></div></div>
<p>The last set of dependencies are the packages provided by Paradigm4. Paradigm4 does not provide packages for Debian, instead, they provide packages for Ubuntu. Since Debian and Ubuntu use the same package management system, we can use the packages provided for Ubuntu as-is. We add the Paradigm4 repository to our list of repositories and install the required packages:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Install Paradigm4 packages</span>
RUN wget <span class="nt">--no-verbose</span> <span class="nt">--output-document</span> - https://downloads.paradigm4.com/key | <span class="se">\</span>
apt-key add - <span class="o">&&</span> <span class="se">\</span>
<span class="nb">echo</span> <span class="s2">"deb https://downloads.paradigm4.com/ ubuntu14.04/3rdparty/"</span> <span class="o">></span> <span class="se">\</span>
/etc/apt/sources.list.d/scidb.list <span class="o">&&</span> <span class="se">\</span>
apt-get update <span class="o">&&</span> apt-get <span class="nb">install</span> <span class="nt">--assume-yes</span> <span class="nt">--no-install-recommends</span> <span class="se">\</span>
scidb-<span class="nv">$SCIDB_VER</span><span class="nt">-ant</span> <span class="se">\</span>
scidb-<span class="nv">$SCIDB_VER</span><span class="nt">-cityhash</span> <span class="se">\</span>
scidb-<span class="nv">$SCIDB_VER</span><span class="nt">-libboost1</span>.54-all-dev <span class="se">\</span>
scidb-<span class="nv">$SCIDB_VER</span><span class="nt">-libcsv</span> <span class="se">\</span>
scidb-<span class="nv">$SCIDB_VER</span><span class="nt">-libmpich2-dev</span> <span class="se">\</span>
scidb-<span class="nv">$SCIDB_VER</span><span class="nt">-mpich2</span> <span class="se">\</span>
<span class="o">&&</span> <span class="nb">rm</span> <span class="nt">-rf</span> /var/lib/apt/lists/<span class="k">*</span>
</code></pre></div></div>
<p>Normally, these packages would be installed using the <code class="language-plaintext highlighter-rouge">deploy.sh</code> script provided with SciDB. For full control, we skip using this script and install the dependencies manually.</p>
<h1 id="building-scidb">Building SciDB</h1>
<p>To build SciDB, we first download the source code. The official SciDB source code is hosted on <a href="https://drive.google.com/folderview?id=0B7yt0n33Us0rT1FJdmxFV2g0OHc&usp=drive_web#list">Google Drive</a>. Downloading a file from Google Drive requires two requests: the first obtains cookies and a confirmation code, which are then used in the second. We extract the source code directly and skip saving the archive:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Get SciDB source code</span>
RUN wget <span class="nt">--no-verbose</span> <span class="nt">--output-document</span> - <span class="nt">--load-cookies</span> cookies.txt <span class="se">\</span>
<span class="s2">"</span><span class="nv">$SCIDB_SOURCE_URL</span><span class="s2">&</span><span class="sb">`</span>wget <span class="nt">--no-verbose</span> <span class="nt">--output-document</span> - <span class="se">\</span>
<span class="nt">--save-cookies</span> cookies.txt <span class="s2">"</span><span class="nv">$SCIDB_SOURCE_URL</span><span class="s2">"</span> | <span class="se">\</span>
<span class="nb">grep</span> <span class="nt">--only-matching</span> <span class="s1">'confirm=[^&]*'</span><span class="sb">`</span><span class="s2">"</span> | <span class="se">\</span>
<span class="nb">tar</span> <span class="nt">--extract</span> <span class="nt">--gzip</span> <span class="nt">--directory</span><span class="o">=</span>/usr/local/src
</code></pre></div></div>
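<p>The token-extraction step buried in the inner <code class="language-plaintext highlighter-rouge">wget</code> can be tried in isolation. Here is a minimal sketch using a made-up response string in place of the page Google Drive actually returns:</p>

```shell
## Simulate the HTML returned by the first Google Drive request; the
## real page embeds a link containing a confirm=... parameter (the
## id and token values below are made up for illustration).
response='<a href="/uc?id=0B7&confirm=AbCd&export=download">Download</a>'

## Extract the confirmation token the same way the dockerfile does:
## grab the confirm= parameter up to the next '&'.
token=$(echo "$response" | grep --only-matching 'confirm=[^&]*')
echo "$token"  # prints: confirm=AbCd
```

In the dockerfile, this token is appended to <code class="language-plaintext highlighter-rouge">$SCIDB_SOURCE_URL</code> for the second request, together with the cookies saved from the first.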
<p>Next, we apply a set of patches provided by Paradigm4, also located on <a href="https://drive.google.com/drive/folders/0B8eyzr2ndWOTUmRsZUJoQU1tTmc">Google Drive</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Apply SciDB patches</span>
ADD https://docs.google.com/uc?id<span class="o">=</span>0B8eyzr2ndWOTSFRXWHhOc1ZYTGM&export<span class="o">=</span>download <span class="se">\</span>
<span class="nv">$SCIDB_SOURCE_PATH</span>/src/query/ops/input/ChunkLoader.h
ADD https://docs.google.com/uc?id<span class="o">=</span>0B8eyzr2ndWOTakhoVjloS2l1aVE&export<span class="o">=</span>download <span class="se">\</span>
<span class="nv">$SCIDB_SOURCE_PATH</span>/src/query/ops/input/ChunkLoader.cpp
</code></pre></div></div>
<p>Since SciDB is not intended to be built on Debian, we patch a few of the build scripts such that they run successfully under Debian. The full set of patches can be inspected <a href="https://github.com/rvernica/docker-library/tree/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/patch">here</a>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Apply Debian 8 patches</span>
COPY patch <span class="nv">$SCIDB_SOURCE_PATH</span><span class="nt">-patch</span>/
RUN <span class="nb">cd</span> <span class="nv">$SCIDB_SOURCE_PATH</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">cat</span> <span class="nv">$SCIDB_SOURCE_PATH</span><span class="nt">-patch</span>/<span class="k">*</span> | patch <span class="nt">--strip</span><span class="o">=</span>1
</code></pre></div></div>
<p>Finally, we can start building SciDB. In <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre"><code class="language-plaintext highlighter-rouge">Dockerfile.pre</code></a> we only build some of the libraries:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Build SciDB libraries (first few libs only)</span>
RUN <span class="nb">cd</span> <span class="nv">$SCIDB_SOURCE_PATH</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">env </span><span class="nv">PATH</span><span class="o">=</span><span class="nv">$PATH</span>:/opt/scidb/<span class="nv">$SCIDB_VER</span>/3rdparty/mpich2/bin <span class="se">\</span>
./run.py setup <span class="nt">--force</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">cd </span>stage/build <span class="o">&&</span> make <span class="nt">-j2</span> <span class="se">\</span>
json_lib <span class="se">\</span>
MurmurHash_lib <span class="se">\</span>
util_lib <span class="se">\</span>
scidb_msg_lib <span class="se">\</span>
genmeta <span class="se">\</span>
catalog_lib <span class="se">\</span>
array_lib <span class="se">\</span>
system_lib <span class="se">\</span>
compression_lib <span class="se">\</span>
ops_lib <span class="se">\</span>
scalar_proc_lib <span class="se">\</span>
qproc_lib <span class="se">\</span>
usr_namespace_lib <span class="se">\</span>
io_lib <span class="se">\</span>
network_lib
</code></pre></div></div>
<p>This concludes the first dockerfile. We continue our review with the second dockerfile, <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>. We base this image on the image built using <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre"><code class="language-plaintext highlighter-rouge">Dockerfile.pre</code></a> and the first command is to continue and finish building SciDB:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FROM rvernica/scidb:15.12-pre
<span class="c">## Build SciDB (leftover)</span>
RUN <span class="nv">$SCIDB_SOURCE_PATH</span>/run.py make <span class="nt">-j2</span>
</code></pre></div></div>
<p>Next, we set some build arguments and corresponding environment variables for the SciDB installation running in this container. The build arguments specified with <code class="language-plaintext highlighter-rouge">ARG</code> (see <a href="https://docs.docker.com/engine/reference/builder/#/arg">Docker documentation</a>) can be modified at build time, if required. We will also install <a href="https://github.com/Paradigm4/shim">Shim</a> in this image and pin down a Shim version using the <code class="language-plaintext highlighter-rouge">SHA-1</code> of a GitHub commit. This avoids the surprise of picking up a newer and possibly incompatible Shim version at a later time:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ARG <span class="nv">SCIDB_INSTANCE_NUM</span><span class="o">=</span>2
ARG <span class="nv">SCIDB_NAME</span><span class="o">=</span>scidb
ARG <span class="nv">SCIDB_LOG_LEVEL</span><span class="o">=</span>WARN
ENV <span class="nv">SCIDB_INSTANCE_NUM</span><span class="o">=</span><span class="nv">$SCIDB_INSTANCE_NUM</span> <span class="se">\</span>
<span class="nv">SCIDB_NAME</span><span class="o">=</span><span class="nv">$SCIDB_NAME</span> <span class="se">\</span>
<span class="nv">SCIDB_DATA_PATH</span><span class="o">=</span><span class="nv">$SCIDB_INSTALL_PATH</span>/DB-<span class="nv">$SCIDB_NAME</span>
ENV <span class="nv">SHIM_SHA1</span><span class="o">=</span>854a4fb6c8f14e39010138ea045f0d3b431c607d <span class="se">\</span>
<span class="nv">SHIM_VERSION</span><span class="o">=</span>v<span class="nv">$SCIDB_VER</span><span class="nt">-20-g854a</span>
</code></pre></div></div>
<p>We now set up a password-less SSH server in the container. This might look redundant, but it is required by the installation script. The script assumes the installation is made on multiple hosts at a time and logs in to each of them, even if the installation is only done on the current host. Moreover, we need to modify some settings in the Linux Pluggable Authentication Module (PAM) for the SSH server so that the SSH server allows connections inside the container (see <a href="https://docs.docker.com/engine/examples/running_ssh_service/">here</a>):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Setup SSH</span>
RUN <span class="nb">sed</span> <span class="nt">--in-place</span> <span class="se">\</span>
<span class="s1">'s/session\s*required\s*pam_loginuid.so/session optional pam_loginuid.so/g'</span> <span class="se">\</span>
/etc/pam.d/sshd <span class="o">&&</span> <span class="se">\</span>
<span class="nb">echo</span> <span class="s1">'StrictHostKeyChecking no'</span> <span class="o">>></span> /etc/ssh/ssh_config <span class="o">&&</span> <span class="se">\</span>
ssh-keygen <span class="nt">-f</span> /root/.ssh/id_rsa <span class="nt">-q</span> <span class="nt">-N</span> <span class="s2">""</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">cp</span> /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
</code></pre></div></div>
<p>Before installing SciDB, we also need to set up PostgreSQL. We generate a random password for the SciDB database user and save it in the <code class="language-plaintext highlighter-rouge">.pgpass</code> file. Once we start the SSH and PostgreSQL services, we are ready to install SciDB inside the container. We use the <code class="language-plaintext highlighter-rouge">run.py</code> script provided with SciDB to install SciDB. As a final step, we set the log level to the one configured in the environment:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Setup PostgreSQL and SciDB</span>
RUN <span class="nb">echo</span> <span class="s2">"127.0.0.1:5432:</span><span class="nv">$SCIDB_NAME</span><span class="s2">:</span><span class="nv">$SCIDB_NAME</span><span class="s2">:</span><span class="sb">`</span><span class="nb">date</span> +%s | <span class="nb">sha256sum</span> | <span class="nb">base64</span> | <span class="nb">head</span> <span class="nt">-c</span> 32<span class="sb">`</span><span class="s2">"</span> <span class="se">\</span>
<span class="o">></span> /root/.pgpass <span class="o">&&</span> <span class="se">\</span>
<span class="nb">chmod </span>go-r /root/.pgpass <span class="o">&&</span> <span class="se">\</span>
service ssh start <span class="o">&&</span> <span class="se">\</span>
service postgresql start <span class="o">&&</span> <span class="se">\</span>
<span class="nb">echo </span>n | <span class="nv">$SCIDB_SOURCE_PATH</span>/run.py <span class="nb">install</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">sed</span> <span class="nt">--in-place</span> <span class="se">\</span>
s/log4j.rootLogger<span class="o">=</span>DEBUG/log4j.rootLogger<span class="o">=</span><span class="nv">$SCIDB_LOG_LEVEL</span>/ <span class="se">\</span>
<span class="nv">$SCIDB_INSTALL_PATH</span>/share/scidb/log1.properties
</code></pre></div></div>
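<p>The password-generating pipeline in the first line can be sanity-checked on its own: hashing the current time, Base64-encoding the result, and keeping the first 32 bytes always yields a 32-character string (Base64 wraps lines at 76 characters, so no newline can appear within the first 32):</p>

```shell
## Generate a password the same way the dockerfile does: hash the
## current Unix time, Base64-encode the hash, keep the first 32 bytes.
password=$(date +%s | sha256sum | base64 | head -c 32)

## The result is always exactly 32 characters long.
echo "${#password}"  # prints: 32
```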
<p>Next, we install Shim. We download and unpack the source code of the pinned version from GitHub. We set the Shim version in the <code class="language-plaintext highlighter-rouge">Makefile</code> and run the <code class="language-plaintext highlighter-rouge">make service</code> command which compiles and installs Shim as a service:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Install Shim</span>
RUN wget <span class="nt">--no-verbose</span> <span class="nt">--output-document</span> - <span class="se">\</span>
https://github.com/Paradigm4/shim/archive/<span class="nv">$SHIM_SHA1</span>.tar.gz | <span class="se">\</span>
<span class="nb">tar</span> <span class="nt">--extract</span> <span class="nt">--gzip</span> <span class="nt">--directory</span><span class="o">=</span>/usr/local/src <span class="o">&&</span> <span class="se">\</span>
<span class="nb">cd</span> /usr/local/src/shim-<span class="nv">$SHIM_SHA1</span> <span class="o">&&</span> <span class="se">\</span>
<span class="nb">sed</span> <span class="nt">--in-place</span> <span class="s2">"s/^GIT_VERSION := .*</span><span class="nv">$/</span><span class="s2">GIT_VERSION := </span><span class="nv">$SHIM_VERSION</span><span class="s2">/"</span> src/Makefile <span class="o">&&</span> <span class="se">\</span>
make service
</code></pre></div></div>
<p>We finalize the image by setting an <code class="language-plaintext highlighter-rouge">ENTRYPOINT</code> script (see <a href="https://docs.docker.com/engine/reference/builder/#/entrypoint">Docker documentation</a>) and exposing the SciDB and Shim ports. The entry point script is discussed in the next section.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>COPY docker-entrypoint.sh /
ENTRYPOINT <span class="o">[</span><span class="s2">"/docker-entrypoint.sh"</span><span class="o">]</span>
<span class="c">## Port | App</span>
<span class="c">## -----+-----</span>
<span class="c">## 1239 | SciDB iquery</span>
<span class="c">## 8080 | SciDB Shim (HTTP)</span>
<span class="c">## 8083 | SciDB Shim (HTTPS)</span>
EXPOSE 1239 8080 8083
</code></pre></div></div>
<h1 id="entry-point-script">Entry Point Script</h1>
<p>An entry point script in a Docker image is executed every time the container starts. It is normally used for starting and initializing various services in the container. In our case, we use this script to start SSH, PostgreSQL, Shim, and SciDB. The <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/docker-entrypoint.sh"><code class="language-plaintext highlighter-rouge">docker-entrypoint.sh</code></a> script looks like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="nb">set</span> <span class="nt">-o</span> errexit
service ssh start
service postgresql start
service shimsvc start
<span class="nv">$SCIDB_INSTALL_PATH</span>/bin/scidb.py startall <span class="nv">$SCIDB_NAME</span>
<span class="nb">trap</span> <span class="s2">"</span><span class="nv">$SCIDB_INSTALL_PATH</span><span class="s2">/bin/scidb.py stopall </span><span class="nv">$SCIDB_NAME</span><span class="s2">; </span><span class="se">\</span><span class="s2">
service postgresql stop"</span> <span class="se">\</span>
EXIT HUP INT QUIT TERM
<span class="k">if</span> <span class="o">[</span> <span class="s2">"</span><span class="nv">$1</span><span class="s2">"</span> <span class="o">=</span> <span class="s1">''</span> <span class="o">]</span>
<span class="k">then
</span><span class="nb">tail</span> <span class="nt">-f</span> <span class="nv">$SCIDB_DATA_PATH</span>/0/0/scidb.log
<span class="k">else
</span><span class="nb">exec</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">fi</span>
</code></pre></div></div>
<p>Once we start all the services, we trap any exit or interrupt signals and stop the SciDB and PostgreSQL services when such signals are generated. This allows us to do a clean shutdown of the databases when the container is stopped. If a command is provided when the container is started (<code class="language-plaintext highlighter-rouge">$1</code>), for example, <code class="language-plaintext highlighter-rouge">bash</code>, we execute that command; otherwise, we tail the SciDB logs.</p>
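<p>The trap-based cleanup pattern generalizes beyond SciDB. A minimal sketch, with an <code class="language-plaintext highlighter-rouge">echo</code> standing in for the real stop commands:</p>

```shell
#!/bin/bash
## Minimal sketch of the entry point's shutdown pattern: the trap runs
## the cleanup function when the script exits or receives a signal.
cleanup() {
    ## In the real entry point this is scidb.py stopall and
    ## service postgresql stop; echo stands in for illustration.
    echo "services stopped"
}
trap cleanup EXIT HUP INT QUIT TERM

echo "services running"
## When the script reaches the end (or is interrupted), the trap fires.
# prints: services running
# prints: services stopped
```

Because the trap is registered before the long-running foreground command (in our case, <code class="language-plaintext highlighter-rouge">tail</code>), a <code class="language-plaintext highlighter-rouge">docker stop</code>, which sends <code class="language-plaintext highlighter-rouge">TERM</code>, triggers the cleanup before the container exits.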
<h1 id="using-the-image">Using the Image</h1>
<p>Using the two dockerfiles, <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile.pre"><code class="language-plaintext highlighter-rouge">Dockerfile.pre</code></a> and <a href="https://github.com/rvernica/docker-library/blob/30fae11aad94e9fd38e67168e79a2ed80ca660d5/scidb/15.12/Dockerfile"><code class="language-plaintext highlighter-rouge">Dockerfile</code></a>, we can build the two images locally like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker build <span class="nt">--tag</span> rvernica/scidb:15.12-pre <span class="nt">--file</span> Dockerfile.pre <span class="nb">.</span>
Sending build context to Docker daemon 34.82 kB
Step 1/16 : FROM debian:8
<span class="nt">---</span><span class="o">></span> 1b088884749b
...
Step 16/16 : RUN <span class="nb">cd</span> <span class="nv">$SCIDB_SOURCE_PATH</span> <span class="o">&&</span> ...
...
<span class="nt">---</span><span class="o">></span> f8e9e0a1fe8a
Removing intermediate container 754d00074d53
Successfully built f8e9e0a1fe8a
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker build <span class="nt">--tag</span> rvernica/scidb:15.12 <span class="nb">.</span>
Sending build context to Docker daemon 34.82 kB
Step 1/14 : FROM rvernica/scidb:15.12-pre
<span class="nt">---</span><span class="o">></span> f8e9e0a1fe8a
...
Step 14/14 : EXPOSE 1239 8080 8083
<span class="nt">---</span><span class="o">></span> Running <span class="k">in </span>0b895e473a16
<span class="nt">---</span><span class="o">></span> 4536cbbc3a0e
Removing intermediate container 0b895e473a16
Successfully built 4536cbbc3a0e
</code></pre></div></div>
<p>As an alternative, we can download the already built images from <a href="https://hub.docker.com/r/rvernica/scidb/">Docker Hub</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker pull rvernica/scidb:15.12-pre
15.12-pre: Pulling from rvernica/scidb
386a066cd84a: Pull <span class="nb">complete
</span>3364855bee9a: Pull <span class="nb">complete
</span>1d5a83062528: Pull <span class="nb">complete
</span>58b5c175470a: Pull <span class="nb">complete
</span>725863ff1c79: Pull <span class="nb">complete
</span>8d3cadf8ac47: Pull <span class="nb">complete
</span>066e3f9e305c: Pull <span class="nb">complete
</span>24cf7b021165: Pull <span class="nb">complete
</span>fa8345d54686: Pull <span class="nb">complete
</span>0a77a3de8243: Pull <span class="nb">complete
</span>b7a2f2bab106: Pull <span class="nb">complete
</span>Digest: sha256:439c80c3232465236c97ba0aa4880188b7c36117a377c568ab31823174d80597
Status: Downloaded newer image <span class="k">for </span>rvernica/scidb:15.12-pre
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker pull rvernica/scidb:15.12
15.12: Pulling from rvernica/scidb
386a066cd84a: Already exists
3364855bee9a: Already exists
1d5a83062528: Already exists
58b5c175470a: Already exists
725863ff1c79: Already exists
8d3cadf8ac47: Already exists
066e3f9e305c: Already exists
24cf7b021165: Already exists
fa8345d54686: Already exists
0a77a3de8243: Already exists
b7a2f2bab106: Already exists
e1adf4ff3e84: Pull <span class="nb">complete
</span>0bdcfbd8cc1e: Pull <span class="nb">complete
</span>83e1b10fc795: Pull <span class="nb">complete
</span>ca9967ab1660: Pull <span class="nb">complete
</span>e10871d83d82: Pull <span class="nb">complete
</span>Digest: sha256:e90e7e6da3b912939e47269cacdbd5fcbff275a4755dd19dfe601cff95fcac50
Status: Downloaded newer image <span class="k">for </span>rvernica/scidb:15.12
</code></pre></div></div>
<p>We can download <code class="language-plaintext highlighter-rouge">scidb:15.12</code> directly, without downloading <code class="language-plaintext highlighter-rouge">scidb:15.12-pre</code> first; Docker automatically downloads the necessary layers. Once we have the images, we can check their size with <code class="language-plaintext highlighter-rouge">docker images</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
rvernica/scidb 15.12 e259f6d496ba 7 days ago 1.89 GB
rvernica/scidb 15.12-pre 913b76a3a60c 7 days ago 1.53 GB
</code></pre></div></div>
<p>Note that the total space occupied on disk is not the sum of their sizes, but the maximum, because the two images share layers. Now, we can start a Docker container using:</p>
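<p>As a back-of-the-envelope check (a sketch using the sizes from the listing above), the shared layers mean the disk usage of both images together equals the size of the larger one:</p>

```python
# Sizes reported by `docker images` (in GB). All of the -pre image's
# layers are reused by the full image, so only the layers unique to the
# full image add disk usage.
pre_size = 1.53
full_size = 1.89

extra_layers = full_size - pre_size   # layers unique to the full image
total_on_disk = pre_size + extra_layers

print(round(total_on_disk, 2))  # 1.89 -- max of the two sizes, not the sum
```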
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker run <span class="nt">--tty</span> <span class="nt">--interactive</span> rvernica/scidb:15.12 bash
<span class="o">[</span> ok <span class="o">]</span> Starting OpenBSD Secure Shell server: sshd.
<span class="o">[</span> ok <span class="o">]</span> Starting PostgreSQL 9.4 database server: main.
Starting shim
shim: SciDB HTTP service started on port<span class="o">(</span>s<span class="o">)</span> 8080,8083, with web root <span class="o">[</span>/var/lib/shim/wwwroot], talking to SciDB on port 1239
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 0<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 1<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
root@3cb209a92b40:/# iquery <span class="nt">--afl</span> <span class="nt">--query</span> <span class="s2">"list('libraries')"</span>
<span class="o">{</span>inst,n<span class="o">}</span> name,major,minor,patch,build,build_type
<span class="o">{</span>0,0<span class="o">}</span> <span class="s1">'SciDB'</span>,15,12,1,80403125,<span class="s1">'Release'</span>
<span class="o">{</span>1,0<span class="o">}</span> <span class="s1">'SciDB'</span>,15,12,1,80403125,<span class="s1">'Release'</span>
root@3cb209a92b40:/# <span class="nb">exit
exit</span>
</code></pre></div></div>
<p>Notice how we specify <code class="language-plaintext highlighter-rouge">bash</code> as the command to be passed to the entry point script. The command is executed by the entry point script once it completes starting SciDB. In the Bash terminal, we can then connect to SciDB using <code class="language-plaintext highlighter-rouge">iquery</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/The+iquery+Client">documentation</a>). We can also directly start the <code class="language-plaintext highlighter-rouge">iquery</code> client without using a Bash terminal:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker run <span class="nt">--tty</span> <span class="nt">--interactive</span> rvernica/scidb:15.12 iquery <span class="nt">--afl</span>
<span class="o">[</span> ok <span class="o">]</span> Starting OpenBSD Secure Shell server: sshd.
<span class="o">[</span> ok <span class="o">]</span> Starting PostgreSQL 9.4 database server: main.
Starting shim
shim: SciDB HTTP service started on port<span class="o">(</span>s<span class="o">)</span> 8080,8083, with web root <span class="o">[</span>/var/lib/shim/wwwroot], talking to SciDB on port 1239
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 0<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 1<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
AFL%
</code></pre></div></div>
<p>If we don’t specify any command, the container will tail the SciDB logs and shut down gracefully when asked to terminate:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker run <span class="nt">--tty</span> <span class="nt">--interactive</span> rvernica/scidb:15.12
<span class="o">[</span> ok <span class="o">]</span> Starting OpenBSD Secure Shell server: sshd.
<span class="o">[</span> ok <span class="o">]</span> Starting PostgreSQL 9.4 database server: main.
Starting shim
shim: SciDB HTTP service started on port<span class="o">(</span>s<span class="o">)</span> 8080,8083, with web root <span class="o">[</span>/var/lib/shim/wwwroot], talking to SciDB on port 1239
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 0<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 1<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
load <span class="o">=</span> fn<span class="o">(</span>output_array,input_file,instance_id,format,max_errors,shadow_array,isStrict<span class="o">){</span>store<span class="o">(</span>input<span class="o">(</span>output_array,input_file,instance_id,format,max_errors,shadow_array,isStrict<span class="o">)</span>,output_array<span class="o">)}</span><span class="p">;</span>
sys_create_array_aux <span class="o">=</span> fn<span class="o">(</span>_A_,_E_,_C_<span class="o">){</span><span class="nb">join</span><span class="o">(</span>aggregate<span class="o">(</span>apply<span class="o">(</span>_A_,_t_,_E_<span class="o">)</span>,approxdc<span class="o">(</span>_t_<span class="o">))</span>,build<span class="o">(</span><values_per_chunk:uint64 null>[i<span class="o">=</span>0:0,1,0],_C_<span class="o">))}</span><span class="p">;</span>
sys_create_array_att <span class="o">=</span> fn<span class="o">(</span>_L_,_S_,_D_<span class="o">){</span>redimension<span class="o">(</span><span class="nb">join</span><span class="o">(</span>build<span class="o">(</span><n:int64 null,lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null>[No<span class="o">=</span>0:0,1,0],_S_,true<span class="o">)</span>,cast<span class="o">(</span>aggregate<span class="o">(</span>_L_,min<span class="o">(</span>_D_<span class="o">)</span>,max<span class="o">(</span>_D_<span class="o">)</span>,approxdc<span class="o">(</span>_D_<span class="o">))</span>,<min:int64 null,max:int64 null,count:int64 null>[No<span class="o">=</span>0:0,1,0]<span class="o">))</span>,<lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null,min:int64 null,max:int64 null,count:int64 null>[n<span class="o">=</span>0:<span class="k">*</span>,?,0]<span class="o">)}</span><span class="p">;</span>
sys_create_array_dim <span class="o">=</span> fn<span class="o">(</span>_L_,_S_,_D_<span class="o">){</span>redimension<span class="o">(</span><span class="nb">join</span><span class="o">(</span>build<span class="o">(</span><n:int64 null,lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null>[No<span class="o">=</span>0:0,1,0],_S_,true<span class="o">)</span>,cast<span class="o">(</span>aggregate<span class="o">(</span>apply<span class="o">(</span>aggregate<span class="o">(</span>_L_,count<span class="o">(</span><span class="k">*</span><span class="o">)</span>,_D_<span class="o">)</span>,_t_,_D_<span class="o">)</span>,min<span class="o">(</span>_t_<span class="o">)</span>,max<span class="o">(</span>_t_<span class="o">)</span>,count<span class="o">(</span><span class="k">*</span><span class="o">))</span>,<min:int64 null,max:int64 null,count:int64 null>[No<span class="o">=</span>0:0,1,0]<span class="o">))</span>,<lo:int64 null,hi:int64 null,ci:int64 null,co:int64 null,min:int64 null,max:int64 null,count:int64 null>[n<span class="o">=</span>0:<span class="k">*</span>,?,0]<span class="o">)}</span>
2016-11-23 21:48:37,126 <span class="o">[</span>0x7f5d36edf7c0] <span class="o">[</span>DEBUG]: Network manager is intialized
2016-11-23 21:48:37,126 <span class="o">[</span>0x7f5d36edf7c0] <span class="o">[</span>DEBUG]: NetworkManager::run<span class="o">()</span>
2016-11-23 21:48:37,126 <span class="o">[</span>0x7f5d36edf7c0] <span class="o">[</span>DEBUG]: server-id <span class="o">=</span> 0
2016-11-23 21:48:37,126 <span class="o">[</span>0x7f5d36edf7c0] <span class="o">[</span>DEBUG]: server-instance-id <span class="o">=</span> 0
2016-11-23 21:48:37,136 <span class="o">[</span>0x7f5d36edf7c0] <span class="o">[</span>DEBUG]: Registered instance <span class="c"># 0</span>
2016-11-23 21:48:37,136 <span class="o">[</span>0x7f5d36edf7c0] <span class="o">[</span>INFO <span class="o">]</span>: SciDB instance. SciDB Version: 15.12.1. Build Type: Release. Commit: 4cadab5. Copyright <span class="o">(</span>C<span class="o">)</span> 2008-2015 SciDB, Inc. is exiting.
^C
scidb.py: INFO: stop<span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span>
scidb.py: INFO: checking <span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span> 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: checking <span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span> 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: checking <span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span> 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: checking <span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span> 183 184...
scidb.py: INFO: Found 2 scidb processes
scidb.py: INFO: Found 0 scidb processes
<span class="o">[</span> ok <span class="o">]</span> Stopping PostgreSQL 9.4 database server: main.
scidb.py: INFO: stop<span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span>
scidb.py: INFO: Found 0 scidb processes
<span class="o">[</span> ok <span class="o">]</span> Stopping PostgreSQL 9.4 database server: main.
</code></pre></div></div>
<p>The dockerfiles discussed here, as well as other dockerfiles, including dockerfiles for SciDB <code class="language-plaintext highlighter-rouge">15.7</code>, are available <a href="https://github.com/rvernica/docker-library/tree/master/scidb">here</a>. The Docker images built using these dockerfiles are available <a href="https://hub.docker.com/r/rvernica/scidb/">here</a>.</p>Rares VernicaIn an earlier post, we looked at how to create a Docker image for SciDB. The image built in that post followed the SciDB Community Edition Installation Guide very closely. The image is functional and a good learning resource, but not very efficient. The image uses around 6GB of space and cannot be built automatically on Docker Hub due to its long build time. In this post, we revisit this topic and try to build a more efficient Docker image for SciDB. The source files for the image are available on GitHub here. The image is available on Docker Hub here.Extending SciDB - Part 12016-10-01T00:00:00+00:002016-10-01T00:00:00+00:00http://rvernica.github.io/2016/10/extend-scidb-doc<p>One of the strengths of SciDB over other database management systems is its extensibility.<sup id="bn1"><a href="#fn1">1</a></sup> SciDB allows the user to add new data types, functions, and operators. In this multi-part post, we discuss various aspects of extending SciDB. In this post we look at the available documentation and how to set up the development tools.</p>
<h1 id="documentation-on-extending-scidb">Documentation on Extending SciDB</h1>
<p>The official documentation on extending SciDB is at best sparse. There is some discussion in <a href="https://youtu.be/SsF_Mke0Mlw?t=10040">this</a> SciDB tutorial from 2013. (We talk more about tutorials in SciDB in <a href="/2016/07/tutorials">this post.</a>) Older SciDB versions include a special chapter in the documentation on <em>User-Defined Types and Functions</em>. The most recent such chapter is for SciDB version <code class="language-plaintext highlighter-rouge">14.8</code> and is available <a href="http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug/ch15.html">here</a>. More recent SciDB versions do not contain such a chapter in the documentation. This chapter gives the reader an overview of what it takes to create user-defined types and functions.</p>
<p>Another useful piece of documentation is the documentation embedded in the SciDB code. This documentation is not officially supported by Paradigm4, but can be generated from the source code using the <a href="http://www.stack.nl/~dimitri/doxygen/">Doxygen</a> documentation generator. This can be done right after completing the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Installing+SciDB+Community+Edition#InstallingSciDBCommunityEdition-BuildingSciDBCE">Building SciDB CE</a> step of the official <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Community+Edition+Installation+Guide">SciDB Community Edition Installation Guide</a>. The following sequence of commands can be used:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> <span class="nb">cd</span> <dev_dir>/scidbtrunk
<span class="o">></span> <span class="nb">echo</span> <span class="s1">'add_subdirectory("doc")'</span> <span class="o">>></span> CMakeLists.txt
<span class="o">></span> ./run.py setup
<span class="o">></span> <span class="nb">cd </span>stage/build
<span class="o">></span> make doc
<span class="o">></span> <span class="nb">ls </span>doc/api/html/index.html
</code></pre></div></div>
<p>Note that the Doxygen package needs to be installed first. The SciDB code documentation can now be browsed starting from the <code class="language-plaintext highlighter-rouge">index.html</code> file listed above.</p>
<p>Finally, the best available “documentation” on extending SciDB is the set of existing plugins provided by Paradigm4 on their <a href="https://github.com/Paradigm4">GitHub page</a> or by third-party developers. One notable plugin is the <code class="language-plaintext highlighter-rouge">dev_tools</code> plugin (see <a href="https://github.com/Paradigm4/dev_tools">GitHub</a>). We discuss this plugin next.</p>
<h1 id="installing-the-development-tools">Installing the Development Tools</h1>
<p>One of the SciDB plugins provided by Paradigm4 on their <a href="https://github.com/Paradigm4">GitHub page</a> is the <code class="language-plaintext highlighter-rouge">dev_tools</code> plugin (see <a href="https://github.com/Paradigm4/dev_tools">GitHub</a>). Once installed, this plugin allows the user to install other SciDB plugins directly from GitHub. The plugin comes with a detailed <a href="https://github.com/Paradigm4/dev_tools/blob/master/README.md">README</a> file describing how to install it for different SciDB versions and Linux distributions. We highly recommend installing this plugin first. After installation, we can load it and verify the available operators using:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">load_library</span><span class="p">(</span><span class="s">'dev_tools'</span><span class="p">);</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="n">AFL</span><span class="o">%</span> <span class="nb">filter</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="s">'operators'</span><span class="p">),</span> <span class="n">library</span> <span class="o">=</span> <span class="s">'dev_tools'</span><span class="p">);</span>
<span class="p">{</span><span class="n">No</span><span class="p">}</span> <span class="n">name</span><span class="p">,</span><span class="n">library</span>
<span class="p">{</span><span class="mi">21</span><span class="p">}</span> <span class="s">'install_github'</span><span class="p">,</span><span class="s">'dev_tools'</span>
</code></pre></div></div>
<p>Notice the newly available operator <code class="language-plaintext highlighter-rouge">install_github</code>. See the <a href="https://github.com/Paradigm4/dev_tools/blob/master/README.md#synopsis">Synopsis</a> and the <a href="https://github.com/Paradigm4/dev_tools/blob/master/README.md#example">Example</a> provided in the plugin README file for more information on how to use this operator. Now, we can use this operator to install other SciDB plugins from GitHub. A good plugin to start with is the <code class="language-plaintext highlighter-rouge">limit</code> plugin (see <a href="https://github.com/Paradigm4/limit">GitHub</a>), also provided by Paradigm4. This is a good minimal plugin which can be used as a skeleton for new plugins and as a guide on how to organize the code. We can install, load, and verify the available operators using:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">install_github</span><span class="p">(</span><span class="s">'paradigm4/limit'</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">}</span> <span class="n">success</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="n">true</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">load_library</span><span class="p">(</span><span class="s">'limit'</span><span class="p">);</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="n">AFL</span><span class="o">%</span> <span class="nb">filter</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="s">'operators'</span><span class="p">),</span> <span class="n">library</span> <span class="o">=</span> <span class="s">'limit'</span><span class="p">);</span>
<span class="p">{</span><span class="n">No</span><span class="p">}</span> <span class="n">name</span><span class="p">,</span><span class="n">library</span>
<span class="p">{</span><span class="mi">24</span><span class="p">}</span> <span class="s">'limit'</span><span class="p">,</span><span class="s">'limit'</span>
</code></pre></div></div>
<p>The steps and queries used in this post are available <a href="https://github.com/rvernica/scidb-examples/tree/master/install-plugin">here</a>, as follows:</p>
<ul>
<li>Generate SciDB code documentation: see <a href="https://github.com/rvernica/scidb-examples/blob/master/install-plugin/Dockerfile#L7">Dockerfile</a></li>
<li>Install <code class="language-plaintext highlighter-rouge">dev_tools</code> plugin: see <a href="https://github.com/rvernica/scidb-examples/blob/master/install-plugin/Dockerfile#L16">Dockerfile</a></li>
<li>Install <code class="language-plaintext highlighter-rouge">limit</code> plugin: see <a href="https://github.com/rvernica/scidb-examples/blob/master/install-plugin/query.afl#L4">query.afl</a></li>
</ul>
<hr />
<p><b id="fn1">1</b> One notable exception is <a href="https://www.postgresql.org/">PostgreSQL</a>, which was also led by <a href="http://amturing.acm.org/award_winners/stonebraker_1172121.cfm">Michael Stonebraker.</a> <a href="#bn1">↩</a></p>Rares VernicaOne of the strengths of SciDB over other database management systems is its extensibility.1 SciDB allows the user to add new data types, functions, and operators. In this multi-part post, we discuss various aspects of extending SciDB. In this post we look at the available documentation and how to set up the development tools.The Power of Loading Data - Part 32016-08-01T00:00:00+00:002016-08-01T00:00:00+00:00http://rvernica.github.io/2016/08/load-data-table<p>In <a href="/2016/05/load-data">part 1</a> and <a href="/2016/06/load-data-non-int">part 2</a> of this multi-part post, we looked at how to load data from multiple files while capturing information present in the file name. In this post, we look at how to load data files organized as tables with a possibly large number of columns and header rows.</p>
<h1 id="simple-table-like-data">Simple Table-like Data</h1>
<p>First, let us take a look at how to load a data file with a relatively small number of columns. Assume, for example, that our data has three columns:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec-1.txt
</span><span class="mi">10</span> <span class="mi">20</span> <span class="mi">30</span>
<span class="mi">12</span> <span class="mi">22</span> <span class="mi">32</span>
<span class="mi">14</span> <span class="mi">24</span> <span class="mi">34</span>
</code></pre></div></div>
<p>We use the <code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> plugin. <a href="/2016/05/load-data">Part 1</a> of this multi-part post starts with a brief intro to this plugin. We assume that the plugin is installed and loaded. See the plugin <a href="https://github.com/Paradigm4/accelerated_io_tools/tree/fd85a44849fe0aba285078f6cd999ab8c57560d7#installation">documentation</a> for installation and loading instructions. Using the <code class="language-plaintext highlighter-rouge">aio_input</code> operator with the <code class="language-plaintext highlighter-rouge">num_attributes</code> parameter (see <a href="https://github.com/Paradigm4/accelerated_io_tools/tree/fd85a44849fe0aba285078f6cd999ab8c57560d7#file-format-settings">documentation</a>) set to <code class="language-plaintext highlighter-rouge">3</code> we can load our example file like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">a1</span><span class="p">,</span><span class="n">a2</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span><span class="p">,</span><span class="s">'20'</span><span class="p">,</span><span class="s">'30'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'12'</span><span class="p">,</span><span class="s">'22'</span><span class="p">,</span><span class="s">'32'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'14'</span><span class="p">,</span><span class="s">'24'</span><span class="p">,</span><span class="s">'34'</span><span class="p">,</span><span class="n">null</span>
</code></pre></div></div>
<p>Notice how the three columns became three attributes, <code class="language-plaintext highlighter-rouge">a0</code>, <code class="language-plaintext highlighter-rouge">a1</code>, and <code class="language-plaintext highlighter-rouge">a2</code>. The result can be converted to an array with one dimension and two numeric attributes using the <code class="language-plaintext highlighter-rouge">apply</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/apply">documentation</a>) and <code class="language-plaintext highlighter-rouge">redimension</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/redimension">documentation</a>) operators:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">),</span>
<span class="n">x</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">y</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a1</span><span class="p">),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a2</span><span class="p">)),</span>
<span class="o"><</span><span class="n">x</span><span class="p">:</span> <span class="n">int64</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="n">int64</span><span class="p">,</span> <span class="n">val</span><span class="p">:</span> <span class="n">int64</span><span class="o">></span><span class="p">[</span><span class="n">tuple_no</span><span class="p">]);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">}</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="mi">10</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">30</span>
<span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="mi">12</span><span class="p">,</span><span class="mi">22</span><span class="p">,</span><span class="mi">32</span>
<span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="mi">14</span><span class="p">,</span><span class="mi">24</span><span class="p">,</span><span class="mi">34</span>
</code></pre></div></div>
<p>If some of the columns represent dimensions rather than attributes, they can be promoted to dimensions by adjusting the template array used for the <code class="language-plaintext highlighter-rouge">redimension</code> operator:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">),</span>
<span class="n">x</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">y</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a1</span><span class="p">),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a2</span><span class="p">)),</span>
<span class="o"><</span><span class="n">val</span><span class="p">:</span> <span class="n">int64</span><span class="o">></span><span class="p">[</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">]);</span>
<span class="p">{</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">tuple_no</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">10</span><span class="p">,</span><span class="mi">20</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="mi">12</span><span class="p">,</span><span class="mi">22</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">32</span>
<span class="p">{</span><span class="mi">14</span><span class="p">,</span><span class="mi">24</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">34</span>
</code></pre></div></div>
<p>Notice that in the template array used for the <code class="language-plaintext highlighter-rouge">redimension</code> operator <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> are no longer attributes, but dimensions.</p>
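<p>The effect of this query can be sketched in plain Python. The snippet below is a simulation of the <code class="language-plaintext highlighter-rouge">apply</code>/<code class="language-plaintext highlighter-rouge">redimension</code> semantics on the same three tuples, not SciDB code; the dictionary-of-coordinates layout is just an illustration device:</p>

```python
# Flat tuples as produced by aio_input: tuple_no -> (a0, a1, a2) strings.
flat = {0: ('10', '20', '30'), 1: ('12', '22', '32'), 2: ('14', '24', '34')}

# apply casts the string attributes to int64; redimension then promotes
# x and y from attributes to dimensions, so each cell is addressed by
# the coordinates {x, y, tuple_no} and carries only the val attribute.
redim = {}
for tuple_no, (a0, a1, a2) in flat.items():
    x, y, val = int(a0), int(a1), int(a2)
    redim[(x, y, tuple_no)] = val

print(sorted(redim.items()))
# [((10, 20, 0), 30), ((12, 22, 1), 32), ((14, 24, 2), 34)]
```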
<h1 id="large-table-like-data">Large Table-like Data</h1>
<p>The queries from the previous section work well for data with a relatively small number of columns, where the columns can be enumerated and manipulated directly. When the data has a large number of columns (e.g., tens or hundreds), enumerating and manipulating the columns directly is not practical. A more practical solution is to add an extra dimension along the columns of the data. The <code class="language-plaintext highlighter-rouge">aio_input</code> operator comes with a <code class="language-plaintext highlighter-rouge">split_on_dimension</code> parameter (see <a href="https://github.com/Paradigm4/accelerated_io_tools/tree/fd85a44849fe0aba285078f6cd999ab8c57560d7#splitting-on-dimension">documentation</a>) which allows us to do exactly that. Let us load the same data file from the previous section:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">,</span> <span class="s">'split_on_dimension=1'</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">,</span><span class="n">attribute_no</span><span class="p">}</span> <span class="n">a</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'20'</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'30'</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'12'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'22'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'32'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'14'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'24'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'34'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
</code></pre></div></div>
<p>Notice how we have a fourth dimension, <code class="language-plaintext highlighter-rouge">attribute_no</code>, which stores the column index for each data element. Assuming our original data is a matrix, we can redimension the result like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">redimension</span><span class="p">(</span>
<span class="n">between</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">,</span> <span class="s">'split_on_dimension=1'</span><span class="p">),</span>
<span class="n">i</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">j</span><span class="p">,</span> <span class="n">attribute_no</span><span class="p">,</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a</span><span class="p">)),</span>
<span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span>
<span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">int64</span><span class="o">></span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">]);</span>
<span class="p">{</span><span class="n">i</span><span class="p">,</span><span class="n">j</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">12</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">22</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">32</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">14</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">24</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">34</span>
</code></pre></div></div>
<p>Besides the <code class="language-plaintext highlighter-rouge">apply</code> and <code class="language-plaintext highlighter-rouge">redimension</code> operators, we also use the <code class="language-plaintext highlighter-rouge">between</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/between">documentation</a>). The <code class="language-plaintext highlighter-rouge">between</code> operator lets us keep only the data positions along the <code class="language-plaintext highlighter-rouge">attribute_no</code> dimension and discard the last position, which stores the data-loading errors.</p>
<p>Notice that, even though the example data is small, we do not enumerate the columns of the data as in the previous section; instead, we only use their count. So, the queries presented in this section scale to use cases where the input data has a large number of columns.</p>
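<p>The combination of <code class="language-plaintext highlighter-rouge">split_on_dimension</code>, <code class="language-plaintext highlighter-rouge">between</code>, and <code class="language-plaintext highlighter-rouge">redimension</code> can be sketched in plain Python. This is a simulation of the query semantics, not SciDB code, and the dictionary layout is only an illustration device:</p>

```python
# Rows of rec-1.txt. With split_on_dimension=1 every value is addressed by
# an attribute_no coordinate, with one extra position for the error attribute.
rows = [['10', '20', '30'], ['12', '22', '32'], ['14', '24', '34']]
num_attributes = 3

split = {}  # (tuple_no, attribute_no) -> string value; error slot holds None
for tuple_no, row in enumerate(rows):
    for attribute_no in range(num_attributes + 1):
        value = row[attribute_no] if attribute_no < num_attributes else None
        split[(tuple_no, attribute_no)] = value

# between keeps attribute_no 0 through 2, dropping the error position;
# apply/redimension then cast the values and build the {i, j} matrix.
matrix = {(tuple_no, attribute_no): int(value)
          for (tuple_no, attribute_no), value in split.items()
          if attribute_no <= num_attributes - 1}
print(matrix[(0, 0)], matrix[(2, 2)])
# 10 34
```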
<h1 id="large-table-like-data-with-header-row">Large Table-like Data with Header Row</h1>
<p>Let us take our example a step further and assume that our data contains a header row with reference values for each column. For example, the header could contain the column position on a continuous dimension:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec-2-1.txt
</span><span class="mf">1.1</span> <span class="mf">1.3</span> <span class="mf">1.5</span>
<span class="mi">10</span> <span class="mi">20</span> <span class="mi">30</span>
<span class="mi">12</span> <span class="mi">22</span> <span class="mi">32</span>
<span class="mi">14</span> <span class="mi">24</span> <span class="mi">34</span>
</code></pre></div></div>
<table>
<thead>
<tr>
<th>1.1</th>
<th>1.3</th>
<th>1.5</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">10</code></td>
<td><code class="language-plaintext highlighter-rouge">20</code></td>
<td><code class="language-plaintext highlighter-rouge">30</code></td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">12</code></td>
<td><code class="language-plaintext highlighter-rouge">22</code></td>
<td><code class="language-plaintext highlighter-rouge">32</code></td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">14</code></td>
<td><code class="language-plaintext highlighter-rouge">24</code></td>
<td><code class="language-plaintext highlighter-rouge">34</code></td>
</tr>
</tbody>
</table>
<p>Moreover, different data files can contain different sets of columns. For example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec-2-2.txt
</span><span class="mf">1.1</span> <span class="mf">1.2</span> <span class="mf">1.3</span>
<span class="mi">16</span> <span class="mi">26</span> <span class="mi">36</span>
<span class="mi">18</span> <span class="mi">28</span> <span class="mi">38</span>
</code></pre></div></div>
<table>
<thead>
<tr>
<th>1.1</th>
<th>1.2</th>
<th>1.3</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="language-plaintext highlighter-rouge">16</code></td>
<td><code class="language-plaintext highlighter-rouge">26</code></td>
<td><code class="language-plaintext highlighter-rouge">36</code></td>
</tr>
<tr>
<td><code class="language-plaintext highlighter-rouge">18</code></td>
<td><code class="language-plaintext highlighter-rouge">28</code></td>
<td><code class="language-plaintext highlighter-rouge">38</code></td>
</tr>
</tbody>
</table>
<p>The goal is to keep track of the header value (the column position) for each data value. So, in SciDB, the two examples would be represented as arrays like this:</p>
<table>
<thead>
<tr>
<th>Row\Col</th>
<th>1</th>
<th>2</th>
<th>3</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1</strong></td>
<td><code class="language-plaintext highlighter-rouge">(1.1, 10)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.3, 20)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.5, 30)</code></td>
</tr>
<tr>
<td><strong>2</strong></td>
<td><code class="language-plaintext highlighter-rouge">(1.1, 12)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.3, 22)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.5, 32)</code></td>
</tr>
<tr>
<td><strong>3</strong></td>
<td><code class="language-plaintext highlighter-rouge">(1.1, 14)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.3, 24)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.5, 34)</code></td>
</tr>
</tbody>
</table>
<p>and:</p>
<table>
<thead>
<tr>
<th>Row\Col</th>
<th>1</th>
<th>2</th>
<th>3</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>1</strong></td>
<td><code class="language-plaintext highlighter-rouge">(1.1, 16)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.2, 26)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.3, 36)</code></td>
</tr>
<tr>
<td><strong>2</strong></td>
<td><code class="language-plaintext highlighter-rouge">(1.1, 18)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.2, 28)</code></td>
<td><code class="language-plaintext highlighter-rouge">(1.3, 38)</code></td>
</tr>
</tbody>
</table>
<p>As such, let us first create the array for storing this data:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">create</span> <span class="n">array</span> <span class="n">rec</span><span class="o"><</span><span class="n">pos</span><span class="p">:</span><span class="nb">float</span><span class="p">,</span> <span class="n">val</span><span class="p">:</span><span class="n">int64</span><span class="o">></span><span class="p">[</span><span class="n">row</span><span class="p">,</span> <span class="n">col</span><span class="p">];</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
</code></pre></div></div>
<p>Moreover, we assume that the number of columns is potentially large and avoid referring to or manipulating columns directly. We start by loading the entire file into a temporary array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-2-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">,</span> <span class="s">'split_on_dimension=1'</span><span class="p">),</span>
<span class="n">rec_file</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">,</span><span class="n">attribute_no</span><span class="p">}</span> <span class="n">a</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'1.1'</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'1.3'</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'1.5'</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'20'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'30'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'12'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'22'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'32'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'14'</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="s">'24'</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="s">'34'</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">}</span> <span class="n">null</span>
</code></pre></div></div>
<p>Next, we extract only the <em>header</em> of the table and store it in another array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">between</span><span class="p">(</span>
<span class="n">rec_file</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">pos</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">a</span><span class="p">),</span>
<span class="n">col</span><span class="p">,</span> <span class="n">attribute_no</span><span class="p">),</span>
<span class="o"><</span><span class="n">pos</span><span class="p">:</span><span class="nb">float</span><span class="o">></span><span class="p">[</span><span class="n">col</span><span class="p">]),</span>
<span class="n">rec_head</span><span class="p">);</span>
<span class="p">{</span><span class="n">col</span><span class="p">}</span> <span class="n">pos</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="mf">1.1</span>
<span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="mf">1.3</span>
<span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="mf">1.5</span>
</code></pre></div></div>
<p>Notice how we use the <code class="language-plaintext highlighter-rouge">between</code> operator to select only the cells where <code class="language-plaintext highlighter-rouge">tuple_no</code> is <code class="language-plaintext highlighter-rouge">0</code> (the header row) and to keep just the first three positions of the <code class="language-plaintext highlighter-rouge">attribute_no</code> dimension (<code class="language-plaintext highlighter-rouge">null</code> through <code class="language-plaintext highlighter-rouge">2</code>), discarding the error attribute. We do the same for the <em>body</em> of the table:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">between</span><span class="p">(</span>
<span class="n">rec_file</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span>
<span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a</span><span class="p">),</span>
<span class="n">row</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">col</span><span class="p">,</span> <span class="n">attribute_no</span><span class="p">),</span>
<span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">int64</span><span class="o">></span><span class="p">[</span><span class="n">row</span><span class="p">,</span> <span class="n">col</span><span class="p">]),</span>
<span class="n">rec_body</span><span class="p">);</span>
<span class="p">{</span><span class="n">row</span><span class="p">,</span> <span class="n">col</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">12</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">22</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">32</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">14</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">24</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">34</span>
</code></pre></div></div>
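<p>The two extraction queries can be summarized in plain Python. This is a simulation of their semantics on the <code class="language-plaintext highlighter-rouge">rec_file</code> cells, not SciDB code:</p>

```python
# rec_file cells: (tuple_no, dst_instance_id, src_instance_id, attribute_no)
# -> string value, with None in the error position (attribute_no == 3).
data = [['1.1', '1.3', '1.5'], ['10', '20', '30'],
        ['12', '22', '32'], ['14', '24', '34']]
rec_file = {}
for tuple_no, row in enumerate(data):
    for attribute_no in range(4):
        value = row[attribute_no] if attribute_no < 3 else None
        rec_file[(tuple_no, 0, 0, attribute_no)] = value

# Header: keep tuple_no == 0 and attribute_no <= 2, cast to float -> col -> pos.
rec_head = {a: float(v) for (t, _, _, a), v in rec_file.items()
            if t == 0 and a <= 2}
# Body: keep tuple_no >= 1 and attribute_no <= 2, cast to int -> (row, col) -> val.
rec_body = {(t, a): int(v) for (t, _, _, a), v in rec_file.items()
            if t >= 1 and a <= 2}
print(rec_head)                            # {0: 1.1, 1: 1.3, 2: 1.5}
print(rec_body[(1, 0)], rec_body[(3, 2)])  # 10 34
```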
<p>Finally, we obtain the desired array with the help of the <code class="language-plaintext highlighter-rouge">cross_join</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/cross_join">documentation</a>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span><span class="n">rec_body</span><span class="p">,</span> <span class="n">rec_head</span><span class="p">,</span> <span class="n">rec_body</span><span class="p">.</span><span class="n">col</span><span class="p">,</span> <span class="n">rec_head</span><span class="p">.</span><span class="n">col</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
<span class="p">{</span><span class="n">row</span><span class="p">,</span><span class="n">col</span><span class="p">}</span> <span class="n">pos</span><span class="p">,</span><span class="n">val</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mf">1.1</span><span class="p">,</span><span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mf">1.3</span><span class="p">,</span><span class="mi">20</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mf">1.5</span><span class="p">,</span><span class="mi">30</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mf">1.1</span><span class="p">,</span><span class="mi">12</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mf">1.3</span><span class="p">,</span><span class="mi">22</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mf">1.5</span><span class="p">,</span><span class="mi">32</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mf">1.1</span><span class="p">,</span><span class="mi">14</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mf">1.3</span><span class="p">,</span><span class="mi">24</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mf">1.5</span><span class="p">,</span><span class="mi">34</span>
</code></pre></div></div>
<p>For optimal performance, we place the larger array first when calling the <code class="language-plaintext highlighter-rouge">cross_join</code> operator. Notice that the columns start at index <code class="language-plaintext highlighter-rouge">0</code> but the rows start at index <code class="language-plaintext highlighter-rouge">1</code>: the row index comes from <code class="language-plaintext highlighter-rouge">tuple_no</code>, and <code class="language-plaintext highlighter-rouge">tuple_no</code> <code class="language-plaintext highlighter-rouge">0</code> was occupied by the header row. Adjusting the columns or the rows so that they start at the same index can be done using the <code class="language-plaintext highlighter-rouge">apply</code> operator. We leave this as an exercise to the reader.</p>
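<p>The semantics of this <code class="language-plaintext highlighter-rouge">cross_join</code> can be sketched in plain Python: every cell of <code class="language-plaintext highlighter-rouge">rec_body</code> is matched with the <code class="language-plaintext highlighter-rouge">rec_head</code> entry that shares its <code class="language-plaintext highlighter-rouge">col</code> coordinate. This is a simulation, not SciDB code:</p>

```python
# rec_head: col -> pos, from the header row.
rec_head = {0: 1.1, 1: 1.3, 2: 1.5}

# rec_body: (row, col) -> val; rows start at 1 because tuple_no 0 held the header.
rec_body = {(row, col): val
            for row, values in enumerate(
                [[10, 20, 30], [12, 22, 32], [14, 24, 34]], start=1)
            for col, val in enumerate(values)}

# cross_join on the shared col dimension: each body cell picks up the header
# position of its column, producing the (pos, val) cells stored in rec.
rec = {(row, col): (rec_head[col], val) for (row, col), val in rec_body.items()}
print(rec[(1, 0)], rec[(3, 2)])
# (1.1, 10) (1.5, 34)
```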
<h1 id="buffer-size-caveat">Buffer Size Caveat</h1>
<p>In the previous queries, we assume that the <code class="language-plaintext highlighter-rouge">tuple_no</code> values returned by the <code class="language-plaintext highlighter-rouge">aio_input</code> operator are unique and dense. Let us look at another example where the data file is slightly bigger:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec-3.txt
</span><span class="mi">10</span> <span class="mi">20</span> <span class="mi">30</span>
<span class="mi">11</span> <span class="mi">21</span> <span class="mi">31</span>
<span class="mi">12</span> <span class="mi">22</span> <span class="mi">32</span>
<span class="mi">13</span> <span class="mi">23</span> <span class="mi">33</span>
<span class="mi">14</span> <span class="mi">24</span> <span class="mi">34</span>
<span class="mi">15</span> <span class="mi">25</span> <span class="mi">35</span>
</code></pre></div></div>
<p>We load this data using the same <code class="language-plaintext highlighter-rouge">aio_input</code> operator, but we specify a small value for the <code class="language-plaintext highlighter-rouge">buffer_size</code> parameter (see <a href="https://github.com/Paradigm4/accelerated_io_tools/tree/fd85a44849fe0aba285078f6cd999ab8c57560d7#tuning-settings">documentation</a>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-3.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">,</span> <span class="s">'buffer_size=10'</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">a1</span><span class="p">,</span><span class="n">a2</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span><span class="p">,</span><span class="s">'20'</span><span class="p">,</span><span class="s">'30'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'11'</span><span class="p">,</span><span class="s">'21'</span><span class="p">,</span><span class="s">'31'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'12'</span><span class="p">,</span><span class="s">'22'</span><span class="p">,</span><span class="s">'32'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">10</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'13'</span><span class="p">,</span><span class="s">'23'</span><span class="p">,</span><span class="s">'33'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">10</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'14'</span><span class="p">,</span><span class="s">'24'</span><span class="p">,</span><span class="s">'34'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">20</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'15'</span><span class="p">,</span><span class="s">'25'</span><span class="p">,</span><span class="s">'35'</span><span class="p">,</span><span class="n">null</span>
</code></pre></div></div>
<p>Let us examine the <code class="language-plaintext highlighter-rouge">tuple_no</code> values. They are neither unique nor dense. To understand what happened, we need to understand how the <code class="language-plaintext highlighter-rouge">buffer_size</code> parameter works. <code class="language-plaintext highlighter-rouge">buffer_size</code> specifies the size (in bytes) of a buffer used to split the input data. Each buffer of data is distributed across the cluster (in a round-robin fashion) and loaded into an array chunk. If the <code class="language-plaintext highlighter-rouge">chunk_size</code> parameter (see <a href="https://github.com/Paradigm4/accelerated_io_tools/tree/fd85a44849fe0aba285078f6cd999ab8c57560d7#tuning-settings">documentation</a>) is not explicitly specified, it is set to the value of the <code class="language-plaintext highlighter-rouge">buffer_size</code>.</p>
<p>In our example, the <code class="language-plaintext highlighter-rouge">buffer_size</code> is set to <code class="language-plaintext highlighter-rouge">10 bytes</code>, the <code class="language-plaintext highlighter-rouge">chunk_size</code> is implicitly set to <code class="language-plaintext highlighter-rouge">10</code>, and we have two SciDB instances in our cluster. As a consequence, the first two lines in the file fill the first buffer and are loaded by the first SciDB instance, in its first array chunk. Notice the <code class="language-plaintext highlighter-rouge">tuple_no</code> set to <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">1</code>, respectively, and the <code class="language-plaintext highlighter-rouge">dst_instance_id</code> set to <code class="language-plaintext highlighter-rouge">0</code>. The third line in the file fills the second buffer and is loaded by the second SciDB instance, in its first array chunk. Next, the fourth line is loaded by the first SciDB instance in its second array chunk. Notice the <code class="language-plaintext highlighter-rouge">tuple_no</code> set to <code class="language-plaintext highlighter-rouge">10</code> (because <code class="language-plaintext highlighter-rouge">chunk_size</code> is <code class="language-plaintext highlighter-rouge">10</code>) and the <code class="language-plaintext highlighter-rouge">dst_instance_id</code> set to <code class="language-plaintext highlighter-rouge">0</code>. The process continues with the fifth line loaded on the second instance, in its second chunk, and the sixth line loaded on the first instance, in its third chunk. The table below shows in which instance and chunk each line ends up:</p>
<table>
<tbody>
<tr>
<td><strong>Chunk\Instance</strong></td>
<td><code class="language-plaintext highlighter-rouge">0</code></td>
<td><code class="language-plaintext highlighter-rouge">1</code></td>
</tr>
<tr>
<td><strong>1st</strong> <code class="language-plaintext highlighter-rouge">{0,...</code></td>
<td><code class="language-plaintext highlighter-rouge">'10',...</code> <br /> <code class="language-plaintext highlighter-rouge">'11',...</code></td>
<td><code class="language-plaintext highlighter-rouge">'12',...</code></td>
</tr>
<tr>
<td><strong>2nd</strong> <code class="language-plaintext highlighter-rouge">{10,...</code></td>
<td><code class="language-plaintext highlighter-rouge">'13',...</code></td>
<td><code class="language-plaintext highlighter-rouge">'14',...</code></td>
</tr>
<tr>
<td><strong>3rd</strong> <code class="language-plaintext highlighter-rouge">{20,...</code></td>
<td><code class="language-plaintext highlighter-rouge">'15',...</code></td>
<td> </td>
</tr>
</tbody>
</table>
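<p>The distribution logic above can be sketched in a few lines of plain Python. This is a hypothetical model reconstructed from the observed output — the function name, the assumption that a line belongs to the buffer in which its first byte falls, and the simplified bookkeeping are ours, not the actual <code class="language-plaintext highlighter-rouge">aio_input</code> implementation:</p>

```python
# Hypothetical sketch (not aio_input's actual code): buffers are fixed-size
# byte windows of the file, a line belongs to the buffer in which it starts,
# buffers are dealt round-robin to instances, and tuple_no encodes
# chunk_number * chunk_size + position-within-chunk.
def simulate_aio(lines, buffer_size=10, instances=2, chunk_size=10):
    cells, offset = [], 0
    rows_in_chunk = {}            # (instance, chunk_no) -> rows placed so far
    next_chunk = [0] * instances  # next chunk number on each instance
    last_buffer, chunk_no, inst = -1, 0, 0
    for line in lines:
        buf = offset // buffer_size  # the buffer in which this line starts
        offset += len(line) + 1      # +1 for the trailing newline
        inst = buf % instances       # round-robin destination instance
        if buf != last_buffer:       # a new buffer opens a new chunk
            chunk_no = next_chunk[inst]
            next_chunk[inst] += 1
            rows_in_chunk[(inst, chunk_no)] = 0
            last_buffer = buf
        pos = rows_in_chunk[(inst, chunk_no)]
        rows_in_chunk[(inst, chunk_no)] += 1
        cells.append((chunk_no * chunk_size + pos, inst, line))
    return cells

rows = ['10 20 30', '11 21 31', '12 22 32', '13 23 33', '14 24 34', '15 25 35']
for tuple_no, inst, line in simulate_aio(rows):
    print('{%d,%d} %s' % (tuple_no, inst, line))
```

Running this reproduces the <code class="language-plaintext highlighter-rouge">{tuple_no,dst_instance_id}</code> pairs from the query output above: <code class="language-plaintext highlighter-rouge">{0,0}</code>, <code class="language-plaintext highlighter-rouge">{1,0}</code>, <code class="language-plaintext highlighter-rouge">{0,1}</code>, <code class="language-plaintext highlighter-rouge">{10,0}</code>, <code class="language-plaintext highlighter-rouge">{10,1}</code>, <code class="language-plaintext highlighter-rouge">{20,0}</code>.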
<p>We use this data-loading logic to generate a unique and dense set of values associated with each row (corresponding to their original order in the input file):
we construct three new attributes from the <code class="language-plaintext highlighter-rouge">tuple_no</code> and <code class="language-plaintext highlighter-rouge">dst_instance_id</code> dimensions and <code class="language-plaintext highlighter-rouge">sort</code> the array (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/sort">documentation</a>) on these new attributes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">sort</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-3.txt'</span><span class="p">,</span> <span class="s">'num_attributes=3'</span><span class="p">,</span> <span class="s">'buffer_size=10'</span><span class="p">),</span>
<span class="n">chunk_no</span><span class="p">,</span> <span class="n">tuple_no</span> <span class="o">/</span> <span class="mi">10</span><span class="p">,</span>
<span class="n">inst</span><span class="p">,</span> <span class="n">dst_instance_id</span><span class="p">,</span>
<span class="n">chunk_idx</span><span class="p">,</span> <span class="n">tuple_no</span> <span class="o">%</span> <span class="mi">10</span><span class="p">),</span>
<span class="n">chunk_no</span><span class="p">,</span> <span class="n">inst</span><span class="p">,</span> <span class="n">chunk_idx</span><span class="p">);</span>
<span class="p">{</span><span class="n">n</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">a1</span><span class="p">,</span><span class="n">a2</span><span class="p">,</span><span class="n">error</span><span class="p">,</span><span class="n">chunk_no</span><span class="p">,</span><span class="n">inst</span><span class="p">,</span><span class="n">chunk_idx</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span><span class="p">,</span><span class="s">'20'</span><span class="p">,</span><span class="s">'30'</span><span class="p">,</span><span class="n">null</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="s">'11'</span><span class="p">,</span><span class="s">'21'</span><span class="p">,</span><span class="s">'31'</span><span class="p">,</span><span class="n">null</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span>
<span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="s">'12'</span><span class="p">,</span><span class="s">'22'</span><span class="p">,</span><span class="s">'32'</span><span class="p">,</span><span class="n">null</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">3</span><span class="p">}</span> <span class="s">'13'</span><span class="p">,</span><span class="s">'23'</span><span class="p">,</span><span class="s">'33'</span><span class="p">,</span><span class="n">null</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">4</span><span class="p">}</span> <span class="s">'14'</span><span class="p">,</span><span class="s">'24'</span><span class="p">,</span><span class="s">'34'</span><span class="p">,</span><span class="n">null</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">5</span><span class="p">}</span> <span class="s">'15'</span><span class="p">,</span><span class="s">'25'</span><span class="p">,</span><span class="s">'35'</span><span class="p">,</span><span class="n">null</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
</code></pre></div></div>
<p>The first new attribute, <code class="language-plaintext highlighter-rouge">chunk_no</code>, keeps track of the chunk number to which a record belongs. <code class="language-plaintext highlighter-rouge">inst</code> keeps track of the instance on which a record was loaded. Finally, <code class="language-plaintext highlighter-rouge">chunk_idx</code> keeps track of the index of a record inside a specific chunk at a specific instance. The value <code class="language-plaintext highlighter-rouge">10</code> represents the implicit <code class="language-plaintext highlighter-rouge">chunk_size</code> used in our example. The result of the <code class="language-plaintext highlighter-rouge">sort</code> operator is a one-dimensional array whose dimension <code class="language-plaintext highlighter-rouge">n</code> stores the unique and dense record indexes we need.</p>
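<p>The sort-key construction can be mirrored in plain Python, which makes it easy to see why the sorted position is dense and matches the original file order. This is a hypothetical illustration of the same logic, not SciDB code; the record tuples are hard-coded from the example output above:</p>

```python
# Derive (chunk_no, inst, chunk_idx) from (tuple_no, dst_instance_id),
# sort on the triple, and the sorted position n is the unique, dense
# index corresponding to each row's position in the input file.
CHUNK_SIZE = 10  # the implicit chunk_size in the example

records = [  # (tuple_no, dst_instance_id, row) as produced by aio_input
    (0, 0, '10 20 30'), (1, 0, '11 21 31'), (0, 1, '12 22 32'),
    (10, 0, '13 23 33'), (10, 1, '14 24 34'), (20, 0, '15 25 35'),
]

def key(rec):
    tuple_no, inst, _ = rec
    return (tuple_no // CHUNK_SIZE,  # chunk_no
            inst,                    # inst
            tuple_no % CHUNK_SIZE)   # chunk_idx

for n, (tuple_no, inst, row) in enumerate(sorted(records, key=key)):
    print('{%d} %s' % (n, row))
```

The loop prints the six rows in their original file order, indexed <code class="language-plaintext highlighter-rouge">0</code> through <code class="language-plaintext highlighter-rouge">5</code>, matching the result of the <code class="language-plaintext highlighter-rouge">sort</code> query.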
<p>The default <code class="language-plaintext highlighter-rouge">buffer_size</code> is <code class="language-plaintext highlighter-rouge">8MB</code> and the default <code class="language-plaintext highlighter-rouge">chunk_size</code> is <code class="language-plaintext highlighter-rouge">10,000,000</code>. Depending on the input data and the values set for these parameters, the <code class="language-plaintext highlighter-rouge">tuple_no</code> dimension might end up containing unique and dense values and could be used directly, but this should be verified.</p>
<p>The input data and the queries are available <a href="https://github.com/rvernica/scidb-examples/tree/master/data-load-table">here</a>.</p>Rares VernicaIn part 1 and part 2 of this multi-part post, we looked at how to load data from multiple files while capturing information present in the file name. In this post, we look at how to load data files organized as tables with a possibly large number of columns and header rows.SciDB Tutorials2016-07-15T00:00:00+00:002016-07-15T00:00:00+00:00http://rvernica.github.io/2016/07/tutorials<p>SciDB has extensive <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Documentation">documentation</a> but there is no official tutorial or getting started guide. In this post, we go over some of the tutorials and getting started materials available online.</p>
<p>Please note that some of the materials listed here are more than two years old and specific functions or commands might have been deprecated and replaced in newer versions of SciDB. One example is the <code class="language-plaintext highlighter-rouge">count</code> operator which was <a href="http://www.paradigm4.com/HTMLmanual/13.12/scidb_ug/pr01s02.html">deprecated</a> in version <code class="language-plaintext highlighter-rouge">13.12</code> and <a href="http://paradigm4.com/HTMLmanual/14.3/scidb_ug/pr01s05.html">removed</a> in version <code class="language-plaintext highlighter-rouge">14.3</code>. Now, <code class="language-plaintext highlighter-rouge">aggregate(..., count(...))</code> is used. We decided to still include old materials for the educational value they provide.</p>
<h1 id="scidb">SciDB</h1>
<ul>
<li>Demo Video: <a href="https://www.youtube.com/watch?v=xogpgiZUlT8">What’s New in SciDB 15.12 </a>, Alex Poliakov, <code class="language-plaintext highlighter-rouge">~8min</code>, April 2016</li>
<li><strong>SciDB Tutorial</strong>, Alex Poliakov and Paul G. Brown, October 2013
<ul>
<li>Slides: <a href="http://forum.paradigm4.com/uploads/db6652/original/1X/2d7e2bdbeb09ea421bb5f79e2996110db1eb2faa.pptx">PPT</a>, <a href="http://forum.paradigm4.com/uploads/db6652/original/1X/2b9abb61ce647ec70abf08c718036073f51d5965.pdf">PDF</a></li>
<li>Video: <a href="https://www.youtube.com/watch?v=SsF_Mke0Mlw">SciDB Tutorial at XLDB 2013</a>, <code class="language-plaintext highlighter-rouge">~3hours</code>. Below are links to different sections of the tutorial within the video:
<ul>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=268">Introductory Demo: MODIS Regridding</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=845">The case for SciDB</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=1096">Configuring and Installing SciDB, SciDB-R and Shim</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=2124">Loading Data (TCGA LAML Methylation)</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=2980">The Multidimensional Array Data Model</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=4545">Redimensioning MODIS Data, Calculating Chunk Sizes</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=5215">Non-integer Dimensions and Synthetic Dimensions on Trade Data</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=5666">Query Structure, Operators</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=5908">Filtering Operators: project, filter, between, slice, subarray</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=6870">Apply and Aggregates: Grand, Grouped, Windows</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=7871">Binary Operators: join, merge, cross, cross_join</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=8660">Scalapack Math: gemm, gesvd</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=8873">Canceling Queries</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=8957">Adding Data to Existing Arrays: store and insert</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=9345">PCA Example: LAML Methylation</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=9906">Revisiting the MODIS Demo</a></li>
<li><a href="https://youtu.be/SsF_Mke0Mlw?t=10040">User-Defined Objects, Overloading R Syntax with SciDB Queries</a></li>
</ul>
</li>
</ul>
</li>
<li>Data Loading Tutorial, Paul G. Brown, March 2015
<ul>
<li>Slides: <a href="https://docs.google.com/presentation/d/1eEU0G7OM7ag58Ulrk-2uyDjr7qsqvWAqBEVXpcVPVZg">Load Binary Data into SciDB</a></li>
<li>Example: <a href="http://forum.paradigm4.com/t/a-non-definitive-guide-to-data-loading-in-scidb/760">A (Non-definitive) Guide to Data Loading in SciDB</a></li>
</ul>
</li>
</ul>
<h1 id="scidb-and-r">SciDB and R</h1>
<p>Paradigm4 provides a SciDB package for R. The package can be found on <a href="https://github.com/Paradigm4/SciDBR">GitHub</a>. Below are some demo and tutorial videos on how to use this package:</p>
<ul>
<li>Demo Video: <a href="https://www.youtube.com/watch?v=aY9koMvo2OU"><em>Interactive Data Exploration with SciDB</em></a>, Alex Poliakov, <code class="language-plaintext highlighter-rouge">~8min</code>, March 2015 <br />
<em>A Shinyapp visualization using 1000 Genomes data.</em></li>
<li>Tutorial Video: <a href="https://www.youtube.com/watch?v=ggFTCD5DiZc"><em>Querying and Visualizing SciDB data with R</em></a>, Bryan Lewis, <code class="language-plaintext highlighter-rouge">~3min</code>, January 2013 <br />
<em>Bryan Lewis shows how to query SciDB data with R data.frame iterators and visualize it with R graphics tools.</em></li>
<li>Tutorial Video: <a href="https://www.youtube.com/watch?v=ak4hFX8hrt4"><em>An Innovative Integration of R and SciDB</em></a>, Bryan Lewis, <code class="language-plaintext highlighter-rouge">~3min</code>, January 2013<br />
<em>Link up the ease-of-use of R with seamlessly integrated data management and a massively scalable math library to build solutions that scale from prototype to production without rewriting any code.</em></li>
</ul>
<h1 id="scidb-and-python">SciDB and Python</h1>
<p>Paradigm4 also provides a SciDB package for Python. The package can be found on <a href="https://github.com/Paradigm4/SciDB-Py">GitHub</a> as well. Below are some tutorial videos on how to use this package:</p>
<ul>
<li>Tutorial Video: <a href="https://www.youtube.com/watch?v=qIHibmjhrHU"><em>Timeseries Data in SciDB</em></a>, Bryan Lewis and Jake Vanderplas, <code class="language-plaintext highlighter-rouge">~28min</code>, February 2015 <br />
<em>We’ll demonstrate working with timeseries data in SciDB and present basic examples that illustrate SciDB’s native analytics capabilities including aggregation and data decimation, regression and generalized linear models, covariance matrices, singular value decomposition, and extending SciDB with custom operations. The examples apply to a broad range of applications including quantitative finance, econometrics, and risk and credit analysis.</em></li>
<li>Tutorial Video: <a href="https://www.youtube.com/watch?v=qwnS6t5ekUY"><em>Using SciDB from Python</em></a>, Bryan Lewis, Travis Oliphant, and Jake Vanderplas, <code class="language-plaintext highlighter-rouge">~56min</code>, July 2013</li>
</ul>
<p>Various other demo and overview videos are available on the Paradigm4 <a href="https://www.youtube.com/user/paradigm4inc/">YouTube channel</a>.</p>Rares VernicaKeep an Eye on the Chunk Length2016-07-01T00:00:00+00:002016-07-01T00:00:00+00:00http://rvernica.github.io/2016/07/chunk-slice<p>To define a SciDB array we need to specify its dimensions. For each dimension, we need to specify its name, low value, high value, chunk length and chunk overlap (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Array+Dimensions">documentation</a>). The <em>chunk</em> parameters are somewhat internal to SciDB and affect its performance. In this post, we look at a simple example where being careless about the chunk length gets us in trouble very fast.</p>
<p>When we are starting out with SciDB, we might ignore the <em>chunk length</em> parameter when declaring array dimensions. We can use the default values or specify some large values for the chunk length. For example:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">create</span> <span class="n">array</span> <span class="n">foo</span><span class="o"><</span><span class="n">x</span><span class="p">:</span><span class="n">int64</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">show</span><span class="p">(</span><span class="n">foo</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">}</span> <span class="n">schema</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'foo<x:int64> [i=0:*,1000000,0]'</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">create</span> <span class="n">array</span> <span class="n">bar</span><span class="o"><</span><span class="n">x</span><span class="p">:</span><span class="n">int64</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="n">j</span><span class="p">];</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">show</span><span class="p">(</span><span class="n">bar</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">}</span> <span class="n">schema</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'bar<x:int64> [i=0:*,1000,0,j=0:*,1000,0]'</span>
</code></pre></div></div>
<p>As we can see, the default chunk length is <code class="language-plaintext highlighter-rouge">1,000,000</code> logical cells per chunk, split across dimensions: <code class="language-plaintext highlighter-rouge">1,000,000</code> for a one-dimensional array, <code class="language-plaintext highlighter-rouge">1,000</code> per dimension for a two-dimensional array, etc. This is probably not a big problem for most operators.</p>
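<p>Assuming the split follows the pattern the one- and two-dimensional defaults above suggest — a fixed budget of 1,000,000 logical cells per chunk, divided evenly, so each dimension gets roughly the <em>d</em>-th root — the per-dimension default can be computed as below. The function and the extrapolation to three or more dimensions are our back-of-the-envelope sketch, not documented SciDB behavior:</p>

```python
# Hypothetical sketch: per-dimension default chunk length if a budget of
# 1,000,000 logical cells per chunk is split evenly across d dimensions.
def default_chunk_length(num_dimensions, cells_per_chunk=1_000_000):
    return round(cells_per_chunk ** (1 / num_dimensions))

print(default_chunk_length(1))  # 1000000
print(default_chunk_length(2))  # 1000
print(default_chunk_length(3))  # 100  (extrapolated)
```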
<h1 id="a-few-joins">A Few Joins</h1>
<p>Let’s add two records into the <code class="language-plaintext highlighter-rouge">foo</code> array and cross-join it twice. We store the result in a new array, <code class="language-plaintext highlighter-rouge">taz</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">x</span><span class="p">:</span><span class="n">int64</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">1</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="n">i</span><span class="p">),</span>
<span class="n">foo</span><span class="p">),</span>
<span class="n">foo</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">}</span> <span class="n">x</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="mi">0</span>
<span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="mi">1</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span><span class="n">foo</span><span class="p">,</span> <span class="n">foo</span><span class="p">),</span>
<span class="n">foo</span><span class="p">),</span>
<span class="n">taz</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">,</span><span class="n">i_2</span><span class="p">,</span><span class="n">i_3</span><span class="p">}</span> <span class="n">x</span><span class="p">,</span><span class="n">x_2</span><span class="p">,</span><span class="n">x_3</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span>
</code></pre></div></div>
<h1 id="chocking-the-slice-operator">Choking the <code class="language-plaintext highlighter-rouge">slice</code> Operator</h1>
<p>Now, if we try to <code class="language-plaintext highlighter-rouge">slice</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/slice">documentation</a>) the <code class="language-plaintext highlighter-rouge">taz</code> array, fixing one dimension at a given value, we get into trouble:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="nb">slice</span><span class="p">(</span><span class="n">taz</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span> <span class="o">--</span> <span class="n">takes</span> <span class="n">a</span> <span class="n">very</span> <span class="nb">long</span> <span class="n">time</span> <span class="n">to</span> <span class="n">finish</span>
</code></pre></div></div>
<p>On our SciDB instance, the query did not complete after running for a few hours with <code class="language-plaintext highlighter-rouge">100%</code> CPU usage. We had to restart the database to stop it. We assume the query would have eventually finished.</p>
<p>The <code class="language-plaintext highlighter-rouge">taz</code> array has only <code class="language-plaintext highlighter-rouge">8</code> records. The problem is not the number of records in the array, but the chunk lengths. The original <code class="language-plaintext highlighter-rouge">foo</code> array has one dimension with chunk length <code class="language-plaintext highlighter-rouge">1,000,000</code>. The <code class="language-plaintext highlighter-rouge">taz</code> array has three dimensions, each with chunk length <code class="language-plaintext highlighter-rouge">1,000,000</code>. The <code class="language-plaintext highlighter-rouge">slice</code> operator might try to allocate memory to hold a two-dimensional array (since we slice on one of the dimensions) with chunk length <code class="language-plaintext highlighter-rouge">1,000,000</code> in each dimension. This is probably too large, and a lot of memory swapping might take place. All of this happens for just <code class="language-plaintext highlighter-rouge">8</code> records. Here is the schema for the <code class="language-plaintext highlighter-rouge">taz</code> array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">show</span><span class="p">(</span><span class="n">taz</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">}</span> <span class="n">schema</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'taz<x:int64,x_2:int64,x_3:int64> [i=0:*,1000000,0,i_2=0:*,1000000,0,i_3=0:*,1000000,0]'</span>
</code></pre></div></div>
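<p>A quick back-of-the-envelope calculation shows the scale of the problem. The arithmetic below is our hypothetical cost model for why the query stalls — SciDB does not document exactly what <code class="language-plaintext highlighter-rouge">slice</code> allocates:</p>

```python
# Each chunk of taz spans 1,000,000 logical cell positions per dimension.
# Slicing away one dimension leaves a 2-D chunk footprint of
# 1,000,000 x 1,000,000 logical positions.
chunk_length = 1_000_000
positions_2d = chunk_length ** 2  # logical positions per 2-D result chunk

print(f'{positions_2d:,} logical positions')  # 1,000,000,000,000 logical positions
# If slice walks (or allocates bookkeeping for) even a fraction of these
# positions, the cost dwarfs the 8 actual records in the array.
```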
<p>So, starting from the default chunk length and a few joins, the <code class="language-plaintext highlighter-rouge">slice</code> operator can get us in trouble really fast, even if we only have a handful of records in the array. We recommend keeping an eye on the chunk length and its multiplicative effect across dimensions.</p>
<h1 id="alternatives-to-the-slice-operator">Alternatives to the <code class="language-plaintext highlighter-rouge">slice</code> Operator</h1>
<p>If large chunk lengths across dimensions cannot be avoided, we recommend using <code class="language-plaintext highlighter-rouge">between</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/between">documentation</a>) and <code class="language-plaintext highlighter-rouge">redimension</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/redimension">documentation</a>) instead of <code class="language-plaintext highlighter-rouge">slice</code>. The same slicing operation we tried before can be achieved with:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">redimension</span><span class="p">(</span>
<span class="n">between</span><span class="p">(</span>
<span class="n">taz</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span> <span class="n">null</span><span class="p">,</span> <span class="n">null</span><span class="p">),</span>
<span class="o"><</span><span class="n">x</span><span class="p">:</span><span class="n">int64</span><span class="p">,</span><span class="n">x_2</span><span class="p">:</span><span class="n">int64</span><span class="p">,</span><span class="n">x_3</span><span class="p">:</span><span class="n">int64</span><span class="o">></span> <span class="p">[</span><span class="n">i_2</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="o">*</span><span class="p">,</span><span class="mi">1000000</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="n">i_3</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="o">*</span><span class="p">,</span><span class="mi">1000000</span><span class="p">,</span><span class="mi">0</span><span class="p">]);</span>
<span class="p">{</span><span class="n">i_2</span><span class="p">,</span><span class="n">i_3</span><span class="p">}</span> <span class="n">x</span><span class="p">,</span><span class="n">x_2</span><span class="p">,</span><span class="n">x_3</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span>
</code></pre></div></div>
<p>These example queries are available <a href="https://github.com/rvernica/scidb-examples/tree/master/chunk-slice">here</a>.</p>Rares VernicaTo define a SciDB array we need to specify its dimensions. For each dimension, we need to specify its name, low value, high value, chunk length and chunk overlap (see documentation). The chunk parameters are somehow internal to SciDB and affect its performance. In this post, we look at a simple example where being careless about the chunk length gets us in trouble very fast.Unleashing SciDB in a Docker Container2016-06-15T00:00:00+00:002016-06-15T00:00:00+00:00http://rvernica.github.io/2016/06/docker-image<p><a href="https://www.docker.com/">Docker</a> containers simplify the development and deployment of software through isolation. Containers are especially useful when the targeted software has a complicated installation procedure. This is the case for SciDB. SciDB needs to be compiled from source and requires a multitude of libraries and development tools to be installed. Moreover, SciDB is pretty <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Community+Edition+Installation+Guide#SciDBCommunityEditionInstallationGuide-Requirements">specific</a> on the type of operating system it supports. In this post, we look into how to create a Docker <a href="https://docs.docker.com/engine/reference/glossary/#/image"><em>image</em></a> for SciDB. The image can be used to launch Docker <a href="https://docs.docker.com/engine/reference/glossary/#/container"><em>containers</em></a>. The containers run isolated from the host operating system and on a multitude of operating systems, including <a href="https://docs.docker.com/docker-for-windows/">Windows</a>. We assume the reader has some familiarity with Docker and focus on SciDB particularities.</p>
<p>Note: The Docker image described in this post is for SciDB <code class="language-plaintext highlighter-rouge">15.12</code> and for a single-node installation.</p>
<h1 id="dockerfiles">Dockerfiles</h1>
<p>The easiest way to build a Docker image is to start from an existing Docker image, instantiate a container with it, make changes to the container, and commit the updated container as a new image. Although easy, this method is rarely used because it is neither portable nor reproducible.</p>
<p>The preferred way of building Docker images is to create a <a href="https://docs.docker.com/engine/reference/builder/"><em>Dockerfile</em></a>. Dockerfiles are sequences of instructions that can be executed by the Docker <em>builder</em> to create a new image starting from a pre-existing image. In this post, we create a Dockerfile to build our SciDB Docker image.</p>
<h1 id="getting-started">Getting Started</h1>
<p>To build a Docker image for SciDB, we follow very closely the official <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Community+Edition+Installation+Guide">SciDB Community Edition Installation Guide</a>. Most of the steps present in the installation guide are reflected in our SciDB Dockerfile. The first section in this guide is the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Community+Edition+Installation+Guide#SciDBCommunityEditionInstallationGuide-Requirements">Requirements</a> section. From the recommended operating systems, we chose the <a href="http://www.ubuntu.com/">Ubuntu</a> Linux distribution, version <code class="language-plaintext highlighter-rouge">14.04</code>. We use the official Docker image for Ubuntu available on <a href="https://hub.docker.com/_/ubuntu/">Docker Hub</a>. To start off, our Dockerfile looks like this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Requirements</span>
<span class="c">## ---</span>
FROM ubuntu:14.04
RUN apt-get update
RUN apt-get <span class="nb">install</span> <span class="nt">-y</span> wget apt-transport-https software-properties-common
</code></pre></div></div>
<p>Lines starting with <code class="language-plaintext highlighter-rouge">#</code> denote comments and are ignored. The <code class="language-plaintext highlighter-rouge">FROM</code> statement (see <a href="https://docs.docker.com/engine/reference/builder/#/from">Docker documentation</a>) indicates the base image for the Dockerfile, in this case, <code class="language-plaintext highlighter-rouge">ubuntu:14.04</code>. Next, we fetch the Ubuntu package index and install a few packages using the <code class="language-plaintext highlighter-rouge">RUN</code> statement (see <a href="https://docs.docker.com/engine/reference/builder/#/run">Docker documentation</a>). These packages are not mentioned in the Installation Guide but are needed for a successful installation.</p>
<p>To build a Docker image we need to place the lines above into a file called <code class="language-plaintext highlighter-rouge">Dockerfile</code> and run the <code class="language-plaintext highlighter-rouge">docker build</code> command (see <a href="https://docs.docker.com/engine/reference/commandline/build/">Docker documentation</a>):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> <span class="o">></span> Dockerfile
FROM ubuntu:14.04
RUN apt-get update
RUN apt-get <span class="nb">install</span> <span class="nt">-y</span> wget apt-transport-https software-properties-common
^D
<span class="nv">$ </span>docker build <span class="nt">--tag</span> scidb <span class="nb">.</span>
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM ubuntu:14.04
<span class="nt">---</span><span class="o">></span> 38c759202e30
Step 2 : RUN apt-get update
<span class="nt">---</span><span class="o">></span> Running <span class="k">in </span>f4fb4aa2958e
...
<span class="nt">---</span><span class="o">></span> ad6c2dea1b62
Removing intermediate container f4fb4aa2958e
Step 3 : RUN apt-get <span class="nb">install</span> <span class="nt">-y</span> wget apt-transport-https software-properties-common
<span class="nt">---</span><span class="o">></span> Running <span class="k">in </span>3822490df93f
...
<span class="nt">---</span><span class="o">></span> e521dbf5055a
Removing intermediate container 3822490df93f
Successfully built e521dbf5055a
</code></pre></div></div>
<p>Notice how the build process has three steps, one for each of the statements in the Dockerfile. After each step, Docker generates and stores an intermediary image. In some steps (e.g., steps 2 and 3 above) Docker creates and uses an intermediary container which is later removed. As we update the Dockerfile, we can re-run the build process to see the effects of our changes. Running the build process multiple times does <em>not</em> re-execute a step if neither that step nor any of the steps before it has changed. This is a benefit of the Docker <a href="https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/#/build-cache">caching mechanism</a>.</p>
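<p>The cache also interacts with step granularity. A minimal sketch (a hypothetical fragment, not part of our SciDB Dockerfile):</p>

```dockerfile
## Hypothetical example -- not part of the SciDB Dockerfile.
## Written as two steps, editing only the install line re-uses the cached
## `apt-get update` layer, so the package index may be stale:
# RUN apt-get update
# RUN apt-get install -y wget

## Combined into one step, any change to the package list re-runs the
## update as well, at the cost of one larger, coarser cache entry:
RUN apt-get update && apt-get install -y wget
```

<p>This trade-off is one reason Dockerfile authors often group related commands into a single <code class="language-plaintext highlighter-rouge">RUN</code> statement.</p>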
<p>Next, we define a few environment variables using the <code class="language-plaintext highlighter-rouge">ARG</code> statement (see <a href="https://docs.docker.com/engine/reference/builder/#/arg">Docker documentation</a>), and create a <code class="language-plaintext highlighter-rouge">scidb</code> user, as instructed in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/SciDB+Community+Edition+Installation+Guide#SciDBCommunityEditionInstallationGuide-InstallationNotes">Installation Notes</a>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Installation Notes</span>
<span class="c">## ---</span>
ARG <span class="nv">host_ip</span><span class="o">=</span>127.0.0.1
ARG <span class="nv">net_mask</span><span class="o">=</span><span class="nv">$host_ip</span>/8
ARG <span class="nv">scidb_usr</span><span class="o">=</span>scidb
ARG <span class="nv">dev_dir</span><span class="o">=</span>/usr/src
RUN groupadd <span class="nv">$scidb_usr</span>
RUN useradd <span class="nv">$scidb_usr</span> <span class="nt">-s</span> /bin/bash <span class="nt">-m</span> <span class="nt">-g</span> <span class="nv">$scidb_usr</span>
</code></pre></div></div>
<h1 id="pre-installation-tasks">Pre-Installation Tasks</h1>
<p>Now we address the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Pre-Installation+Tasks">Pre-Installation Tasks</a> required for building and installing SciDB. First, we download and extract the SciDB source code:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Download SciDB Community Edition</span>
<span class="c">## ---</span>
WORKDIR <span class="nv">$dev_dir</span>
ARG <span class="nv">scidb_url</span><span class="o">=</span><span class="s2">"https://docs.google.com/uc?id=0B7yt0n33Us0raWtCYmNlZWRxWG8&export=download"</span>
RUN wget <span class="nt">--no-verbose</span> <span class="nt">--output-document</span> scidb-15.12.1.4cadab5.tar.gz <span class="se">\</span>
<span class="nt">--load-cookies</span> cookies.txt <span class="se">\</span>
<span class="s2">"</span><span class="nv">$scidb_url</span><span class="s2">&</span><span class="sb">`</span>wget <span class="nt">--no-verbose</span> <span class="nt">--output-document</span> - <span class="se">\</span>
<span class="nt">--save-cookies</span> cookies.txt <span class="s2">"</span><span class="nv">$scidb_url</span><span class="s2">"</span> | <span class="se">\</span>
<span class="nb">grep</span> <span class="nt">--only-matching</span> <span class="s1">'confirm=[^&]*'</span><span class="sb">`</span><span class="s2">"</span>
RUN <span class="nb">tar</span> <span class="nt">-xzf</span> scidb-15.12.1.4cadab5.tar.gz
RUN <span class="nb">mv </span>scidb-15.12.1.4cadab5 scidbtrunk
WORKDIR <span class="nv">$dev_dir</span>/scidbtrunk
<span class="c">## Installing Expect, and SSH Packages</span>
<span class="c">## --</span>
RUN apt-get <span class="nb">install</span> <span class="nt">-y</span> expect openssh-server openssh-client
</code></pre></div></div>
<p>The official SciDB source code location is on <a href="https://drive.google.com/folderview?id=0B7yt0n33Us0rT1FJdmxFV2g0OHc&usp=drive_web#list">Google Drive</a>. In order to download a file from Google Drive we have to make two requests. The first request is to obtain some cookies and a confirmation code, which are used in the second request. The <code class="language-plaintext highlighter-rouge">WORKDIR</code> statements (see <a href="https://docs.docker.com/engine/reference/builder/#/workdir">Docker documentation</a>) are used to set the current directory, initially <code class="language-plaintext highlighter-rouge">/usr/src</code> and later <code class="language-plaintext highlighter-rouge">/usr/src/scidbtrunk</code>. We also install a few more packages, as instructed in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Pre-Installation+Tasks#Pre-InstallationTasks-InstallingExpect,andSSHPackages">Installing Expect, and SSH Packages</a> section of the installation guide.</p>
<h2 id="password-less-ssh">Password-less SSH</h2>
<p>We set up and test password-less SSH as instructed in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Pre-Installation+Tasks#Pre-InstallationTasks-InstallingExpect,andSSHPackages">Providing Passwordless SSH</a> section:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Providing Passwordless SSH</span>
<span class="c">## ---</span>
RUN ssh-keygen <span class="nt">-f</span> /root/.ssh/id_rsa <span class="nt">-N</span> <span class="s1">''</span>
RUN <span class="nb">chmod </span>755 /root
RUN <span class="nb">chmod </span>755 /root/.ssh
RUN <span class="nb">mkdir</span> /home/<span class="nv">$scidb_usr</span>/.ssh
RUN ssh-keygen <span class="nt">-f</span> /home/<span class="nv">$scidb_usr</span>/.ssh/id_rsa <span class="nt">-N</span> <span class="s1">''</span>
RUN <span class="nb">chmod </span>755 /home/<span class="nv">$scidb_usr</span>
RUN <span class="nb">chmod </span>755 /home/<span class="nv">$scidb_usr</span>/.ssh
<span class="c">## Avoid setting password and providing it to "deploy.sh access"</span>
RUN <span class="nb">cat</span> /root/.ssh/id_rsa.pub <span class="o">>></span> /root/.ssh/authorized_keys
RUN <span class="nb">cat</span> /root/.ssh/id_rsa.pub <span class="o">>></span> /home/<span class="nv">$scidb_usr</span>/.ssh/authorized_keys
<span class="c">## Set correct ownership</span>
RUN <span class="nb">chown</span> <span class="nt">-R</span> <span class="nv">$scidb_usr</span>:<span class="nv">$scidb_usr</span> /home/<span class="nv">$scidb_usr</span>
RUN service ssh start <span class="o">&&</span> <span class="se">\</span>
./deployment/deploy.sh access root NA <span class="s2">""</span> <span class="nv">$host_ip</span> <span class="o">&&</span> <span class="se">\</span>
./deployment/deploy.sh access <span class="nv">$scidb_usr</span> NA <span class="s2">""</span> <span class="nv">$host_ip</span> <span class="o">&&</span> <span class="se">\</span>
ssh <span class="nv">$host_ip</span> <span class="nb">date</span>
</code></pre></div></div>
<p>This set of steps is a bit more convoluted, so let’s go over it step by step. We first generate SSH keys for both the <code class="language-plaintext highlighter-rouge">root</code> and the <code class="language-plaintext highlighter-rouge">scidb</code> accounts. Next, we authorize the public key of the <code class="language-plaintext highlighter-rouge">root</code> account on both accounts. This allows us to run the subsequent <code class="language-plaintext highlighter-rouge">deploy.sh</code> script without providing the passwords for the <code class="language-plaintext highlighter-rouge">root</code> and <code class="language-plaintext highlighter-rouge">scidb</code> accounts. In fact, part of what the <code class="language-plaintext highlighter-rouge">deploy.sh</code> script does is authorize these keys.</p>
<p>Next, the installation guide instructs us to start the SSH server. Since Docker uses containers to build images, starting a server in one container has no effect on subsequent containers. Any running servers are killed when the container is saved as an image. Instead, we start any required servers in the exact container where they are needed. We do this using a <code class="language-plaintext highlighter-rouge">RUN</code> statement with multiple commands. We first start the SSH server and then run the <code class="language-plaintext highlighter-rouge">deploy.sh</code> scripts.</p>
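<p>The pattern can be sketched outside Docker with plain shell functions standing in for the real commands (the function names below are ours, purely illustrative):</p>

```shell
# Hypothetical stand-ins for "service ssh start" and "./deployment/deploy.sh":
start_service() { echo "ssh started"; }
run_deploy()    { echo "deploy.sh ran"; }

# Both sides of && run in the same shell -- and, in a Dockerfile RUN step,
# in the same container -- so run_deploy can rely on the server started by
# start_service; it is skipped entirely if the start fails.
start_service && run_deploy
```

<p>This is exactly why the Dockerfile chains <code class="language-plaintext highlighter-rouge">service ssh start</code> and the script invocation with <code class="language-plaintext highlighter-rouge">&amp;&amp;</code> inside one <code class="language-plaintext highlighter-rouge">RUN</code>.</p>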
<h2 id="build-tools-and-postgresql">Build Tools and PostgreSQL</h2>
<p>Next, we install the SciDB build tools using the instructions in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Pre-Installation+Tasks#Pre-InstallationTasks-InstallingBuildTools">Installing Build Tools</a> section of the guide:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Installing Build Tools</span>
<span class="c">## ---</span>
RUN service ssh start <span class="o">&&</span> <span class="se">\</span>
./deployment/deploy.sh prepare_toolchain <span class="nv">$host_ip</span>
</code></pre></div></div>
<p>The installation is done by the <code class="language-plaintext highlighter-rouge">deploy.sh</code> script using a remote shell. So, in order for the script to work, we need to start the SSH server again in the same container where the script runs.</p>
<p>The final step in the pre-installation section is to install and configure the PostgreSQL database software. SciDB uses PostgreSQL to store its catalog. We use the <code class="language-plaintext highlighter-rouge">deploy.sh</code> script as instructed in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Pre-Installation+Tasks#Pre-InstallationTasks-InstallingPostgres">Installing Postgres</a> section:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Installing Postgres</span>
<span class="c">## ---</span>
RUN service ssh start <span class="o">&&</span> <span class="se">\</span>
./deployment/deploy.sh prepare_postgresql postgres postgres <span class="nv">$net_mask</span> <span class="nv">$host_ip</span>
<span class="c">## Providing the postgres user Access to SciDB Code</span>
RUN usermod <span class="nt">-G</span> <span class="nv">$scidb_usr</span> <span class="nt">-a</span> postgres
RUN <span class="nb">chmod </span>g+rx <span class="nv">$dev_dir</span>
RUN /usr/bin/sudo <span class="nt">-u</span> postgres <span class="nb">ls</span> <span class="nv">$dev_dir</span>
</code></pre></div></div>
<p>We also make sure that the <code class="language-plaintext highlighter-rouge">postgres</code> user belongs to the same group as the <code class="language-plaintext highlighter-rouge">scidb</code> user and has access to the SciDB installation location.</p>
<h1 id="building-and-installing-scidb">Building and Installing SciDB</h1>
<p>We are ready to build and install SciDB as advised in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Installing+SciDB+Community+Edition">Installing SciDB Community Edition</a> section of the SciDB installation guide. First, we configure the environment as in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Installing+SciDB+Community+Edition#InstallingSciDBCommunityEdition-ConfiguringEnvironmentVariables">Configuring Environment Variables</a> section:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Configuring Environment Variables</span>
<span class="c">## ---</span>
ENV <span class="nv">SCIDB_VER</span><span class="o">=</span>15.12
ENV <span class="nv">SCIDB_INSTALL_PATH</span><span class="o">=</span><span class="nv">$dev_dir</span>/scidbtrunk/stage/install
ENV <span class="nv">SCIDB_BUILD_TYPE</span><span class="o">=</span>Debug
ENV <span class="nv">PATH</span><span class="o">=</span><span class="nv">$SCIDB_INSTALL_PATH</span>/bin:<span class="nv">$PATH</span>
RUN <span class="nb">echo</span> <span class="s2">"</span><span class="se">\</span><span class="s2">
export SCIDB_VER=</span><span class="nv">$SCIDB_VER</span><span class="se">\n\</span><span class="s2">
export SCIDB_INSTALL_PATH=</span><span class="nv">$SCIDB_INSTALL_PATH</span><span class="se">\n\</span><span class="s2">
export SCIDB_BUILD_TYPE=</span><span class="nv">$SCIDB_BUILD_TYPE</span><span class="se">\n\</span><span class="s2">
export PATH=</span><span class="nv">$PATH</span><span class="se">\n</span><span class="s2">"</span> | <span class="nb">tee</span> /root/.bashrc <span class="o">></span> /home/<span class="nv">$scidb_usr</span>/.bashrc
<span class="c">### Activating and Verifying the New .bashrc File</span>
RUN <span class="nb">echo</span> <span class="nv">$SCIDB_VER</span>
RUN <span class="nb">echo</span> <span class="nv">$SCIDB_INSTALL_PATH</span>
RUN <span class="nb">echo</span> <span class="nv">$PATH</span>
</code></pre></div></div>
<p>Note that we set the environment variables both for the Docker build process, using the <code class="language-plaintext highlighter-rouge">ENV</code> statement (see <a href="https://docs.docker.com/engine/reference/builder/#/env">Docker documentation</a>) and for the login shell, using <code class="language-plaintext highlighter-rouge">export</code> and <code class="language-plaintext highlighter-rouge">.bashrc</code>.</p>
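<p>The difference matters. A hypothetical fragment (not from our Dockerfile) shows that an <code class="language-plaintext highlighter-rouge">ENV</code> value persists into every later build step and into running containers, while a shell <code class="language-plaintext highlighter-rouge">export</code> inside a <code class="language-plaintext highlighter-rouge">RUN</code> step dies with that step's intermediate container:</p>

```dockerfile
## Hypothetical example -- not part of the SciDB Dockerfile.
ENV PERSISTED=yes
RUN export TRANSIENT=yes
RUN echo "PERSISTED=$PERSISTED"   # prints PERSISTED=yes
RUN echo "TRANSIENT=$TRANSIENT"   # prints TRANSIENT= (the export is gone)
```

<p>The <code class="language-plaintext highlighter-rouge">.bashrc</code> copies are still needed because shells opened over SSH are fresh login shells spawned by the SSH server and do not inherit the container's <code class="language-plaintext highlighter-rouge">ENV</code> values.</p>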
<p>In order to build SciDB, we use the <code class="language-plaintext highlighter-rouge">run.py</code> script, as described in the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Installing+SciDB+Community+Edition#InstallingSciDBCommunityEdition-BuildingSciDBCE">Building SciDB CE</a> section. Building requires a <code class="language-plaintext highlighter-rouge">setup</code> step and a <code class="language-plaintext highlighter-rouge">make</code> step as follows:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Building SciDB CE</span>
<span class="c">## ---</span>
RUN ./run.py setup <span class="nt">--force</span>
RUN ./run.py make <span class="nt">-j4</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">make</code> step might take between 30 minutes and 1 hour.</p>
<p>To install SciDB, we again use the <code class="language-plaintext highlighter-rouge">run.py</code> script, but we need to start both the SSH and the PostgreSQL servers before running the script:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">## Installing SciDB CE</span>
<span class="c">## ---</span>
RUN service ssh start <span class="o">&&</span> <span class="se">\</span>
service postgresql start <span class="o">&&</span> <span class="se">\</span>
<span class="nb">echo</span> <span class="s2">"</span><span class="se">\n\n</span><span class="s2">y"</span> | ./run.py <span class="nb">install</span> <span class="nt">--force</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">install</code> step is intended to be run interactively and prompts the user to answer a few questions. Since Docker does not support interactive build steps, we provide input for the <code class="language-plaintext highlighter-rouge">install</code> step using <code class="language-plaintext highlighter-rouge">echo</code>.</p>
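<p>The trick can be sketched with a small stand-in script (hypothetical, just to show the mechanics; the real prompts asked by <code class="language-plaintext highlighter-rouge">run.py install</code> differ):</p>

```shell
# A hypothetical prompt-driven script standing in for "./run.py install":
cat > /tmp/prompted.sh <<'EOF'
#!/bin/bash
read -p "Config file? " a
read -p "Data directory? " b
read -p "Proceed? [y/n] " c
echo "a='$a' b='$b' c='$c'"
EOF
chmod +x /tmp/prompted.sh

# One input line per prompt, in order: two empty lines accept the defaults
# and the final "y" confirms -- the same shape as the echoed "\n\ny" above.
printf '\n\ny\n' | /tmp/prompted.sh   # prints: a='' b='' c='y'
```

<p>Because stdin is a pipe rather than a terminal, the prompts themselves are not printed, and each <code class="language-plaintext highlighter-rouge">read</code> simply consumes the next line of input.</p>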
<h1 id="starting-and-stopping-scidb">Starting and Stopping SciDB</h1>
<p>The image we built so far has everything needed to use SciDB. To make our image more user-friendly we add a script to be executed when a container is instantiated. The script follows the <a href="https://paradigm4.atlassian.net/wiki/display/ESD/Starting+and+Stopping+SciDB">Starting and Stopping SciDB</a> instructions from the installation guide:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>RUN <span class="nb">echo</span> <span class="s2">"#!/bin/bash</span><span class="se">\n\</span><span class="s2">
service ssh start</span><span class="se">\n\</span><span class="s2">
service postgresql start</span><span class="se">\n\</span><span class="s2">
scidb.py startall mydb</span><span class="se">\n\</span><span class="s2">
trap </span><span class="se">\"</span><span class="s2">scidb.py stopall mydb; service postgresql stop</span><span class="se">\"</span><span class="s2"> EXIT HUP INT QUIT TERM</span><span class="se">\n\</span><span class="s2">
bash"</span> <span class="o">></span> /docker-entrypoint.sh
RUN <span class="nb">chmod</span> +x /docker-entrypoint.sh
<span class="c">## Starting SciDB</span>
<span class="c">## ---</span>
ENTRYPOINT <span class="o">[</span><span class="s2">"/docker-entrypoint.sh"</span><span class="o">]</span>
</code></pre></div></div>
<p>The script is created with <code class="language-plaintext highlighter-rouge">echo</code> and saved in the <code class="language-plaintext highlighter-rouge">docker-entrypoint.sh</code> file. Normally, we would keep the script as a separate file and add it to the image; we chose the <code class="language-plaintext highlighter-rouge">echo</code> method in order to have everything in a single file. In the script, before starting SciDB, we first start the SSH and PostgreSQL servers. The script uses a <code class="language-plaintext highlighter-rouge">trap</code> to catch various exit signals (i.e., when the container is stopped) and stops SciDB and PostgreSQL before exiting. As the last step, the script starts a Bash shell for the user’s convenience. Finally, in our image, we set the script as the container entry point using the <code class="language-plaintext highlighter-rouge">ENTRYPOINT</code> statement (see <a href="https://docs.docker.com/engine/reference/builder/#/entrypoint">Docker documentation</a>).</p>
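<p>The shutdown mechanics can be sketched with <code class="language-plaintext highlighter-rouge">echo</code> stand-ins for the real <code class="language-plaintext highlighter-rouge">service</code> and <code class="language-plaintext highlighter-rouge">scidb.py</code> commands (a hypothetical demo script, illustration only):</p>

```shell
# Sketch of the entrypoint's shutdown pattern, with echo stand-ins for
# the real service/scidb.py commands (hypothetical demo script):
cat > /tmp/entrypoint-demo.sh <<'EOF'
#!/bin/bash
echo "ssh started"
echo "postgresql started"
echo "scidb started"
# register cleanup commands to run when the shell exits or is signaled
trap "echo 'scidb stopped'; echo 'postgresql stopped'" EXIT HUP INT QUIT TERM
echo "container is up"
EOF
bash /tmp/entrypoint-demo.sh
# prints the three "started" lines and "container is up", then the two
# "stopped" lines as the EXIT trap fires
```

<p>Trapping <code class="language-plaintext highlighter-rouge">EXIT</code> covers the normal shutdown path; listing the signal names as well makes the cleanup run when <code class="language-plaintext highlighter-rouge">docker stop</code> delivers a termination signal to the entrypoint.</p>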
<h1 id="using-the-scidb-image">Using the SciDB Image</h1>
<p>Once we have all the steps described above in a Dockerfile, we use <code class="language-plaintext highlighter-rouge">docker build</code> to build the final SciDB image:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> <span class="o">></span> Dockerfile
<span class="c">## Requirements</span>
<span class="c">## ---</span>
FROM ubuntu:14.04
RUN apt-get update
...
<span class="c">## Starting SciDB</span>
<span class="c">## ---</span>
ENTRYPOINT <span class="o">[</span><span class="s2">"/docker-entrypoint.sh"</span><span class="o">]</span>
^D
<span class="nv">$ </span>docker build <span class="nt">--tag</span> scidb <span class="nb">.</span>
Sending build context to Docker daemon 10.75 kB
Step 1 : FROM ubuntu:14.04
<span class="nt">---</span><span class="o">></span> 38c759202e30
...
Step 45 : ENTRYPOINT /docker-entrypoint.sh
...
<span class="nt">---</span><span class="o">></span> 00ad89598441
Successfully built 00ad89598441
</code></pre></div></div>
<p>To use the image, we start a Docker container from it:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>docker run <span class="nt">--tty</span> <span class="nt">--interactive</span> scidb
<span class="k">*</span> Starting OpenBSD Secure Shell server sshd <span class="o">[</span> OK <span class="o">]</span>
<span class="k">*</span> Starting PostgreSQL 9.3 database server <span class="o">[</span> OK <span class="o">]</span>
scidb.py: INFO: Found 0 scidb processes
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 0<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 1<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 2<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
scidb.py: INFO: start<span class="o">((</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">)</span> <span class="nb">local </span>instance 3<span class="o">))</span>
scidb.py: INFO: Starting SciDB server.
root@71db8492009c:/usr/src/scidbtrunk# iquery <span class="nt">--afl</span> <span class="nt">--query</span> <span class="s2">"list('libraries')"</span>
<span class="o">{</span>inst,n<span class="o">}</span> name,major,minor,patch,build,build_type
<span class="o">{</span>0,0<span class="o">}</span> <span class="s1">'SciDB'</span>,15,12,1,80403125,<span class="s1">'Debug'</span>
<span class="o">{</span>1,0<span class="o">}</span> <span class="s1">'SciDB'</span>,15,12,1,80403125,<span class="s1">'Debug'</span>
<span class="o">{</span>2,0<span class="o">}</span> <span class="s1">'SciDB'</span>,15,12,1,80403125,<span class="s1">'Debug'</span>
<span class="o">{</span>3,0<span class="o">}</span> <span class="s1">'SciDB'</span>,15,12,1,80403125,<span class="s1">'Debug'</span>
root@71db8492009c:/usr/src/scidbtrunk# <span class="nb">exit
</span>scidb.py: INFO: stop<span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span>
scidb.py: INFO: checking <span class="o">(</span>server 0 <span class="o">(</span>127.0.0.1<span class="o">))</span> 119 120 121 122...
scidb.py: INFO: Found 4 scidb processes
scidb.py: INFO: Found 0 scidb processes
<span class="k">*</span> Stopping PostgreSQL 9.3 database server
</code></pre></div></div>
<p>Notice how SSH, PostgreSQL and SciDB servers are started when the container starts and stopped when the container stops.</p>
<p>Please note that the Dockerfile described in this post is space <em>inefficient</em> (its size is <code class="language-plaintext highlighter-rouge">6GB</code>) and does <em>not</em> follow the Dockerfile best practices. The image is built this way just for academic purposes. More efficient SciDB Docker images are available in the <a href="https://github.com/rvernica/docker-library/tree/master/scidb">docker-library</a> repository.</p>
<p>The full Dockerfile is available <a href="https://github.com/rvernica/scidb-examples/tree/master/docker-image">here</a>.</p>Rares VernicaDocker containers simplify the development and deployment of software through isolation. Containers are especially useful when the targeted software has a complicated installation procedure. This is the case for SciDB. SciDB needs to be compiled from source and requires a multitude of libraries and development tools to be installed. Moreover, SciDB is pretty specific on the type of operating system it supports. In this post, we look into how to create a Docker image for SciDB. The image can be used to launch Docker containers. The containers run isolated from the host operating system and on a multitude of operating systems, including Windows. We assume the reader has some familiarity with Docker and focus on SciDB particularities.The Power of Loading Data - Part 22016-06-01T00:00:00+00:002016-06-01T00:00:00+00:00http://rvernica.github.io/2016/06/load-data-non-int<p>In <a href="/2016/05/load-data">part 1</a> of this multi-part post, we gave a short overview of the <code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> plugin provided by Paradigm4 and showed how to automate loading multiple files by capturing both the data in the file and the data in the file name. In this post, we go a step further and show how to capture additional non-integer data from the file name. We use the non-integer data to simulate a non-integer dimension (by using an additional reference array).</p>
<h1 id="multiple-files-with-non-integer-metadata">Multiple Files with Non-Integer Metadata</h1>
<p>Suppose for example that our data is in three files as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec-A-1.txt
</span><span class="mi">10</span>
<span class="mi">20</span>
<span class="mi">30</span>
<span class="c1"># cat rec-A-2.txt
</span><span class="mi">40</span>
<span class="mi">50</span>
<span class="mi">60</span>
<span class="c1"># cat rec-B-1.txt
</span><span class="mi">70</span>
<span class="mi">80</span>
<span class="mi">90</span>
</code></pre></div></div>
<p>We would like to load both the data in the files and the data in the file names (i.e., <code class="language-plaintext highlighter-rouge">A</code>, <code class="language-plaintext highlighter-rouge">B</code>, <code class="language-plaintext highlighter-rouge">1</code> and <code class="language-plaintext highlighter-rouge">2</code>). The destination array has one attribute and three dimensions. The file line number of the value is used as the first dimension, <code class="language-plaintext highlighter-rouge">line</code>. The letter in the file name is used as the second dimension, <code class="language-plaintext highlighter-rouge">letter</code>. An additional reference array is used to store the string value of the letters. Finally, the number in the file name is used as the third dimension, <code class="language-plaintext highlighter-rouge">num</code>. The schemas of the destination array, <code class="language-plaintext highlighter-rouge">rec</code>, and the reference array, <code class="language-plaintext highlighter-rouge">rec_letter</code>, look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">create</span> <span class="n">array</span> <span class="n">rec</span><span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">int64</span><span class="o">></span> <span class="p">[</span><span class="n">line</span><span class="p">,</span> <span class="n">letter</span><span class="p">,</span> <span class="n">num</span><span class="p">];</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">create</span> <span class="n">array</span> <span class="n">rec_letter</span><span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">letter</span><span class="p">];</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
</code></pre></div></div>
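<p>Before diving into the queries, it may help to see the reference-array idea outside SciDB. The sketch below (plain Python, purely an illustration of the pattern, not SciDB code) encodes each distinct letter as a small integer index and uses that index as a coordinate, just as <code class="language-plaintext highlighter-rouge">rec</code> and <code class="language-plaintext highlighter-rouge">rec_letter</code> do:</p>

```python
# Simulate a non-integer dimension with a reference list: rec_letter
# maps an integer index to a string, and rec uses that index as the
# second coordinate, mirroring the rec and rec_letter arrays above.
rec_letter = []  # index -> letter
rec = {}         # (line, letter_index, num) -> value

def letter_index(letter):
    """Return the integer index of a letter, appending it if new."""
    if letter in rec_letter:
        return rec_letter.index(letter)
    rec_letter.append(letter)
    return len(rec_letter) - 1

# Load the contents of rec-A-1.txt: letter 'A', num 1.
for line, val in enumerate([10, 20, 30]):
    rec[(line, letter_index('A'), 1)] = val
```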
<h2 id="load-one-file">Load One File</h2>
<p>Let’s have a look at how to load data from one file. For example, let’s load the data from the <code class="language-plaintext highlighter-rouge">rec-A-1.txt</code> file. First, we have to insert the <code class="language-plaintext highlighter-rouge">A</code> value into the reference array, <code class="language-plaintext highlighter-rouge">rec_letter</code>. To do this, we use the <code class="language-plaintext highlighter-rouge">build</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/build">documentation</a>) to build an array containing <code class="language-plaintext highlighter-rouge">A</code>. Then we re-dimension this new array to match the shape of the <code class="language-plaintext highlighter-rouge">rec_letter</code> array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">letter</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">);</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
</code></pre></div></div>
<p>Now we use the value <code class="language-plaintext highlighter-rouge">0</code> for the <code class="language-plaintext highlighter-rouge">letter</code> dimension and insert the rest of the data in the <code class="language-plaintext highlighter-rouge">rec</code> array:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-A-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
</code></pre></div></div>
<p>Notice how the three dimensions (<code class="language-plaintext highlighter-rouge">line</code>, <code class="language-plaintext highlighter-rouge">letter</code>, and <code class="language-plaintext highlighter-rouge">num</code>) and the attribute (<code class="language-plaintext highlighter-rouge">val</code>) are set in the destination array. We can display the actual letter value using a <code class="language-plaintext highlighter-rouge">cross_join</code> (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/cross_join">documentation</a>) between the <code class="language-plaintext highlighter-rouge">rec</code> and the <code class="language-plaintext highlighter-rouge">rec_letter</code> arrays:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">cross_join</span><span class="p">(</span>
<span class="n">rec</span><span class="p">,</span>
<span class="n">rec_letter</span><span class="p">,</span>
<span class="n">rec</span><span class="p">.</span><span class="n">letter</span><span class="p">,</span>
<span class="n">rec_letter</span><span class="p">.</span><span class="n">letter</span><span class="p">);</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span><span class="p">,</span><span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span><span class="p">,</span><span class="s">'A'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span><span class="p">,</span><span class="s">'A'</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span><span class="p">,</span><span class="s">'A'</span>
</code></pre></div></div>
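<p>The decoding performed by this <code class="language-plaintext highlighter-rouge">cross_join</code> can be written in a few lines of plain Python (an illustration only, not SciDB code): joining on the <code class="language-plaintext highlighter-rouge">letter</code> coordinate is just an index into the reference list:</p>

```python
# Mirror cross_join(rec, rec_letter, rec.letter, rec_letter.letter):
# pair each value with the string its letter coordinate points to.
rec_letter = ['A']
rec = {(0, 0, 1): 10, (1, 0, 1): 20, (2, 0, 1): 30}

joined = {
    (line, letter, num): (val, rec_letter[letter])
    for (line, letter, num), val in rec.items()
}
```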
<h3 id="look-up-in-reference-array">Look-up in Reference Array</h3>
<p>A first step toward generalizing this is to look up the letter <code class="language-plaintext highlighter-rouge">A</code> in the <code class="language-plaintext highlighter-rouge">rec_letter</code> array and use its index instead of hard-coding it. We use the <code class="language-plaintext highlighter-rouge">index_lookup</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/index_lookup">documentation</a>). This operator requires an input array, so we build an array around the <code class="language-plaintext highlighter-rouge">A</code> value:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">);</span>
<span class="p">{</span><span class="n">i</span><span class="p">}</span> <span class="n">k</span><span class="p">,</span><span class="n">k_index</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span><span class="p">,</span><span class="mi">0</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">k_index</code> attribute contains the value that we need for the <code class="language-plaintext highlighter-rouge">letter</code> dimension. The value is <code class="language-plaintext highlighter-rouge">null</code> if <code class="language-plaintext highlighter-rouge">A</code> is not found in <code class="language-plaintext highlighter-rouge">rec_letter</code>. To append the <code class="language-plaintext highlighter-rouge">k_index</code> value to the data read from the file, we have to perform a <code class="language-plaintext highlighter-rouge">cross_join</code>. We then use <code class="language-plaintext highlighter-rouge">apply</code> to rename <code class="language-plaintext highlighter-rouge">k_index</code> to <code class="language-plaintext highlighter-rouge">letter</code>. The final query looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-A-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">)),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">k_index</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
</code></pre></div></div>
<p>The result is identical to that of the previous <code class="language-plaintext highlighter-rouge">insert</code> query, except that we did not hard-code the <code class="language-plaintext highlighter-rouge">0</code> value for the <code class="language-plaintext highlighter-rouge">letter</code> dimension.</p>
<h3 id="insert-sequentially-if-not-found">Insert Sequentially If Not Found</h3>
<p>The next step in generalizing this load is to remove the hard-coded <code class="language-plaintext highlighter-rouge">0</code> dimension value we used to insert <code class="language-plaintext highlighter-rouge">A</code> in the <code class="language-plaintext highlighter-rouge">rec_letter</code> array. Instead, we assume that “letters” are sequentially inserted into the <code class="language-plaintext highlighter-rouge">rec_letter</code> array starting at dimension <code class="language-plaintext highlighter-rouge">0</code>. So, new “letters” should be inserted at the “end” of the array. We do this by counting how many cells are in the <code class="language-plaintext highlighter-rouge">rec_letter</code> array and inserting the new “letter” at the “count” position. In SciDB, this is done using the <code class="language-plaintext highlighter-rouge">cross_join</code> and <code class="language-plaintext highlighter-rouge">apply</code> operators:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">aggregate</span><span class="p">(</span><span class="n">rec_letter</span><span class="p">,</span> <span class="n">count</span><span class="p">(</span><span class="o">*</span><span class="p">))),</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">count</span><span class="p">)),</span>
<span class="n">rec_letter</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">);</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
</code></pre></div></div>
<p>Notice that we used <code class="language-plaintext highlighter-rouge">i</code> for the dimension name in the <code class="language-plaintext highlighter-rouge">build</code> operator because we need to use <code class="language-plaintext highlighter-rouge">letter</code> later when assigning the count. Also, notice that if we run this query multiple times, multiple copies of <code class="language-plaintext highlighter-rouge">A</code> will be inserted into the <code class="language-plaintext highlighter-rouge">rec_letter</code> array, each time at an increasing position on the <code class="language-plaintext highlighter-rouge">letter</code> dimension (i.e., <code class="language-plaintext highlighter-rouge">1</code>, <code class="language-plaintext highlighter-rouge">2</code>, etc.). The final step in generalizing the loading of this file is to insert the letter <code class="language-plaintext highlighter-rouge">A</code> into the <code class="language-plaintext highlighter-rouge">rec_letter</code> array only if it does not already exist. We do this using the <code class="language-plaintext highlighter-rouge">index_lookup</code> and <code class="language-plaintext highlighter-rouge">filter</code> operators (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/filter">documentation</a>). That is, we first search for <code class="language-plaintext highlighter-rouge">A</code> in <code class="language-plaintext highlighter-rouge">rec_letter</code>. If <code class="language-plaintext highlighter-rouge">A</code> is not found, the index attribute is set to <code class="language-plaintext highlighter-rouge">null</code>. We then use <code class="language-plaintext highlighter-rouge">filter</code> to retain only the records whose index attribute is <code class="language-plaintext highlighter-rouge">null</code>. In other words, if <code class="language-plaintext highlighter-rouge">A</code> already exists in the array, the <code class="language-plaintext highlighter-rouge">insert</code> query inserts no new cells (they are all filtered out); otherwise, it inserts one new cell. The query is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="nb">filter</span><span class="p">(</span>
<span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">),</span>
<span class="n">k_index</span> <span class="ow">is</span> <span class="n">null</span><span class="p">),</span>
<span class="n">aggregate</span><span class="p">(</span><span class="n">rec_letter</span><span class="p">,</span> <span class="n">count</span><span class="p">(</span><span class="o">*</span><span class="p">))),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">count</span><span class="p">)),</span>
<span class="n">rec_letter</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">);</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
</code></pre></div></div>
<p>Notice that we used <code class="language-plaintext highlighter-rouge">k</code> as the attribute name in the <code class="language-plaintext highlighter-rouge">build</code> operator in order to avoid a name collision when calling the <code class="language-plaintext highlighter-rouge">index_lookup</code> operator. Re-running this query multiple times does not result in multiple copies of <code class="language-plaintext highlighter-rouge">A</code> being inserted. For completeness, the final and most general two queries to load one data file are:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="nb">filter</span><span class="p">(</span>
<span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">),</span>
<span class="n">k_index</span> <span class="ow">is</span> <span class="n">null</span><span class="p">),</span>
<span class="n">aggregate</span><span class="p">(</span><span class="n">rec_letter</span><span class="p">,</span> <span class="n">count</span><span class="p">(</span><span class="o">*</span><span class="p">))),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">count</span><span class="p">)),</span>
<span class="n">rec_letter</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">);</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-A-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'A'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">)),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">k_index</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
</code></pre></div></div>
<p>Notice that the only information hard-coded in the query is the <code class="language-plaintext highlighter-rouge">A</code> (in the two <code class="language-plaintext highlighter-rouge">build</code> operators) and the <code class="language-plaintext highlighter-rouge">1</code> (assigned to <code class="language-plaintext highlighter-rouge">num</code> in the last <code class="language-plaintext highlighter-rouge">apply</code> operator) extracted from the file name.</p>
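<p>Stripped of the SciDB operators, the first of the two queries is a get-or-assign on the reference array: look the letter up, and if it is absent, append it at the position given by the current count. In plain Python terms (an illustration of the logic, not SciDB code):</p>

```python
# get-or-assign on the reference array, mimicking
# index_lookup + filter(k_index is null) + aggregate(count(*)) + insert.
rec_letter = ['A']  # reference array after the first load

def get_or_assign(letter):
    # index_lookup: None plays the role of SciDB's null
    k_index = rec_letter.index(letter) if letter in rec_letter else None
    if k_index is None:            # filter(k_index is null)
        k_index = len(rec_letter)  # aggregate(rec_letter, count(*))
        rec_letter.append(letter)  # insert at the "count" position
    return k_index
```

<p>Calling it twice for the same letter is a no-op, which is exactly why the query above can be re-run safely.</p>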
<h2 id="automated-loading">Automated Loading</h2>
<p>Now that we have a general query for loading one file, we automate the process for loading a possibly large number of files. The process is very similar to the one described in <a href="/2016/05/load-data#automate-loading">part 1</a> of this multi-part post. Essentially, we need a query template for inserting data from one file and a Bash script for iterating over the files. The query template is identical to the queries built in the previous section, except that the hard-coded values are replaced by three parameters. One parameter is the file name (<code class="language-plaintext highlighter-rouge">$T_FILE</code>), and the other two correspond to the two pieces of metadata extracted from the file name (<code class="language-plaintext highlighter-rouge">$T_LETTER</code> and <code class="language-plaintext highlighter-rouge">$T_NUM</code>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec.afl.tmpl
</span>
<span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="nb">filter</span><span class="p">(</span>
<span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'$T_LETTER'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">),</span>
<span class="n">k_index</span> <span class="ow">is</span> <span class="n">null</span><span class="p">),</span>
<span class="n">aggregate</span><span class="p">(</span><span class="n">rec_letter</span><span class="p">,</span> <span class="n">count</span><span class="p">(</span><span class="o">*</span><span class="p">))),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">count</span><span class="p">)),</span>
<span class="n">rec_letter</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">);</span>
<span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">cross_join</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'$T_FILE'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">index_lookup</span><span class="p">(</span>
<span class="n">build</span><span class="p">(</span><span class="o"><</span><span class="n">k</span><span class="p">:</span><span class="n">string</span><span class="o">></span> <span class="p">[</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="err">?</span><span class="p">,</span><span class="err">?</span><span class="p">],</span> <span class="s">'$T_LETTER'</span><span class="p">),</span>
<span class="n">rec_letter</span><span class="p">,</span> <span class="n">k</span><span class="p">)),</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">letter</span><span class="p">,</span> <span class="n">k_index</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="err">$</span><span class="n">T_NUM</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
</code></pre></div></div>
<p>Compared with the Bash script used in <a href="/2016/05/load-data#bash-script">part 1</a>, the Bash script we use here extracts the two metadata values from the file name (the “letter” and the “number”) and uses them to parameterize the query template:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat rec.sh</span>
<span class="c">#!/bin/bash</span>
iquery <span class="nt">--afl</span> <span class="nt">--query</span> <span class="se">\</span>
<span class="s1">'create array rec<val:int64> [line, letter, num];
create array rec_letter<val:string> [letter]'</span>
<span class="nb">dir</span><span class="o">=</span><span class="si">$(</span><span class="nb">dirname</span> <span class="si">$(</span><span class="nb">readlink</span> <span class="nt">-f</span> <span class="nv">$0</span><span class="si">))</span>
<span class="nv">query_file</span><span class="o">=</span><span class="sb">`</span><span class="nb">mktemp</span><span class="sb">`</span>
<span class="k">for </span>file <span class="k">in</span> <span class="nv">$dir</span>/rec-<span class="k">*</span>.txt
<span class="k">do
</span><span class="nv">num</span><span class="o">=</span><span class="k">${</span><span class="nv">file</span><span class="p">//[^0-9]/</span><span class="k">}</span>
<span class="nv">letter</span><span class="o">=</span><span class="k">${</span><span class="nv">file</span><span class="p">//[^A-Z]/</span><span class="k">}</span>
<span class="nb">env </span><span class="nv">T_FILE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> <span class="nv">T_NUM</span><span class="o">=</span><span class="s2">"</span><span class="nv">$num</span><span class="s2">"</span> <span class="nv">T_LETTER</span><span class="o">=</span><span class="s2">"</span><span class="nv">$letter</span><span class="s2">"</span> envsubst <span class="se">\</span>
< <span class="nv">$dir</span>/rec.afl.tmpl <span class="o">>></span> <span class="nv">$query_file</span>
<span class="k">done
</span>iquery <span class="nt">--afl</span> <span class="nt">--query-file</span> <span class="nv">$query_file</span>
<span class="nb">rm</span> <span class="s2">"</span><span class="nv">$query_file</span><span class="s2">"</span>
</code></pre></div></div>
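<p>The two parameter expansions at the heart of the script can be tried in isolation. A small sketch (the file path <code class="language-plaintext highlighter-rouge">/data/rec-A-1.txt</code> is a made-up example):</p>

```shell
# Bash pattern substitution: delete every character NOT in the bracket class.
file=/data/rec-A-1.txt
num=${file//[^0-9]/}      # keep digits only          -> 1
letter=${file//[^A-Z]/}   # keep capital letters only -> A
echo "$letter $num"
```

<p>Note that any stray digit or capital letter elsewhere in the path would leak into the result, which is why these expressions may need adjusting for real directory layouts.</p>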
<p>Note that we are using very simple pattern matching expressions to extract the meta-data from the file names (<code class="language-plaintext highlighter-rouge">num</code> and <code class="language-plaintext highlighter-rouge">letter</code> variables). In practice these might need to be adjusted. The output from executing the script looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ./rec.sh
</span><span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">40</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">50</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">60</span>
<span class="p">{</span><span class="n">letter</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">}</span> <span class="s">'A'</span>
<span class="p">{</span><span class="mi">1</span><span class="p">}</span> <span class="s">'B'</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">letter</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">40</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">70</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">50</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">80</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">60</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">90</span>
</code></pre></div></div>
<p>Notice that even though we have two input files with <code class="language-plaintext highlighter-rouge">A</code> in the filename (<code class="language-plaintext highlighter-rouge">rec-A-1.txt</code> and <code class="language-plaintext highlighter-rouge">rec-A-2.txt</code>), <code class="language-plaintext highlighter-rouge">A</code> only appears once in the <code class="language-plaintext highlighter-rouge">rec_letter</code> array. Also, the two arrays (<code class="language-plaintext highlighter-rouge">rec_letter</code> and <code class="language-plaintext highlighter-rouge">rec</code>) are listed in the output after each insert query, so you can see them “growing” as the script progresses.</p>
<p>The input data, the query template, and the Bash script are available <a href="https://github.com/rvernica/scidb-examples/tree/master/data-load-non-int">here</a>.</p>
Rares Vernica
In part 1 of this multi-part post, we gave a short overview of the accelerated_io_tools plugin provided by Paradigm4 and showed how to automate loading multiple files by capturing both the data in the file as well as the data in the file name. In this post, we go a step further and show how to capture additional non-integer data from the file name. We use the non-integer data to simulate a non-integer dimension (by using an additional reference array).
The Power of Loading Data - Part 1
2016-05-01T00:00:00+00:00
http://rvernica.github.io/2016/05/load-data
<p>Loading data into a database is a very important operation, and SciDB is no exception. Built into SciDB is the vanilla <code class="language-plaintext highlighter-rouge">load</code> operator documented <a href="https://paradigm4.atlassian.net/wiki/display/ESD/load">here</a>. On top of that, Paradigm4 provides an advanced loading operator as part of its <a href="https://github.com/Paradigm4/accelerated_io_tools">accelerated_io_tools</a> plugin. Like SciDB, this plugin is open-source; unlike SciDB, it is hosted on GitHub and contributions are welcome. The loading operator provided in this plugin has several advanced features, including:</p>
<ol>
<li>Fully distributed parsing and packing</li>
<li>Loading from multiple files</li>
<li>Error tolerance</li>
</ol>
<p>We highly recommend taking the time to understand its <a href="https://github.com/Paradigm4/accelerated_io_tools/blob/master/README.md">usage</a>. The easiest way to install this plugin is to first install the <code class="language-plaintext highlighter-rouge">dev_tools</code> plugin (see <a href="https://github.com/Paradigm4/dev_tools">GitHub</a>) (<em>Update:</em> see this <a href="/2016/10/extend-scidb-doc#installing-the-development-tools">post</a> for a discussion). Once <code class="language-plaintext highlighter-rouge">dev_tools</code> is available, the <code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> plugin can be installed with the AFL query <code class="language-plaintext highlighter-rouge">install_github('paradigm4/accelerated_io_tools')</code>. In this multi-part post, we discuss various use cases for loading data with this plugin. In this part, we show how to load data from multiple files as well as how to capture the metadata encoded in the file names.</p>
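<p>For reference, the installation steps above correspond to AFL queries along these lines (an untested sketch; it assumes the <code class="language-plaintext highlighter-rouge">dev_tools</code> plugin files are already deployed on the server):</p>

```
load_library('dev_tools');
install_github('paradigm4/accelerated_io_tools');
load_library('accelerated_io_tools');
```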
<h1 id="multiple-files-with-metadata">Multiple Files with Metadata</h1>
<p>As already mentioned, the loading operator provided in <code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> (i.e., <code class="language-plaintext highlighter-rouge">aio_input</code>) is capable of loading data from multiple files at once (see the <a href="https://github.com/Paradigm4/accelerated_io_tools/blob/master/README.md#2-loading-from-multiple-files">Loading from multiple files</a> and <a href="https://github.com/Paradigm4/accelerated_io_tools/blob/master/README.md#load-from-one-or-multiple-files">Load from one or multiple files</a> sections in the plugin documentation). To use the operator this way, the user must list all the file names as part of the operator arguments. This certainly has its uses, most notably the advantage of running fully distributed.</p>
<p>On the other hand, imagine a situation where you are loading data from a possibly large number of files. In this case, listing all the file names as part of the query is impractical, if not impossible. Moreover, imagine there is additional metadata encoded in the file names that you would like to capture in the database as well.</p>
<p>Assume, for example, that our data is in two files as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec-1.txt
</span><span class="mi">10</span>
<span class="mi">20</span>
<span class="mi">30</span>
<span class="c1"># cat rec-2.txt
</span><span class="mi">40</span>
<span class="mi">50</span>
<span class="mi">60</span>
</code></pre></div></div>
<p>We would like to load both the data in the files, as well as the metadata in the file names (i.e., <code class="language-plaintext highlighter-rouge">1</code> and <code class="language-plaintext highlighter-rouge">2</code>). The destination array has two dimensions: the line number of each value maps to the first dimension, <code class="language-plaintext highlighter-rouge">line</code>, while the metadata from the file name maps to the second dimension, <code class="language-plaintext highlighter-rouge">num</code> (the values themselves go into the <code class="language-plaintext highlighter-rouge">val</code> attribute). The schema of the destination array looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">create</span> <span class="n">array</span> <span class="n">rec</span><span class="o"><</span><span class="n">val</span><span class="p">:</span><span class="n">int64</span><span class="o">></span> <span class="p">[</span><span class="n">line</span><span class="p">,</span> <span class="n">num</span><span class="p">];</span>
<span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
</code></pre></div></div>
<h2 id="load-one-file">Load One File</h2>
<p>Let’s see how we can load the data from one file first; we then extend the procedure to multiple files. To load the data from one file, we can start with:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">rec_file</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'20'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'30'</span><span class="p">,</span><span class="n">null</span>
</code></pre></div></div>
<p>Refer to the <a href="https://github.com/Paradigm4/accelerated_io_tools/blob/master/README.md#trivial-end-to-end-example">Trivial end-to-end example</a> in the plugin documentation for understanding the parameters and the output. Notice that the operator requires the full path to the data file. The data is stored temporarily in the <code class="language-plaintext highlighter-rouge">rec_file</code> array. Next, we have to re-dimension it and add the metadata from the file name:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># iquery --afl
</span><span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">rec_file</span><span class="p">,</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
</code></pre></div></div>
<p>We did quite a bit in this query, so let’s walk through it step by step. The inner-most operator is an <code class="language-plaintext highlighter-rouge">apply</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/apply">documentation</a>). This operator takes as input the <code class="language-plaintext highlighter-rouge">rec_file</code> array and prepares the attributes and dimensions for our destination array <code class="language-plaintext highlighter-rouge">rec</code>. Remember, the <code class="language-plaintext highlighter-rouge">rec</code> array has one attribute (<code class="language-plaintext highlighter-rouge">val</code>) and two dimensions (<code class="language-plaintext highlighter-rouge">line</code> and <code class="language-plaintext highlighter-rouge">num</code>). The <code class="language-plaintext highlighter-rouge">apply</code> operator performs the following operations:</p>
<ol>
<li>Converts the values read from the file to integers and stores them in the <code class="language-plaintext highlighter-rouge">val</code> attribute. (<code class="language-plaintext highlighter-rouge">... val, int64(a0), ...</code>)</li>
<li>Copies the <code class="language-plaintext highlighter-rouge">tuple_no</code> dimension to the <code class="language-plaintext highlighter-rouge">line</code> attribute (<code class="language-plaintext highlighter-rouge">... line, tuple_no, ...</code>)</li>
<li>Sets the <code class="language-plaintext highlighter-rouge">num</code> attribute to <code class="language-plaintext highlighter-rouge">1</code> (<code class="language-plaintext highlighter-rouge">... num, 1), ...</code>)</li>
</ol>
<p>The <code class="language-plaintext highlighter-rouge">apply</code> operator is followed by the <code class="language-plaintext highlighter-rouge">redimension</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/redimension">documentation</a>). This operator converts the input array to the structure of the <code class="language-plaintext highlighter-rouge">rec</code> array by dropping extra dimensions and attributes, and mapping attributes to dimensions (<code class="language-plaintext highlighter-rouge">line</code> and <code class="language-plaintext highlighter-rouge">num</code>). As an exercise, run only the <code class="language-plaintext highlighter-rouge">apply</code> operator and examine the output. It has three dimensions and five attributes.</p>
<p>The output of the <code class="language-plaintext highlighter-rouge">redimension</code> operator is passed as input to the <code class="language-plaintext highlighter-rouge">insert</code> operator (see <a href="https://paradigm4.atlassian.net/wiki/display/ESD/insert">documentation</a>) which inserts the data into the <code class="language-plaintext highlighter-rouge">rec</code> array. The resulting <code class="language-plaintext highlighter-rouge">rec</code> array is printed at the console. Notice the values for the attribute (<code class="language-plaintext highlighter-rouge">val</code>) and the two dimensions (<code class="language-plaintext highlighter-rouge">line</code> and <code class="language-plaintext highlighter-rouge">num</code>).</p>
<h2 id="automated-loading">Automated Loading</h2>
<p>Now that we know how to load one file as well as store the metadata encoded in the file name, let’s see how we can automate this process and load a possibly large number of files. Essentially, we have to write a Bash script which generates two queries for each input file. The two queries are the <code class="language-plaintext highlighter-rouge">store</code> and <code class="language-plaintext highlighter-rouge">insert</code> queries from above, customized for each file. The customization includes reading the right input file from the disk and setting the right value for the <code class="language-plaintext highlighter-rouge">num</code> dimension (hard-coded to <code class="language-plaintext highlighter-rouge">1</code> in the example above).</p>
<h3 id="query-template">Query Template</h3>
<p>The first step is to list the two queries in a file and parametrize them as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># cat rec.afl.tmpl
</span>
<span class="n">store</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'$T_FILE'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">rec_file</span><span class="p">);</span>
<span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">rec_file</span><span class="p">,</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="err">$</span><span class="n">T_NUM</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
</code></pre></div></div>
<p>The file above is a template file with two parameters <code class="language-plaintext highlighter-rouge">$T_FILE</code> and <code class="language-plaintext highlighter-rouge">$T_NUM</code>. We can merge the two queries into a single query by moving the <code class="language-plaintext highlighter-rouge">aio_input</code> operator inside the <code class="language-plaintext highlighter-rouge">apply</code> operator (replacing <code class="language-plaintext highlighter-rouge">rec_file</code>) and removing the <code class="language-plaintext highlighter-rouge">store</code> operator. We leave this as an exercise to the reader. Using the <code class="language-plaintext highlighter-rouge">envsubst</code> Linux command, we can easily instantiate the template and provide values for these two parameters. The result is a valid query:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># env T_FILE="/rec-1.txt" T_NUM=1 envsubst < rec.afl.tmpl
</span>
<span class="n">store</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">rec_file</span><span class="p">);</span>
<span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">rec_file</span><span class="p">,</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
</code></pre></div></div>
<p>Notice how the template parameters have been replaced. Also, notice that if the file paths and names follow the same convention, we could use only one template parameter, <code class="language-plaintext highlighter-rouge">$T_NUM</code>, and construct the file path from it. Now, we can easily run the resulting query by piping it through <code class="language-plaintext highlighter-rouge">iquery</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># env T_FILE="/rec-1.txt" T_NUM=1 envsubst < rec.afl.tmpl | iquery --afl
</span><span class="n">AFL</span><span class="o">%</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">store</span><span class="p">(</span>
<span class="n">aio_input</span><span class="p">(</span><span class="s">'/rec-1.txt'</span><span class="p">,</span> <span class="s">'num_attributes=1'</span><span class="p">),</span>
<span class="n">rec_file</span><span class="p">);</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'20'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'30'</span><span class="p">,</span><span class="n">null</span>
<span class="n">AFL</span><span class="o">%</span>
<span class="n">AFL</span><span class="o">%</span> <span class="n">insert</span><span class="p">(</span>
<span class="n">redimension</span><span class="p">(</span>
<span class="nb">apply</span><span class="p">(</span>
<span class="n">rec_file</span><span class="p">,</span>
<span class="n">val</span><span class="p">,</span> <span class="n">int64</span><span class="p">(</span><span class="n">a0</span><span class="p">),</span>
<span class="n">line</span><span class="p">,</span> <span class="n">tuple_no</span><span class="p">,</span>
<span class="n">num</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">rec</span><span class="p">),</span>
<span class="n">rec</span><span class="p">);</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
</code></pre></div></div>
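<p>Since the file names here follow a fixed convention, the single-parameter variant mentioned above can derive the path from the number alone. A hypothetical sketch (the <code class="language-plaintext highlighter-rouge">/rec-N.txt</code> convention is taken from the example above):</p>

```shell
# Derive the file path from the single parameter T_NUM,
# instead of passing both T_FILE and T_NUM to the template.
T_NUM=1
T_FILE="/rec-${T_NUM}.txt"
echo "$T_FILE"
```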
<h3 id="bash-script">Bash Script</h3>
<p>Finally, we can write a Bash script which loops over the files and instantiates the query template for each file. The resulting queries are collected into a temporary file which is then run against the database. The Bash script follows:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># cat rec.sh</span>
<span class="c">#!/bin/bash</span>
iquery <span class="nt">--afl</span> <span class="nt">--query</span> <span class="se">\</span>
<span class="s1">'create array rec<val:int64> [line, num]'</span>
<span class="nb">dir</span><span class="o">=</span><span class="si">$(</span><span class="nb">dirname</span> <span class="si">$(</span><span class="nb">readlink</span> <span class="nt">-f</span> <span class="nv">$0</span><span class="si">))</span>
<span class="nv">query_file</span><span class="o">=</span><span class="sb">`</span><span class="nb">mktemp</span><span class="sb">`</span>
<span class="k">for </span>file <span class="k">in</span> <span class="nv">$dir</span>/rec-<span class="k">*</span>.txt
<span class="k">do
</span><span class="nv">num</span><span class="o">=</span><span class="k">${</span><span class="nv">file</span><span class="p">//[^0-9]/</span><span class="k">}</span>
<span class="nb">env </span><span class="nv">T_FILE</span><span class="o">=</span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> <span class="nv">T_NUM</span><span class="o">=</span><span class="s2">"</span><span class="nv">$num</span><span class="s2">"</span> envsubst <span class="se">\</span>
< <span class="nv">$dir</span>/rec.afl.tmpl <span class="o">>></span> <span class="nv">$query_file</span>
<span class="k">done
</span>iquery <span class="nt">--afl</span> <span class="nt">--query-file</span> <span class="nv">$query_file</span>
<span class="nb">rm</span> <span class="s2">"</span><span class="nv">$query_file</span><span class="s2">"</span>
</code></pre></div></div>
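<p>The collect-then-run pattern used by the script can be distilled into a few lines. A self-contained sketch (the file names are inlined stand-ins rather than globbed, and the appended text is a placeholder for the instantiated queries):</p>

```shell
# Append one instantiated snippet per file to a temporary file;
# the whole batch would then be executed once and cleaned up.
query_file=$(mktemp)
for file in /data/rec-1.txt /data/rec-2.txt
do
    num=${file//[^0-9]/}
    echo "-- load $file with num=$num" >> "$query_file"
done
cat "$query_file"
rm "$query_file"
```

<p>Collecting everything into one file means a single <code class="language-plaintext highlighter-rouge">iquery</code> invocation handles all the loads, rather than paying the connection cost once per file.</p>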
<p>The script starts by creating the <code class="language-plaintext highlighter-rouge">rec</code> array. For each file, it extracts the metadata from the file name into the <code class="language-plaintext highlighter-rouge">num</code> variable. The file names used in this case are simple and the data is easy to extract. In practice, more complicated pattern matching might be needed. The variable <code class="language-plaintext highlighter-rouge">query_file</code> holds the name of a temporary file which contains the queries needed to load all the data files. These queries are run against the database at the end of the script and the temporary file is removed. The output from executing the script looks like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ./rec.sh
</span><span class="n">Query</span> <span class="n">was</span> <span class="n">executed</span> <span class="n">successfully</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'10'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'20'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'30'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="n">tuple_no</span><span class="p">,</span><span class="n">dst_instance_id</span><span class="p">,</span><span class="n">src_instance_id</span><span class="p">}</span> <span class="n">a0</span><span class="p">,</span><span class="n">error</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'40'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'50'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">}</span> <span class="s">'60'</span><span class="p">,</span><span class="n">null</span>
<span class="p">{</span><span class="n">line</span><span class="p">,</span><span class="n">num</span><span class="p">}</span> <span class="n">val</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">10</span>
<span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">40</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">20</span>
<span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">50</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">}</span> <span class="mi">30</span>
<span class="p">{</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">}</span> <span class="mi">60</span>
</code></pre></div></div>
<p>The last <code class="language-plaintext highlighter-rouge">{line,num} val</code> listing contains the final contents of the <code class="language-plaintext highlighter-rouge">rec</code> array created at the beginning of the post. The array holds the data from both input files with the proper <code class="language-plaintext highlighter-rouge">line</code> and <code class="language-plaintext highlighter-rouge">num</code> dimensions set. In practice, you might want to run the <code class="language-plaintext highlighter-rouge">iquery</code> command with the <code class="language-plaintext highlighter-rouge">--no-fetch</code> argument so that the data is not fetched and printed.</p>
<p>Please note that while this is a general process for loading multiple files into SciDB, it does not take advantage of the distributed loading capabilities of the <code class="language-plaintext highlighter-rouge">accelerated_io_tools</code> plugin. We are essentially loading the data one file at a time.</p>
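<p>The one-file-at-a-time pattern can be sketched as a small Bash loop. This is a hypothetical sketch, not the script from the repository: the file names, the <code class="language-plaintext highlighter-rouge">aio_input</code> arguments, and the target array <code class="language-plaintext highlighter-rouge">rec</code> are assumptions, and the <code class="language-plaintext highlighter-rouge">iquery</code> command is echoed rather than executed so the sketch runs without a SciDB installation.</p>

```sh
# Hypothetical sketch of loading multiple files one at a time.
# The iquery command is printed, not executed; on a SciDB host you
# would drop the "echo" and run each query for real.
mkdir -p /tmp/load_demo
touch /tmp/load_demo/chunk1.csv /tmp/load_demo/chunk2.csv

for f in /tmp/load_demo/*.csv; do
    # --no-fetch: run the query without fetching and printing its result
    echo "iquery --no-fetch --afl --query \"insert(redimension(aio_input('$f', 'num_attributes=1'), rec), rec)\""
done
```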
<p>The input data, the query template, and the Bash script are available <a href="https://github.com/rvernica/scidb-examples/tree/master/data-load">here</a>.</p>Rares VernicaLoading data into a database is a very important operation. SciDB is no exception. Built into SciDB is the vanilla load operator documented here. On top of that, Paradigm4 provides an advanced loading operator as part of their accelerated_io_tools plugin. Like SciDB, this plugin is open-source; unlike SciDB, it is hosted on GitHub and contributions are welcome. The loading operator provided in this plugin has a couple of advanced features, including:Getting Started with SciDB2016-04-01T00:00:00+00:002016-04-01T00:00:00+00:00http://rvernica.github.io/2016/04/getting-started<p>The easiest way to get started with SciDB is using Amazon Web Services (AWS). There are multiple SciDB Amazon Machine Images (AMIs) provided by <a href="http://www.paradigm4.com">Paradigm4</a> (the company behind SciDB). For each AMI, the type of Amazon Elastic Compute Cloud (EC2) instance recommended by Paradigm4 is pretty beefy. We recommend following the Paradigm4 instructions. The EC2 instances can be easily <em>stopped</em> or <em>terminated</em> from the <em>EC2 Instances</em> page.</p>
<h1 id="available-scidb-amis">Available SciDB AMIs</h1>
<p>The official Paradigm4 SciDB AMIs for different SciDB versions are listed below. For each, we provide a link to the <em>EC2 Images</em> search page on AWS. These links require a valid AWS account.</p>
<table>
<thead>
<tr>
<th>Version</th>
<th>Release Month</th>
<th>Docs</th>
<th>Name</th>
<th>AMI ID</th>
<th>AMI</th>
<th>Quick Start</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>15.12</strong></td>
<td>Apr ’16</td>
<td><a href="http://www.paradigm4.com/docs/15.12">Docs</a></td>
<td><em>SciDB 15.12 Bioinformatics and Finance</em></td>
<td><code class="language-plaintext highlighter-rouge">ami-07c2d06d</code></td>
<td><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;imageId=ami-07c2d06d;sort=name">AMI</a></td>
<td>Available on request<sup><a href="#footnote1">1</a></sup></td>
</tr>
<tr>
<td>15.7</td>
<td>Aug ’15</td>
<td><a href="http://www.paradigm4.com/HTMLmanual/15.7/scidb_ug">Docs</a></td>
<td> </td>
<td>No AMI available</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>14.12</td>
<td>Jan ’15</td>
<td><a href="http://www.paradigm4.com/HTMLmanual/14.12/scidb_ug">Docs</a></td>
<td><em>SciDB 14.12 + shim + SciDBR + SciDBPy + IPython Notebook</em></td>
<td><code class="language-plaintext highlighter-rouge">ami-3cace654</code></td>
<td><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;imageId=ami-3cace654;sort=name">AMI</a></td>
<td> </td>
</tr>
<tr>
<td>14.8</td>
<td>Aug ’14</td>
<td><a href="http://www.paradigm4.com/HTMLmanual/14.8/scidb_ug">Docs</a></td>
<td><em>SciDB_14.8_2</em> <br /> <em>SciDB_14.8</em></td>
<td><code class="language-plaintext highlighter-rouge">ami-eef47286</code> <br /> <code class="language-plaintext highlighter-rouge">ami-aef274c6</code></td>
<td><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;imageId=ami-eef47286;sort=name">AMI</a> <br /> <a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;imageId=ami-aef274c6;sort=name">AMI</a></td>
<td><a href="http://forum.paradigm4.com/uploads/db6652/original/1X/9774b456f8ebde4ce40314f9b2265b3d2740fa7a.pdf">Quick Start</a> (Sec.2)</td>
</tr>
<tr>
<td>14.3</td>
<td>Apr ’14</td>
<td><a href="http://scidb.org/HTMLmanual/14.3/scidb_ug">Docs</a></td>
<td><em>SciDB14.3</em></td>
<td><code class="language-plaintext highlighter-rouge">ami-7592881c</code></td>
<td><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;imageId=ami-7592881c;sort=name">AMI</a></td>
<td><a href="http://forum.paradigm4.com/uploads/db6652/original/1X/cafaf09765d64b024a32eba012369f700c784fa7.pdf">Quick Start</a><sup><a href="#footnote2">2</a></sup> (Sec.2.2)</td>
</tr>
<tr>
<td>13.12</td>
<td>Jan ’14</td>
<td><a href="http://scidb.org/HTMLmanual/13.12/scidb_ug/">Docs</a></td>
<td><em>Scidb 13.12 Quick Start image</em></td>
<td><code class="language-plaintext highlighter-rouge">ami-9f132cf6</code></td>
<td><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;imageId=ami-9f132cf6;sort=name">AMI</a></td>
<td><a href="http://forum.paradigm4.com/uploads/db6652/original/1X/e19f73c889143ad09f486f61d3aabf7c68c9099e.pdf">Quick Start</a> (Sec.2.2)</td>
</tr>
<tr>
<td>Older</td>
<td> </td>
<td> </td>
<td> </td>
<td>Owner <code class="language-plaintext highlighter-rouge">984687207943</code></td>
<td><a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;ownerAlias=984687207943;sort=name">AMIs</a></td>
<td> </td>
</tr>
</tbody>
</table>
<p><a name="footnote1" class="anchor"><sup>1</sup></a> Quick Start available on request from Paradigm4. See release <a href="http://forum.paradigm4.com/t/scidb-release-15-12/1186">announcement</a>.<br />
<a name="footnote2" class="anchor"><sup>2</sup></a> The AMI ID listed in the Quick Start is for version 13.12, which is probably a typo.</p>
<p>Additionally, you can see a list of all AMIs with SciDB in their name by using <em>AMI Name</em> <code class="language-plaintext highlighter-rouge">SciDB</code> as the search criterion (<a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Images:visibility=public-images;name=SciDB;sort=name">AMIs</a>). Use them with caution. Below is a screenshot of the <em>EC2 Images</em> result list.</p>
<p><img src="/assets/img/posts/ami-list.jpg" alt="AMI list screenshot" /></p>
<h1 id="scidb-1512-ami">SciDB 15.12 AMI</h1>
<p>Let’s take a look at the AMI provided for SciDB 15.12. The Quick Start guide is not publicly available; interested parties have to contact Paradigm4 to obtain it.</p>
<p>When the AMI is started, SciDB, Shim, RStudio, and Jupyter are also started. <a href="https://github.com/Paradigm4/shim">Shim</a> is a simple HTTP service which allows you to run queries against SciDB from a web browser. It also has a <em>Dashboard</em> page listing the SciDB instances. Shim starts on port <code class="language-plaintext highlighter-rouge">8080</code>. Below is a screenshot of Shim running on the AMI.</p>
<p><img src="/assets/img/posts/shim.jpg" alt="Shim screenshot" /></p>
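<p>Shim’s HTTP API can also be driven from the command line with <code class="language-plaintext highlighter-rouge">curl</code>. The sketch below prints the requests rather than issuing them, so it runs without a live Shim; the endpoint names come from the Shim documentation, while the host, the port, and the <code class="language-plaintext highlighter-rouge">&lt;id&gt;</code> placeholder are assumptions.</p>

```sh
# Hypothetical walkthrough of Shim's HTTP API. The curl commands are
# printed (echoed) instead of executed so no running Shim is required;
# substitute the session id returned by new_session for <id>.
SHIM=http://localhost:8080

echo "curl -s $SHIM/new_session"                                   # open a session, returns an id
echo "curl -s '$SHIM/execute_query?id=<id>&query=list()&save=dsv'" # run a query, save output as DSV
echo "curl -s '$SHIM/read_lines?id=<id>&n=0'"                      # fetch the saved output
echo "curl -s $SHIM/release_session?id=<id>"                       # clean up the session
```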
<p><a href="https://www.rstudio.com/">RStudio</a> starts on port <code class="language-plaintext highlighter-rouge">8787</code>. There are quite a few examples of using SciDB from R provided with the AMI. Below is a screenshot of RStudio running on the AMI.</p>
<p><img src="/assets/img/posts/rstudio.jpg" alt="RStudio screenshot" /></p>
<p>Finally, <a href="http://jupyter.org/">Jupyter (IPython)</a> starts on port <code class="language-plaintext highlighter-rouge">8888</code>. An example of financial data analysis is provided with the AMI. To access it, first log in to the instance remotely and run:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo cp</span> <span class="nt">-r</span> ~scidb_finance/TAQ ~scidb_bio/
<span class="nb">sudo chown</span> <span class="nt">-R</span> scidb_bio:scidb_bio ~scidb_bio/TAQ
</code></pre></div></div>
<p>These two commands copy the example files from the <code class="language-plaintext highlighter-rouge">scidb_finance</code> account into the <code class="language-plaintext highlighter-rouge">scidb_bio</code> account and set their ownership to <code class="language-plaintext highlighter-rouge">scidb_bio</code>. This is required because the Jupyter service starts under the <code class="language-plaintext highlighter-rouge">scidb_bio</code> account. Below is a screenshot of Jupyter running on the AMI.</p>
<p><img src="/assets/img/posts/jupyter.jpg" alt="Jupyter screenshot" /></p>Rares VernicaThe easiest way to get started with SciDB is using Amazon Web Services (AWS). There are multiple SciDB Amazon Machine Images (AMIs) provided by Paradigm4 (the company behind SciDB). For each AMI, the type of Amazon Elastic Compute Cloud (EC2) instance recommended by Paradigm4 is pretty beefy. We recommend following the Paradigm4 instructions. The EC2 instances can be easily stopped or terminated from the EC2 Instances page.