Loading data into a database is an essential operation, and SciDB is no exception. Built into SciDB is the vanilla load operator documented here. On top of that, Paradigm4 provides an advanced loading operator as part of their accelerated_io_tools plugin. Like SciDB itself, the plugin is open-source; unlike SciDB, it is hosted on GitHub and contributions are welcome. The loading operator provided in this plugin has several advanced features, including:

  1. Fully distributed parsing and packing
  2. Loading from multiple files
  3. Error tolerance

We highly recommend taking the time to understand its usage. In this multi-part post, we discuss various use cases of loading data with this plugin; in this part, we discuss how to load data from multiple files as well as how to load the metadata encoded in the file names. The easiest way to install the plugin is to install the dev_tools plugin (see GitHub) first (Update: see this post for a discussion). Using the dev_tools plugin, accelerated_io_tools can be installed with the AFL query install_github('paradigm4/accelerated_io_tools').
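Assuming the dev_tools plugin is already installed and loaded, and assuming the new library still has to be loaded after installation (this depends on your setup), the session might look like this:

# iquery --afl
AFL% install_github('paradigm4/accelerated_io_tools');
AFL% load_library('accelerated_io_tools');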

Multiple Files with Metadata

As already mentioned, the loading operator provided in accelerated_io_tools (i.e., aio_input) is capable of loading data from multiple files at once (see the Loading from multiple files and Load from one or multiple files sections in the plugin documentation). To use the operator this way, the user has to list all the file names as part of the operator arguments, as sketched below. Without a doubt, this mode has its uses, chief among them the advantage of running fully distributed.
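For example, based on the multi-file syntax described in those sections (the paths and instance IDs below are illustrative assumptions, not taken from our setup), a query along these lines reads two files on two different instances at once:

# iquery --afl
AFL% aio_input(
       'paths=/rec-1.txt;/rec-2.txt',
       'instances=0;1',
       'num_attributes=1');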

On the other hand, imagine a situation where you are loading data from a possibly large number of files. In this case, listing all the file names as part of the query is neither practical nor, past a certain number of files, even possible. Moreover, imagine there is additional metadata encoded in the file names that you would like to capture in the database as well.

Assume, for example, that our data is in two files as follows:

# cat rec-1.txt
10
20
30
# cat rec-2.txt
40
50
60

We would like to load both the data in the files and the metadata in the file names (i.e., 1 and 2). The destination array has one attribute and two dimensions. The values read from the files go in the val attribute, the line number of each value maps to the first dimension, line, and the metadata from the file names maps to the second dimension, num. The schema of the destination array looks like this:

# iquery --afl
AFL% create array rec<val:int64> [line, num];
Query was executed successfully

Load One File

Let’s see how we can load the data from one file first and then extend the procedure to multiple files. To load the data from one file, we can start with:

# iquery --afl
AFL% store(
       aio_input('/rec-1.txt', 'num_attributes=1'),
       rec_file);
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '10',null
{1,0,0} '20',null
{2,0,0} '30',null

Refer to the Trivial end-to-end example in the plugin documentation to understand the parameters and the output. Notice that the operator requires the full path to the data file. The data is stored temporarily in the rec_file array. Next, we have to redimension it and add the metadata from the file name:

# iquery --afl
AFL% insert(
       redimension(
          apply(
            rec_file,
            val, int64(a0),
            line, tuple_no,
            num, 1),
          rec),
       rec);
{line,num} val
{0,1} 10
{1,1} 20
{2,1} 30

We did quite a bit in this query, so let’s walk through it step by step. The innermost operator is apply (see documentation). This operator takes the rec_file array as input and prepares the attributes and dimensions for our destination array, rec. Remember, the rec array has one attribute (val) and two dimensions (line and num). The apply operator performs the following operations:

  1. Converts the values read from the file to integers and stores them in the val attribute (... val, int64(a0), ...)
  2. Copies the tuple_no dimension to the line attribute (... line, tuple_no, ...)
  3. Sets the num attribute to 1 (... num, 1 ...)

The apply operator is followed by the redimension operator (see documentation). This operator converts the input array to the structure of the rec array by dropping the extra dimensions and attributes and by mapping the line and num attributes to dimensions. As an exercise, run only the apply operator and examine the output: it has three dimensions and five attributes, as sketched below.
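For reference, the exercise looks like this; the output is reconstructed from the session above, with apply appending the new attributes after a0 and error:

# iquery --afl
AFL% apply(
       rec_file,
       val, int64(a0),
       line, tuple_no,
       num, 1);
{tuple_no,dst_instance_id,src_instance_id} a0,error,val,line,num
{0,0,0} '10',null,10,0,1
{1,0,0} '20',null,20,1,1
{2,0,0} '30',null,30,2,1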

The output of the redimension operator is passed as input to the insert operator (see documentation), which inserts the data into the rec array. The resulting rec array is printed at the console. Notice the values for the attribute (val) and the two dimensions (line and num).

Automated Loading

Now that we know how to load one file and store the metadata encoded in its file name, let’s see how we can automate this process and load a possibly large number of files. Essentially, we have to write a Bash script which generates two queries for each input file. The two queries are the store and insert queries from above, customized for each file. The customization consists of reading the right input file from disk and setting the right value for the num dimension (hard-coded to 1 in the example above).

Query Template

The first step is to list the two queries in a file and parametrize them as follows:

# cat rec.afl.tmpl

store(
  aio_input('$T_FILE', 'num_attributes=1'),
  rec_file);

insert(
  redimension(
    apply(
      rec_file,
      val, int64(a0),
      line, tuple_no,
      num, $T_NUM),
    rec),
  rec);

The file above is a template with two parameters, $T_FILE and $T_NUM. The two queries can be merged into a single query by moving the aio_input operator inside the apply operator (replacing rec_file) and removing the store operator. We leave verifying this as an exercise for the reader; a sketch follows.
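As a starting point, the merged template might look like this (the file name rec-merged.afl.tmpl is our own; note that the intermediate rec_file array is no longer needed):

# cat rec-merged.afl.tmpl

insert(
  redimension(
    apply(
      aio_input('$T_FILE', 'num_attributes=1'),
      val, int64(a0),
      line, tuple_no,
      num, $T_NUM),
    rec),
  rec);

Using the envsubst Linux command, we can easily instantiate the original two-query template and provide values for the two parameters. The result is a valid query: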

# env T_FILE="/rec-1.txt" T_NUM=1 envsubst < rec.afl.tmpl

store(
  aio_input('/rec-1.txt', 'num_attributes=1'),
  rec_file);

insert(
  redimension(
    apply(
      rec_file,
      val, int64(a0),
      line, tuple_no,
      num, 1),
    rec),
  rec);

Notice how the template parameters have been replaced. Also, notice that if the file paths and names follow the same convention, we could use only one template parameter, $T_NUM, and construct the file path from it, as sketched below.
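For example, assuming the rec-<num>.txt naming convention used here, the store query in the template could become:

store(
  aio_input('/rec-$T_NUM.txt', 'num_attributes=1'),
  rec_file);

Now, we can easily run the resulting query by piping it through iquery: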

# env T_FILE="/rec-1.txt" T_NUM=1 envsubst < rec.afl.tmpl | iquery --afl
AFL%
AFL% store(
       aio_input('/rec-1.txt', 'num_attributes=1'),
       rec_file);
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '10',null
{1,0,0} '20',null
{2,0,0} '30',null
AFL%
AFL% insert(
       redimension(
         apply(
           rec_file,
           val, int64(a0),
           line, tuple_no,
           num, 1),
         rec),
       rec);
{line,num} val
{0,1} 10
{1,1} 20
{2,1} 30

Bash Script

Finally, we can write a Bash script which loops over the files and instantiates the query template for each file. The resulting queries are collected into a temporary file which is then run against the database. The Bash script follows:

# cat rec.sh
#!/bin/bash

iquery --afl --query \
       'create array rec<val:int64> [line, num]'

dir=$(dirname "$(readlink -f "$0")")
query_file=$(mktemp)

for file in "$dir"/rec-*.txt
do
    num=${file//[^0-9]/}

    env T_FILE="$file" T_NUM="$num" envsubst \
        < "$dir"/rec.afl.tmpl >> "$query_file"
done

iquery --afl --query-file "$query_file"
rm "$query_file"

The script starts by creating the rec array. It then extracts the metadata from each file name into the num variable. The file names used in this case are simple and the metadata is easy to extract; note, however, that ${file//[^0-9]/} strips the non-digit characters from the entire path, so a digit anywhere in the directory name would corrupt num. In practice, more robust pattern matching might be needed, as sketched below. The variable query_file holds the name of a temporary file which collects the queries needed to load all the data files. These queries are run against the database at the end of the script, and the temporary file is removed.
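For example, a stricter, hypothetical alternative for the num assignment uses a Bash regular expression:

# Accept only files named rec-<digits>.txt; skip anything else.
if [[ $file =~ rec-([0-9]+)\.txt$ ]]
then
    num=${BASH_REMATCH[1]}
else
    continue
fi

The output from executing the script looks like this: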

# ./rec.sh
Query was executed successfully
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '10',null
{1,0,0} '20',null
{2,0,0} '30',null
{line,num} val
{0,1} 10
{1,1} 20
{2,1} 30
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '40',null
{1,0,0} '50',null
{2,0,0} '60',null
{line,num} val
{0,1} 10
{0,2} 40
{1,1} 20
{1,2} 50
{2,1} 30
{2,2} 60

The last {line,num} val listing contains the final contents of the rec array created at the beginning of the post. The array holds the data from both input files, with the proper line and num dimensions set. In practice, you might want to run the iquery command with the --no-fetch argument so that the data is not fetched and printed at the output.
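For example, the last iquery invocation in the script could become:

iquery --afl --no-fetch --query-file "$query_file"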

Please note that while we now have a general process for loading multiple files into SciDB, we are not taking advantage of the distributed loading capabilities of the accelerated_io_tools plugin: we are essentially loading the data one file at a time.

The input data, the query template, and the Bash script are available here.