In part 1 of this multi-part post, we gave a short overview of the accelerated_io_tools plugin provided by Paradigm4 and showed how to automate loading multiple files by capturing both the data in the files and the data in the file names. In this post, we go a step further and show how to capture additional non-integer data from the file name. We use the non-integer data to simulate a non-integer dimension, with the help of an additional reference array.

Multiple Files with Non-Integer Metadata

Suppose, for example, that our data is in three files as follows:

# cat rec-A-1.txt
10
20
30
# cat rec-A-2.txt
40
50
60
# cat rec-B-1.txt
70
80
90

We would like to load both the data in the files and the data in the file names (i.e., A, B, 1, and 2). The destination array has one attribute and three dimensions. The line number of each value in its file is used as the first dimension, line. The letter in the file name is used as the second dimension, letter; an additional reference array stores the string value of each letter. Finally, the number in the file name is used as the third dimension, num. The schemas of the destination array, rec, and the reference array, rec_letter, look like this:

# iquery --afl
AFL% create array rec<val:int64> [line, letter, num];
Query was executed successfully
AFL% create array rec_letter<val:string> [letter];
Query was executed successfully
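
Since we did not specify dimension bounds or chunk sizes, SciDB fills in defaults. We can inspect the complete schema it created with the show operator; the output below is only a sketch, as the exact bounds and chunk sizes printed depend on the SciDB version:

# iquery --afl
AFL% show(rec);
{i} schema
{0} 'rec<val:int64> [line=0:*; letter=0:*; num=0:*]'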

Load One File

Let’s have a look at how to load the data from one file, for example, the rec-A-1.txt file. First, we have to insert the value A into the reference array, rec_letter. To do this, we use the build operator (see documentation) to create an array containing A. Then we redimension this new array to match the shape of the rec_letter array:

# iquery --afl
AFL% insert(
       redimension(
         build(<val:string> [letter=0:0,?,?], 'A'),
         rec_letter),
       rec_letter);
{letter} val
{0} 'A'

Now we use the value 0 for the letter dimension and insert the rest of the data into the rec array:

# iquery --afl
AFL% insert(
       redimension(
         apply(
           aio_input('/rec-A-1.txt', 'num_attributes=1'),
           val, int64(a0),
           line, tuple_no,
           letter, 0,
           num, 1),
         rec),
       rec);
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30

Notice how the three dimensions (line, letter, and num) and the attribute (val) are set in the destination array. We can display the actual letter value using a cross_join (see documentation) between the rec and the rec_letter arrays:

# iquery --afl
AFL% cross_join(
       rec,
       rec_letter,
       rec.letter,
       rec_letter.letter);
{line,letter,num} val,val
{0,0,1} 10,'A'
{1,0,1} 20,'A'
{2,0,1} 30,'A'

Look-up in Reference Array

A first step to generalize this would be to look up the letter A in the rec_letter array and use its index instead of hard-coding it. We use the index_lookup operator (see documentation). This operator requires an input array, so we build an array around the A value:

# iquery --afl
AFL% index_lookup(
       build(<k:string> [i=0:0,?,?], 'A'),
       rec_letter, k);
{i} k,k_index
{0} 'A',0
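
The lookup yields null when the value is not present. For example, looking up a letter that does not occur in rec_letter (Z is just a hypothetical value here; the output is sketched assuming rec_letter contains only A):

# iquery --afl
AFL% index_lookup(
       build(<k:string> [i=0:0,?,?], 'Z'),
       rec_letter, k);
{i} k,k_index
{0} 'Z',null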

The k_index attribute contains the value that we need for the letter dimension; as shown above, it is null when the letter is not found in rec_letter. To append the k_index value to the data read from the file, we have to do a cross_join. We then use apply to assign k_index to letter. The final query looks like this:

# iquery --afl
AFL% insert(
       redimension(
         apply(
           cross_join(
             aio_input('/rec-A-1.txt', 'num_attributes=1'),
             index_lookup(
               build(<k:string> [i=0:0,?,?], 'A'),
               rec_letter, k)),
           val, int64(a0),
           line, tuple_no,
           letter, k_index,
           num, 1),
         rec),
       rec);
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30

The result is identical to that of the previous insert query, except that we did not hard-code the 0 value for the letter dimension.

Insert Sequentially If Not Found

The next step in generalizing this load is to remove the hard-coded 0 dimension value we used when inserting A into the rec_letter array. Instead, we assume that “letters” are inserted sequentially into the rec_letter array, starting at position 0, so new “letters” should be inserted at the “end” of the array. We do this by counting how many cells the rec_letter array contains and inserting the new “letter” at the “count” position. In SciDB, this is done using the aggregate, cross_join, and apply operators:

# iquery --afl
AFL% insert(
       redimension(
         apply(
           cross_join(
             build(<val:string> [i=0:0,?,?], 'A'),
             aggregate(rec_letter, count(*))),
           letter, int64(count)),
         rec_letter),
       rec_letter);
{letter} val
{0} 'A'
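
Note that this count-based insert is not idempotent. If we ran the same query a second time, count would be 1, so a second copy of A would land at position 1; scanning rec_letter would then show something like this sketch:

# iquery --afl
AFL% scan(rec_letter);
{letter} val
{0} 'A'
{1} 'A'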

Notice that we used i for the dimension name in the build operator because we need the name letter later, when assigning the count. Also notice that, as sketched above, running this query multiple times inserts multiple copies of A into the rec_letter array, each at an increasing position on the letter dimension (i.e., 1, 2, etc.).

The final step in generalizing the load of this file is to insert the letter A into the rec_letter array only if it does not already exist. We do this using the index_lookup and filter operators (see documentation). That is, we first search for A in rec_letter. If A is not found, the index attribute is set to null. Using filter, we retain only the records whose index attribute is null. In other words, if A already exists in the array, the insert query inserts 0 new cells (since they are all filtered out); otherwise, it inserts 1 new cell. The query is:

# iquery --afl
AFL% insert(
       redimension(
         apply(
           cross_join(
             filter(
               index_lookup(
                 build(<k:string> [i=0:0,?,?], 'A'),
                 rec_letter, k),
               k_index is null),
             aggregate(rec_letter, count(*))),
           val, k,
           letter, int64(count)),
         rec_letter),
       rec_letter);
{letter} val
{0} 'A'

Notice that we used k for the attribute name in the build operator in order to avoid a name collision when calling the index_lookup operator. Re-running this query multiple times does not result in multiple copies of A being inserted. For completeness, here are the two final and most general queries to load one data file:

# iquery --afl
AFL% insert(
       redimension(
         apply(
           cross_join(
             filter(
               index_lookup(
                 build(<k:string> [i=0:0,?,?], 'A'),
                 rec_letter, k),
               k_index is null),
             aggregate(rec_letter, count(*))),
           val, k,
           letter, int64(count)),
         rec_letter),
       rec_letter);
{letter} val
{0} 'A'
AFL% insert(
       redimension(
         apply(
           cross_join(
             aio_input('/rec-A-1.txt', 'num_attributes=1'),
             index_lookup(
               build(<k:string> [i=0:0,?,?], 'A'),
               rec_letter, k)),
           val, int64(a0),
           line, tuple_no,
           letter, k_index,
           num, 1),
         rec),
       rec);
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30

Notice that the only information hard-coded in the queries is the A (in the two build operators) and the 1 (assigned to num in the last apply operator), both extracted from the file name.

Automated Loading

Now that we have a general pair of queries for loading one file, we can automate the process for loading a possibly large number of files. The process is very similar to the one described in part 1 of this multi-part post. Essentially, we need a query template for inserting the data from one file and a Bash script for iterating over the files. The template is identical to the queries built in the previous section, except for three parameters: one is the file name ($T_FILE) and the other two correspond to the meta-data values extracted from it ($T_LETTER and $T_NUM):

# cat rec.afl.tmpl

insert(
  redimension(
    apply(
      cross_join(
        filter(
          index_lookup(
            build(<k:string> [i=0:0,?,?], '$T_LETTER'),
            rec_letter, k),
          k_index is null),
        aggregate(rec_letter, count(*))),
      val, k,
      letter, int64(count)),
    rec_letter),
  rec_letter);

insert(
  redimension(
    apply(
      cross_join(
        aio_input('$T_FILE', 'num_attributes=1'),
        index_lookup(
          build(<k:string> [i=0:0,?,?], '$T_LETTER'),
          rec_letter, k)),
      val, int64(a0),
      line, tuple_no,
      letter, k_index,
      num, $T_NUM),
    rec),
  rec);

Compared with the Bash script used in part 1, the Bash script we use here extracts the two meta-data values from the file name (the “letter” and the “number”) and uses them to parameterize the query template:

# cat rec.sh
#!/bin/bash

iquery --afl --query \
       'create array rec<val:int64> [line, letter, num];
        create array rec_letter<val:string> [letter]'

dir=$(dirname "$(readlink -f "$0")")
query_file=$(mktemp)

for file in "$dir"/rec-*.txt
do
    num=${file//[^0-9]/}
    letter=${file//[^A-Z]/}

    env T_FILE="$file" T_NUM="$num" T_LETTER="$letter" envsubst \
        < "$dir"/rec.afl.tmpl >> "$query_file"
done

iquery --afl --query-file "$query_file"
rm "$query_file"

Note that we are using very simple pattern-matching expressions to extract the meta-data from the file names (the num and letter variables); because they operate on the full path, any digits or capital letters in the directory name would leak into the extracted values. In practice these expressions might need to be adjusted, for example as sketched below.
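
A stricter, hypothetical alternative is to match the expected rec-<LETTER>-<NUMBER>.txt shape against the base name only, so that stray digits or capital letters in the directory path are ignored (a sketch using Bash regular expression matching):

base=$(basename "$file")
if [[ $base =~ ^rec-([A-Z])-([0-9]+)\.txt$ ]]
then
    letter=${BASH_REMATCH[1]}
    num=${BASH_REMATCH[2]}
else
    continue    # skip files that do not match the expected pattern
fi

Either way, the output from executing the script looks like this: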

# ./rec.sh
Query was executed successfully
Query was executed successfully
{letter} val
{0} 'A'
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30
{letter} val
{0} 'A'
{line,letter,num} val
{0,0,1} 10
{0,0,2} 40
{1,0,1} 20
{1,0,2} 50
{2,0,1} 30
{2,0,2} 60
{letter} val
{0} 'A'
{1} 'B'
{line,letter,num} val
{0,0,1} 10
{0,0,2} 40
{0,1,1} 70
{1,0,1} 20
{1,0,2} 50
{1,1,1} 80
{2,0,1} 30
{2,0,2} 60
{2,1,1} 90

Notice that even though we have two input files with A in the file name (rec-A-1.txt and rec-A-2.txt), A appears only once in the rec_letter array. Also, the two arrays (rec_letter and rec) are listed in the output after each insert query, so you can see them “growing” as the script progresses.
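
As a final sanity check, we can count the cells in the rec array; with the three files above (three values each), we expect 9 cells. A sketch of the query and its expected output:

# iquery --afl
AFL% aggregate(rec, count(*));
{i} count
{0} 9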

The input data, the query template, and the Bash script are available here.