In part 1 of this multi-part post, we gave a short overview of the accelerated_io_tools plugin provided by Paradigm4 and showed how to automate loading multiple files by capturing both the data in the files and the data in the file names. In this post, we go a step further and show how to capture additional non-integer data from the file name. We use the non-integer data to simulate a non-integer dimension (by using an additional reference array).
Multiple Files with Non-Integer Metadata
Suppose for example that our data is in three files as follows:
# cat rec-A-1.txt
10
20
30
# cat rec-A-2.txt
40
50
60
# cat rec-B-1.txt
70
80
90
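To follow along, the three sample files can be recreated with a few printf commands (the file names and contents match the listings above; the working directory is an assumption):

```shell
# Create the three sample input files used throughout this post.
printf '10\n20\n30\n' > rec-A-1.txt
printf '40\n50\n60\n' > rec-A-2.txt
printf '70\n80\n90\n' > rec-B-1.txt
```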
We would like to load both the data in the files and the data in the file names (i.e., A, B, 1, and 2). The destination array has one attribute and three dimensions. The line number of the value in the file is used as the first dimension, line. The letter in the file name is used as the second dimension, letter. An additional reference array is used to store the string value of the letters. Finally, the number in the file name is used as the third dimension, num. The schemas of the destination array, rec, and the reference array, rec_letter, look like this:
# iquery --afl
AFL% create array rec<val:int64> [line, letter, num];
Query was executed successfully
AFL% create array rec_letter<val:string> [letter];
Query was executed successfully
Load One File
Let’s have a look at how to load data from one file. For example, let’s load the data from the rec-A-1.txt file. First, we have to insert the A value into the reference array, rec_letter. To do this, we use the build operator (see documentation) to build an array containing A. Then we redimension this new array to match the shape of the rec_letter array:
# iquery --afl
AFL% insert(
redimension(
build(<val:string> [letter=0:0,?,?], 'A'),
rec_letter),
rec_letter);
{letter} val
{0} 'A'
Now we use the value 0 for the letter dimension and insert the rest of the data in the rec array:
# iquery --afl
AFL% insert(
redimension(
apply(
aio_input('/rec-A-1.txt', 'num_attributes=1'),
val, int64(a0),
line, tuple_no,
letter, 0,
num, 1),
rec),
rec);
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30
Notice how the three dimensions (line, letter, and num) and the attribute (val) are set in the destination array. We can display the actual letter value using a cross_join (see documentation) between the rec and rec_letter arrays:
# iquery --afl
AFL% cross_join(
rec,
rec_letter,
rec.letter,
rec_letter.letter);
{line,letter,num} val,val
{0,0,1} 10,'A'
{1,0,1} 20,'A'
{2,0,1} 30,'A'
Look-up in Reference Array
A first step to generalize this would be to look up the A letter in the rec_letter array and use its index instead of hard-coding it. We use the index_lookup operator (see documentation). This operator requires an input array, so we build an array around the A value:
# iquery --afl
AFL% index_lookup(
build(<k:string> [i=0:0,?,?], 'A'),
rec_letter, k);
{i} k,k_index
{0} 'A',0
The k_index attribute contains the value that we need for the letter dimension. The value would be null if A is not found in rec_letter. To append the k_index value to the data read from the file, we have to do a cross_join. We then use apply to rename k_index to letter. The final query looks like this:
# iquery --afl
AFL% insert(
redimension(
apply(
cross_join(
aio_input('/rec-A-1.txt', 'num_attributes=1'),
index_lookup(
build(<k:string> [i=0:0,?,?], 'A'),
rec_letter, k)),
val, int64(a0),
line, tuple_no,
letter, k_index,
num, 1),
rec),
rec);
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30
The result is identical to that of the previous insert query, except that we did not hard-code the 0 value for the letter dimension.
Insert Sequentially If Not Found
The next step in generalizing this load is to remove the hard-coded 0 dimension value we used to insert A in the rec_letter array. Instead, we assume that “letters” are sequentially inserted into the rec_letter array starting at dimension 0, so new “letters” should be inserted at the “end” of the array. We do this by counting how many cells are in the rec_letter array and inserting the new “letter” at the “count” position. In SciDB, this is done using the cross_join and apply operators:
# iquery --afl
AFL% insert(
redimension(
apply(
cross_join(
build(<val:string> [i=0:0,?,?], 'A'),
aggregate(rec_letter, count(*))),
letter, int64(count)),
rec_letter),
rec_letter);
{letter} val
{0} 'A'
Notice that we used i for the dimension name in the build operator because we need to use letter later when assigning the count. Also, notice that if we run this query multiple times, multiple copies of A will be inserted in the rec_letter array, each time at an increasing position on the letter dimension (i.e., 1, 2, etc.). The final step to generalize loading this file is to insert the letter A in the rec_letter array only if it does not already exist. We do this by using the index_lookup and filter operators (see documentation). That is, we first search for A in rec_letter. If A is not found, the index attribute is set to null. Using filter, we retain only the records which have the index attribute set to null. In other words, if A already exists in the array, the insert query inserts 0 new cells (since they are filtered out); otherwise, it inserts 1 new cell. The query is:
# iquery --afl
AFL% insert(
redimension(
apply(
cross_join(
filter(
index_lookup(
build(<k:string> [i=0:0,?,?], 'A'),
rec_letter, k),
k_index is null),
aggregate(rec_letter, count(*))),
val, k,
letter, int64(count)),
rec_letter),
rec_letter);
{letter} val
{0} 'A'
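The lookup-or-append logic that this query implements can be sketched outside SciDB as well. Here is a hypothetical Bash analogy (not part of the actual load script) that models rec_letter as an associative array, assigns a new index only when the letter is missing, and uses the current size of the array as the “count” position:

```shell
#!/bin/bash
# Hypothetical sketch of the lookup-or-append pattern from the query above.
# rec_letter is modeled as an associative array mapping letter -> index
# (requires bash 4+ for associative arrays).
declare -A rec_letter

lookup_or_append() {
    local letter=$1
    if [ -z "${rec_letter[$letter]+x}" ]; then
        # Not found: append at position "count" (the current array size),
        # mirroring aggregate(rec_letter, count(*)) in the AFL query.
        rec_letter[$letter]=${#rec_letter[@]}
    fi
    echo "${rec_letter[$letter]}"
}

lookup_or_append A   # prints 0 (A is appended at index 0)
lookup_or_append A   # prints 0 (A is found; no new entry)
lookup_or_append B   # prints 1 (B is appended at index 1)
```

Re-running the lookup for the same letter never creates a second entry, which is exactly the property the filter on k_index is null gives the SciDB query.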
Notice that we used k for the attribute name in the build operator in order to avoid a name collision when calling the index_lookup operator. Re-running this query multiple times does not result in multiple copies of A being inserted. For completeness, the final and most general two queries to load one data file are:
# iquery --afl
AFL% insert(
redimension(
apply(
cross_join(
filter(
index_lookup(
build(<k:string> [i=0:0,?,?], 'A'),
rec_letter, k),
k_index is null),
aggregate(rec_letter, count(*))),
val, k,
letter, int64(count)),
rec_letter),
rec_letter);
{letter} val
{0} 'A'
AFL% insert(
redimension(
apply(
cross_join(
aio_input('/rec-A-1.txt', 'num_attributes=1'),
index_lookup(
build(<k:string> [i=0:0,?,?], 'A'),
rec_letter, k)),
val, int64(a0),
line, tuple_no,
letter, k_index,
num, 1),
rec),
rec);
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30
Notice that the only information hard-coded in the query is the A (in the two build operators) and the 1 (assigned to num in the last apply operator), both extracted from the file name.
Automated Loading
Now that we have a general query for loading one file, we automate the process for loading a possibly large number of files. The process is very similar to the one described in part 1 of this multi-part post. Essentially, we need a query template for inserting data from one file and a Bash script for iterating over the files. The template query is identical to the one built in the previous section, with three parameters: one for the file name ($T_FILE) and one for each of the two file meta-data arguments ($T_LETTER and $T_NUM):
# cat rec.afl.tmpl
insert(
redimension(
apply(
cross_join(
filter(
index_lookup(
build(<k:string> [i=0:0,?,?], '$T_LETTER'),
rec_letter, k),
k_index is null),
aggregate(rec_letter, count(*))),
val, k,
letter, int64(count)),
rec_letter),
rec_letter);
insert(
redimension(
apply(
cross_join(
aio_input('$T_FILE', 'num_attributes=1'),
index_lookup(
build(<k:string> [i=0:0,?,?], '$T_LETTER'),
rec_letter, k)),
val, int64(a0),
line, tuple_no,
letter, k_index,
num, $T_NUM),
rec),
rec);
Compared with the Bash script used in part 1, the Bash script we use here extracts the two meta-data values from the file name (the “letter” and the “number”) and uses them to parameterize the query template:
# cat rec.sh
#!/bin/bash
iquery --afl --query \
'create array rec<val:int64> [line, letter, num];
create array rec_letter<val:string> [letter]'
dir=$(dirname $(readlink -f $0))
query_file=`mktemp`
for file in $dir/rec-*.txt
do
num=${file//[^0-9]/}
letter=${file//[^A-Z]/}
env T_FILE="$file" T_NUM="$num" T_LETTER="$letter" envsubst \
< $dir/rec.afl.tmpl >> $query_file
done
iquery --afl --query-file $query_file
rm "$query_file"
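To see what the two parameter expansions in the script produce, here is a minimal standalone sketch (using just the base file name; in the script above, $file is a full path, so the directory part would contribute extra characters too):

```shell
file=rec-A-1.txt            # base file name only, for illustration
num=${file//[^0-9]/}        # delete every non-digit character     -> 1
letter=${file//[^A-Z]/}     # delete every non-upper-case letter   -> A
echo "$letter $num"         # prints: A 1
```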
Note that we are using very simple pattern-matching expressions to extract the meta-data from the file names (the num and letter variables). In practice, these might need to be adjusted. The output from executing the script looks like this:
# ./rec.sh
Query was executed successfully
Query was executed successfully
{letter} val
{0} 'A'
{line,letter,num} val
{0,0,1} 10
{1,0,1} 20
{2,0,1} 30
{letter} val
{0} 'A'
{line,letter,num} val
{0,0,1} 10
{0,0,2} 40
{1,0,1} 20
{1,0,2} 50
{2,0,1} 30
{2,0,2} 60
{letter} val
{0} 'A'
{1} 'B'
{line,letter,num} val
{0,0,1} 10
{0,0,2} 40
{0,1,1} 70
{1,0,1} 20
{1,0,2} 50
{1,1,1} 80
{2,0,1} 30
{2,0,2} 60
{2,1,1} 90
Notice that even though we have two input files with A in the file name (rec-A-1.txt and rec-A-2.txt), A only appears once in the rec_letter array. Also, the two arrays (rec_letter and rec) are listed in the output after each insert query, so you can see them “growing” in the output.
The input data, the query template, and the Bash script are available here.