Loading data into a database is a fundamental operation, and SciDB is no exception. Built into SciDB is the vanilla load operator documented here. On top of that, Paradigm4 provides an advanced loading operator as part of their accelerated_io_tools plugin. Like SciDB itself, this plugin is open-source; unlike SciDB, however, it is hosted on GitHub and contributions are welcome. The loading operator provided in this plugin has several advanced features, including:
- Fully distributed parsing and packing
- Loading from multiple files
- Error tolerance
We highly recommend taking the time to understand its usage. The easiest way to install this plugin is to first install the dev_tools plugin (see GitHub; update: see this post for a discussion). Using the dev_tools plugin, the accelerated_io_tools plugin can be installed with the AFL query install_github('paradigm4/accelerated_io_tools').
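For completeness, the end-to-end installation might look like the session below. This is only a sketch: it assumes dev_tools is already installed, and the exact load_library steps may differ depending on your SciDB version and setup.
# iquery --afl
AFL% load_library('dev_tools');
AFL% install_github('paradigm4/accelerated_io_tools');
AFL% load_library('accelerated_io_tools');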
In this multi-part post, we discuss various use cases of loading data using this plugin. In this part, we discuss how to load data from multiple files, as well as how to load the metadata encoded in the file names.
Multiple Files with Metadata
As already mentioned, the loading operator provided in accelerated_io_tools (i.e., aio_input) is capable of loading data from multiple files at once (see the Loading from multiple files and Load from one or multiple files sections in the plugin documentation). To use this feature, the user is required to list all the file names as part of the operator arguments. Without a doubt, this has its uses, and it has the definite advantage of running distributed.
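Based on the plugin documentation, such a multi-file call looks roughly like the sketch below. The file paths and instance IDs are assumptions for illustration and must match your data layout and cluster configuration.
# iquery --afl
AFL% aio_input('paths=/data/rec-1.txt;/data/rec-2.txt',
               'instances=0;1',
               'num_attributes=1');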
On the other hand, imagine a situation where you are loading data from a possibly large number of files. In this case, listing all the file names as part of the query is neither practical nor, in some cases, even possible. Moreover, imagine there is additional metadata encoded in the file names that you would like to capture in the database as well.
Assume, for example, that our data is in two files as follows:
# cat rec-1.txt
10
20
30
# cat rec-2.txt
40
50
60
We would like to load both the data in the files and the metadata in the file names (i.e., 1 and 2). The destination array has two dimensions. The data in the files goes in the first dimension, line, while the metadata in the file names goes in the second dimension, num. The schema of the destination array looks like this:
# iquery --afl
AFL% create array rec<val:int64> [line, num];
Query was executed successfully
Load One File
Let’s first see how we can load the data from one file; we then extend the procedure to multiple files. For loading the data from one file, we can start with:
# iquery --afl
AFL% store(
aio_input('/rec-1.txt', 'num_attributes=1'),
rec_file);
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '10',null
{1,0,0} '20',null
{2,0,0} '30',null
Refer to the Trivial end-to-end example in the plugin documentation to understand the parameters and the output. Notice that the operator requires the full path to the data file. The data is stored temporarily in the rec_file array. Next, we have to redimension it and add the metadata from the file name:
# iquery --afl
AFL% insert(
redimension(
apply(
rec_file,
val, int64(a0),
line, tuple_no,
num, 1),
rec),
rec);
{line,num} val
{0,1} 10
{1,1} 20
{2,1} 30
We did quite a bit in this query, so let’s walk through it step by step. The inner-most operator is apply (see documentation). This operator takes the rec_file array as input and prepares the attributes and dimensions for our destination array rec. Remember, the rec array has one attribute (val) and two dimensions (line and num). The apply operator performs the following operations:
- Converts the values read from the file to integers and stores them in the val attribute (... val, int64(a0), ...).
- Copies the tuple_no dimension to the line attribute (... line, tuple_no, ...).
- Sets the num attribute to 1 (... num, 1), ...).
The apply operator is followed by the redimension operator (see documentation). This operator converts the input array to the structure of the rec array by dropping extra dimensions and attributes and by mapping attributes to dimensions (line and num). As an exercise, run only the apply operator (reproduced below) and examine the output. It has three dimensions and five attributes.
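For reference, the exercise query is just the apply step on its own:
# iquery --afl
AFL% apply(
       rec_file,
       val, int64(a0),
       line, tuple_no,
       num, 1);
The three dimensions are tuple_no, dst_instance_id, and src_instance_id, and the five attributes are a0 and error from aio_input, plus the newly computed val, line, and num.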
The output of the redimension operator is passed as input to the insert operator (see documentation), which inserts the data into the rec array. The resulting rec array is printed at the console. Notice the values for the attribute (val) and the two dimensions (line and num).
Automated Loading
Now that we know how to load one file and store the metadata encoded in the file name, let’s see how we can automate this process and load a possibly large number of files. Essentially, we have to write a Bash script which generates two queries for each input file. The two queries are the store and insert queries from above, customized for each file. The customization consists of reading the right input file from disk and setting the right value for the num dimension (hard-coded to 1 in the example above).
Query Template
The first step is to list the two queries in a file and parametrize them as follows:
# cat rec.afl.tmpl
store(
aio_input('$T_FILE', 'num_attributes=1'),
rec_file);
insert(
redimension(
apply(
rec_file,
val, int64(a0),
line, tuple_no,
num, $T_NUM),
rec),
rec);
The file above is a template with two parameters, $T_FILE and $T_NUM. (We could merge the two queries into a single query by moving the aio_input operator inside the apply operator, replacing rec_file, and removing the store operator. We leave this as an exercise to the reader; a possible solution is sketched at the end of this section.) Using the envsubst Linux command, we can easily instantiate the template and provide values for these two parameters. The result is a valid query:
# env T_FILE="/rec-1.txt" T_NUM=1 envsubst < rec.afl.tmpl
store(
aio_input('/rec-1.txt', 'num_attributes=1'),
rec_file);
insert(
redimension(
apply(
rec_file,
val, int64(a0),
line, tuple_no,
num, 1),
rec),
rec);
Notice how the template parameters have been replaced. Also notice that if the file paths and names follow the same convention, we could use only one template parameter, $T_NUM, and construct the file path from it. Now we can easily run the resulting query by piping it through iquery:
# env T_FILE="/rec-1.txt" T_NUM=1 envsubst < rec.afl.tmpl | iquery --afl
AFL%
AFL% store(
aio_input('/rec-1.txt', 'num_attributes=1'),
rec_file);
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '10',null
{1,0,0} '20',null
{2,0,0} '30',null
AFL%
AFL% insert(
redimension(
apply(
rec_file,
val, int64(a0),
line, tuple_no,
num, 1),
rec),
rec);
{line,num} val
{0,1} 10
{1,1} 20
{2,1} 30
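As promised, here is a sketch of the merged single-query version of the template, which inlines the aio_input call in place of the rec_file array and drops the store step. It is untested, and the file name rec-merged.afl.tmpl is hypothetical:
# cat rec-merged.afl.tmpl
insert(
  redimension(
    apply(
      aio_input('$T_FILE', 'num_attributes=1'),
      val, int64(a0),
      line, tuple_no,
      num, $T_NUM),
    rec),
  rec);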
Bash Script
Finally, we can write a Bash script which loops over the files and instantiates the query template for each file. The resulting queries are collected into a temporary file which is then run against the database. The Bash script follows:
# cat rec.sh
#!/bin/bash

# Create the destination array.
iquery --afl --query \
    'create array rec<val:int64> [line, num]'

# Directory containing this script and the data files.
dir=$(dirname $(readlink -f $0))

# Temporary file collecting the generated queries.
query_file=$(mktemp)

for file in $dir/rec-*.txt
do
    # Extract the metadata encoded in the file name by
    # stripping all non-digit characters.
    num=${file//[^0-9]/}

    # Instantiate the query template for this file.
    env T_FILE="$file" T_NUM="$num" envsubst \
        < $dir/rec.afl.tmpl >> $query_file
done

# Run all the generated queries, then clean up.
iquery --afl --query-file $query_file
rm "$query_file"
The script starts by creating the rec array. For each file, it extracts the metadata from the file name into the num variable. The file names used in this case are simple and the metadata is easy to extract; in practice, more complicated pattern matching might be needed. The query_file variable holds the name of a temporary file which contains the queries needed to load all the data files. These queries are run against the database at the end of the script, and the temporary file is removed. The output from executing the script looks like this:
# ./rec.sh
Query was executed successfully
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '10',null
{1,0,0} '20',null
{2,0,0} '30',null
{line,num} val
{0,1} 10
{1,1} 20
{2,1} 30
{tuple_no,dst_instance_id,src_instance_id} a0,error
{0,0,0} '40',null
{1,0,0} '50',null
{2,0,0} '60',null
{line,num} val
{0,1} 10
{0,2} 40
{1,1} 20
{1,2} 50
{2,1} 30
{2,2} 60
The last {line,num} val listing contains the final contents of the rec array created at the beginning of the post. The array holds the data from both input files, with the proper line and num dimensions set. In practice, you might want to run the iquery command with the --no-fetch argument so that the data is not fetched and printed to the output.
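With that change, the query-running line of the script would read:
iquery --afl --no-fetch --query-file $query_file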
Please note that while we now have a general process for loading multiple files into SciDB, we are not taking advantage of the distributed loading capabilities of the accelerated_io_tools plugin; we are essentially loading the data one file at a time.
The input data, the query template, and the Bash script are available here.