In part 1 and part 2 of this multi-part post, we looked at how to load data from multiple files while capturing information present in the file name. In this post, we look at how to load data files organized as tables with a possibly large number of columns and header rows.
Simple Table-like Data
First, let us take a look at how to load a data file with a relatively small number of columns. Assume for example that our data has three columns:
# cat rec-1.txt
10 20 30
12 22 32
14 24 34
We use the accelerated_io_tools plugin. Part 1 of this multi-part post starts with a brief intro to this plugin. We assume that the plugin is installed and loaded. See the plugin documentation for installation and loading instructions. Using the aio_input operator with the num_attributes parameter (see documentation) set to 3, we can load our example file like this:
# iquery --afl
AFL% aio_input('/rec-1.txt', 'num_attributes=3');
{tuple_no,dst_instance_id,src_instance_id} a0,a1,a2,error
{0,0,0} '10','20','30',null
{1,0,0} '12','22','32',null
{2,0,0} '14','24','34',null
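The error attribute in the output is null for every record, which indicates that each line was parsed cleanly. As an additional sanity check, the number of loaded records can be compared against the number of lines in the file; a minimal sketch using the aggregate operator (for this file it should report 3):
# iquery --afl
AFL% aggregate(
       aio_input('/rec-1.txt', 'num_attributes=3'),
       count(*));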
Notice how the three columns became three attributes, a0, a1, and a2. The result can be converted to an array with one dimension and three numeric attributes using the apply (see documentation) and redimension (see documentation) operators:
# iquery --afl
AFL% redimension(
apply(
aio_input('/rec-1.txt', 'num_attributes=3'),
x, int64(a0),
y, int64(a1),
val, int64(a2)),
<x: int64, y: int64, val: int64>[tuple_no]);
{tuple_no} x,y,val
{0} 10,20,30
{1} 12,22,32
{2} 14,24,34
In the case where some of the columns should become dimensions, we simply adjust the template array used for the redimension operator:
# iquery --afl
AFL% redimension(
apply(
aio_input('/rec-1.txt', 'num_attributes=3'),
x, int64(a0),
y, int64(a1),
val, int64(a2)),
<val: int64>[x, y, tuple_no]);
{x,y,tuple_no} val
{10,20,0} 30
{12,22,1} 32
{14,24,2} 34
Notice that in the template array used for the redimension operator, x and y are no longer attributes, but dimensions.
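If the redimensioned result is needed beyond a single query, it can be stored. A minimal sketch (the target array name rec_xy is ours):
# iquery --afl
AFL% create array rec_xy<val:int64>[x, y, tuple_no];
AFL% store(
       redimension(
         apply(
           aio_input('/rec-1.txt', 'num_attributes=3'),
           x, int64(a0),
           y, int64(a1),
           val, int64(a2)),
         rec_xy),
       rec_xy);
Passing the named array rec_xy as the redimension template plays the same role as the inline schema used above.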
Large Table-like Data
The queries from the previous section work well for data with a relatively small number of columns, where the columns can be enumerated and manipulated directly. In the case where the data has a large number of columns (i.e., tens or hundreds of columns), enumerating and manipulating the columns directly is not practical. A more practical solution is to have an additional dimension along the columns of the data. The aio_input operator comes with a split_on_dimension parameter (see documentation) which allows us to do exactly that. Let us load the same data file from the previous section:
# iquery --afl
AFL% aio_input('/rec-1.txt', 'num_attributes=3', 'split_on_dimension=1');
{tuple_no,dst_instance_id,src_instance_id,attribute_no} a
{0,0,0,0} '10'
{0,0,0,1} '20'
{0,0,0,2} '30'
{0,0,0,3} null
{1,0,0,0} '12'
{1,0,0,1} '22'
{1,0,0,2} '32'
{1,0,0,3} null
{2,0,0,0} '14'
{2,0,0,1} '24'
{2,0,0,2} '34'
{2,0,0,3} null
Notice how we have a fourth dimension, attribute_no, which stores the column index for each data element. Assuming our original data is a matrix, we can redimension the result like this:
# iquery --afl
AFL% redimension(
between(
apply(
aio_input('/rec-1.txt', 'num_attributes=3', 'split_on_dimension=1'),
i, tuple_no,
j, attribute_no,
val, int64(a)),
null, null, null, null,
null, null, null, 2),
<val:int64>[i, j]);
{i,j} val
{0,0} 10
{0,1} 20
{0,2} 30
{1,0} 12
{1,1} 22
{1,2} 32
{2,0} 14
{2,1} 24
{2,2} 34
Besides the apply and redimension operators, we also use the between operator (see documentation). The between operator helps us select just the data positions of the attribute_no dimension and discard the last position, which stores the data-loading errors. Notice that, even though the example data is small, we do not enumerate the columns of the data as in the previous section; instead, we only use their count. So, the queries presented in this section can be used for use cases where the input data has a large number of columns.
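As a side note, the same positions can also be selected with the filter operator, whose expression may reference the attribute_no dimension directly; a sketch (between is generally preferable here, since it selects by coordinates rather than evaluating a predicate for every cell):
# iquery --afl
AFL% redimension(
       apply(
         filter(
           aio_input('/rec-1.txt', 'num_attributes=3', 'split_on_dimension=1'),
           attribute_no < 3),
         i, tuple_no,
         j, attribute_no,
         val, int64(a)),
       <val:int64>[i, j]);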
Large Table-like Data with Header Row
Let us take our example a step further and assume that our data contains a header row with reference values for each column. For example, the header could contain the column position on a continuous dimension:
# cat rec-2-1.txt
1.1 1.3 1.5
10 20 30
12 22 32
14 24 34
Viewed as a table:
1.1 | 1.3 | 1.5 |
---|---|---|
10 | 20 | 30 |
12 | 22 | 32 |
14 | 24 | 34 |
Moreover, different data files can contain different sets of columns. For example:
# cat rec-2-2.txt
1.1 1.2 1.3
16 26 36
18 28 38
Viewed as a table:
1.1 | 1.2 | 1.3 |
---|---|---|
16 | 26 | 36 |
18 | 28 | 38 |
The goal is to keep track of the column header value for each data value. So, in SciDB, the two examples would be represented in arrays like this:
Row\Col | 1 | 2 | 3 |
---|---|---|---|
1 | (1.1, 10) | (1.3, 20) | (1.5, 30) |
2 | (1.1, 12) | (1.3, 22) | (1.5, 32) |
3 | (1.1, 14) | (1.3, 24) | (1.5, 34) |
and:
Row\Col | 1 | 2 | 3 |
---|---|---|---|
1 | (1.1, 16) | (1.2, 26) | (1.3, 36) |
2 | (1.1, 18) | (1.2, 28) | (1.3, 38) |
As such, let us first create the array for storing this data:
# iquery --afl
AFL% create array rec<pos:float, val:int64>[row, col];
Query was executed successfully
Moreover, we assume that the number of columns is potentially large and avoid referring to or manipulating columns directly. We start by loading the entire file into a temporary array:
# iquery --afl
AFL% store(
aio_input('/rec-2-1.txt', 'num_attributes=3', 'split_on_dimension=1'),
rec_file);
{tuple_no,dst_instance_id,src_instance_id,attribute_no} a
{0,0,0,0} '1.1'
{0,0,0,1} '1.3'
{0,0,0,2} '1.5'
{0,0,0,3} null
{1,0,0,0} '10'
{1,0,0,1} '20'
{1,0,0,2} '30'
{1,0,0,3} null
{2,0,0,0} '12'
{2,0,0,1} '22'
{2,0,0,2} '32'
{2,0,0,3} null
{3,0,0,0} '14'
{3,0,0,1} '24'
{3,0,0,2} '34'
{3,0,0,3} null
Next, we extract only the header of the table and store it in another array:
# iquery --afl
AFL% store(
redimension(
apply(
between(
rec_file,
0, null, null, null,
0, null, null, 2),
pos, float(a),
col, attribute_no),
<pos:float>[col]),
rec_head);
{col} pos
{0} 1.1
{1} 1.3
{2} 1.5
Notice how we use the between operator to extract the values where tuple_no is 0, and we only extract the first three positions of the attribute_no dimension (null to 2). We do the same for the body of the table:
# iquery --afl
AFL% store(
redimension(
apply(
between(
rec_file,
1, null, null, null,
null, null, null, 2),
val, int64(a),
row, tuple_no,
col, attribute_no),
<val:int64>[row, col]),
rec_body);
{row,col} val
{1,0} 10
{1,1} 20
{1,2} 30
{2,0} 12
{2,1} 22
{2,2} 32
{3,0} 14
{3,1} 24
{3,2} 34
Finally, we obtain the desired array with the help of the cross_join operator (see documentation):
# iquery --afl
AFL% store(
redimension(
cross_join(rec_body, rec_head, rec_body.col, rec_head.col),
rec),
rec);
{row,col} pos,val
{1,0} 1.1,10
{1,1} 1.3,20
{1,2} 1.5,30
{2,0} 1.1,12
{2,1} 1.3,22
{2,2} 1.5,32
{3,0} 1.1,14
{3,1} 1.3,24
{3,2} 1.5,34
For optimal performance, we place the larger array first when calling the cross_join operator. Notice that the columns start at index 0 but the rows start at index 1. This is because the body of the table starts at tuple_no 1, the header occupying tuple_no 0. Adjusting the columns or the rows so that they start at the same index can be done using the apply operator. We leave this as an exercise to the reader.
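One possible sketch of such an adjustment, shifting the rows so that both dimensions start at 0 (the names row_0 and rec_0 are ours):
# iquery --afl
AFL% store(
       redimension(
         apply(rec, row_0, row - 1),
         <pos:float, val:int64>[row_0, col]),
       rec_0);
Once the final array is stored, the intermediate arrays rec_file, rec_head, and rec_body can be dropped with the remove operator.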
Buffer Size Caveat
In the previous queries, we assume that the tuple_no values returned by the aio_input operator are unique and dense. Let us have a look at another example where the data file is slightly bigger:
# cat rec-3.txt
10 20 30
11 21 31
12 22 32
13 23 33
14 24 34
15 25 35
We load this data using the same aio_input operator, but we specify a small value for the buffer_size parameter (see documentation):
# iquery --afl
AFL% aio_input('/rec-3.txt', 'num_attributes=3', 'buffer_size=10');
{tuple_no,dst_instance_id,src_instance_id} a0,a1,a2,error
{0,0,0} '10','20','30',null
{1,0,0} '11','21','31',null
{0,1,0} '12','22','32',null
{10,0,0} '13','23','33',null
{10,1,0} '14','24','34',null
{20,0,0} '15','25','35',null
Let us examine the tuple_no values. They are neither unique nor dense. To understand what happened, we need to understand how the buffer_size parameter works. buffer_size specifies the size (in bytes) of a buffer used to split the input data. Each buffer of data is distributed across the cluster (in a round-robin fashion) and loaded into an array chunk. If the chunk_size parameter (see documentation) is not explicitly specified, it is set to the value of buffer_size.
In our example, the buffer_size is set to 10 bytes, the chunk_size is implicitly set to 10, and we have two SciDB instances in our cluster. As a consequence, the first two lines in the file fill the first buffer and are loaded by the first SciDB instance, in its first array chunk. Notice the tuple_no set to 0 and 1, respectively, and the dst_instance_id set to 0. The third line in the file fills the second buffer and is loaded by the second SciDB instance, in its first array chunk. Next, the fourth line is loaded by the first SciDB instance, in its second array chunk. Notice the tuple_no set to 10 (because chunk_size is 10) and the dst_instance_id set to 0. The process continues with the fifth line loaded on the second instance, in its second chunk, and the sixth line loaded on the first instance, in its third chunk. The table below shows in which instance and chunk each line ends up:
Chunk\Instance | 0 | 1 |
---|---|---|
1st {0,... | '10',... '11',... | '12',... |
2nd {10,... | '13',... | '14',... |
3rd {20,... | '15',... | |
We use this data-loading logic to generate a unique and dense set of values associated with each record (corresponding to their original order in the input file): we construct three new attributes using the tuple_no and dst_instance_id dimensions and sort the array (see documentation) on these new attributes:
# iquery --afl
AFL% sort(
apply(
aio_input('/rec-3.txt', 'num_attributes=3', 'buffer_size=10'),
chunk_no, tuple_no / 10,
inst, dst_instance_id,
chunk_idx, tuple_no % 10),
chunk_no, inst, chunk_idx);
{n} a0,a1,a2,error,chunk_no,inst,chunk_idx
{0} '10','20','30',null,0,0,0
{1} '11','21','31',null,0,0,1
{2} '12','22','32',null,0,1,0
{3} '13','23','33',null,1,0,0
{4} '14','24','34',null,1,1,0
{5} '15','25','35',null,2,0,0
The first new attribute, chunk_no, keeps track of the chunk number to which a record belongs. inst keeps track of the instance on which a record was loaded. Finally, chunk_idx keeps track of the index of a record inside a specific chunk on a specific instance. The value 10 represents the implicit chunk_size used in our example. The result of the sort operator is a one-dimensional array in which the dimension n stores the unique and dense record indexes we need.
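From here, the dimension n can play the role that tuple_no played in the first section; a minimal sketch that rebuilds a one-dimensional array with numeric attributes from the sorted result:
# iquery --afl
AFL% redimension(
       apply(
         sort(
           apply(
             aio_input('/rec-3.txt', 'num_attributes=3', 'buffer_size=10'),
             chunk_no, tuple_no / 10,
             inst, dst_instance_id,
             chunk_idx, tuple_no % 10),
           chunk_no, inst, chunk_idx),
         x, int64(a0),
         y, int64(a1),
         val, int64(a2)),
       <x:int64, y:int64, val:int64>[n]);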
The default buffer_size is 8MB and the default chunk_size is 10,000,000. Depending on the input data and the values set for these parameters, the tuple_no dimension might end up containing unique and dense values and could be used directly, but this should be verified.
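One quick check (necessary, though not sufficient on its own) is to compare the number of loaded records with the largest tuple_no value; a sketch (the attribute name tn is ours):
# iquery --afl
AFL% aggregate(
       apply(
         aio_input('/rec-3.txt', 'num_attributes=3'),
         tn, tuple_no),
       count(*), max(tn));
If tuple_no is unique and dense, the count equals the maximum plus one; if the two disagree, the sort-based approach above should be used.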
The input data and the queries are available here.