In part 1 and part 2 of this multi-part post, we looked at how to load data from multiple files while capturing information present in the file name. In this post, we look at how to load data files organized as tables with a possibly large number of columns and header rows.
Simple Table-like Data
First, let us take a look at how to load a data file with a relatively small number of columns. Assume for example that our data has three columns:
# cat rec-1.txt
10 20 30
12 22 32
14 24 34
We use the accelerated_io_tools plugin. Part 1 of this multi-part post starts with a brief intro to this plugin. We assume that the plugin is installed and loaded. See the plugin documentation for installation and loading instructions. Using the aio_input operator with the num_attributes parameter (see documentation) set to 3, we can load our example file like this:
# iquery --afl
AFL% aio_input('/rec-1.txt', 'num_attributes=3');
{tuple_no,dst_instance_id,src_instance_id} a0,a1,a2,error
{0,0,0} '10','20','30',null
{1,0,0} '12','22','32',null
{2,0,0} '14','24','34',null
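The error attribute in the output is null for every record, which indicates that each line was parsed cleanly. As an additional sanity check, the number of loaded records can be compared against the number of lines in the file; a minimal sketch using the aggregate operator (for this file it should report 3):
# iquery --afl
AFL% aggregate(
       aio_input('/rec-1.txt', 'num_attributes=3'),
       count(*));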
Notice how the three columns became three attributes, a0, a1, and a2. The result can be converted to an array with one dimension and three numeric attributes using the apply (see documentation) and redimension (see documentation) operators:
# iquery --afl
AFL% redimension(
apply(
aio_input('/rec-1.txt', 'num_attributes=3'),
x, int64(a0),
y, int64(a1),
val, int64(a2)),
<x: int64, y: int64, val: int64>[tuple_no]);
{tuple_no} x,y,val
{0} 10,20,30
{1} 12,22,32
{2} 14,24,34
In the case where some of the columns should become dimensions, we simply adjust the template array used for the redimension operator:
# iquery --afl
AFL% redimension(
apply(
aio_input('/rec-1.txt', 'num_attributes=3'),
x, int64(a0),
y, int64(a1),
val, int64(a2)),
<val: int64>[x, y, tuple_no]);
{x,y,tuple_no} val
{10,20,0} 30
{12,22,1} 32
{14,24,2} 34
Notice that in the template array used for the redimension operator, x and y are no longer attributes, but dimensions.
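If the redimensioned result is needed beyond a single query, it can be stored. A minimal sketch (the target array name rec_xy is ours):
# iquery --afl
AFL% create array rec_xy<val:int64>[x, y, tuple_no];
AFL% store(
       redimension(
         apply(
           aio_input('/rec-1.txt', 'num_attributes=3'),
           x, int64(a0),
           y, int64(a1),
           val, int64(a2)),
         rec_xy),
       rec_xy);
Passing the named array rec_xy as the redimension template plays the same role as the inline schema used above.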
Large Table-like Data
The queries from the previous section work well for data with a relatively small number of columns, where the columns can be enumerated and manipulated directly. In the case where the data has a large number of columns (i.e., tens or hundreds of columns), enumerating and manipulating the columns directly is not practical. A more practical solution is to have an additional dimension along the columns of the data. The aio_input operator comes with a split_on_dimension parameter (see documentation) which allows us to do exactly that. Let us load the same data file from the previous section:
# iquery --afl
AFL% aio_input('/rec-1.txt', 'num_attributes=3', 'split_on_dimension=1');
{tuple_no,dst_instance_id,src_instance_id,attribute_no} a
{0,0,0,0} '10'
{0,0,0,1} '20'
{0,0,0,2} '30'
{0,0,0,3} null
{1,0,0,0} '12'
{1,0,0,1} '22'
{1,0,0,2} '32'
{1,0,0,3} null
{2,0,0,0} '14'
{2,0,0,1} '24'
{2,0,0,2} '34'
{2,0,0,3} null
Notice how we have a fourth dimension, attribute_no, which stores the column index for each data element. Assuming our original data is a matrix, we can redimension the result like this:
# iquery --afl
AFL% redimension(
between(
apply(
aio_input('/rec-1.txt', 'num_attributes=3', 'split_on_dimension=1'),
i, tuple_no,
j, attribute_no,
val, int64(a)),
null, null, null, null,
null, null, null, 2),
<val:int64>[i, j]);
{i,j} val
{0,0} 10
{0,1} 20
{0,2} 30
{1,0} 12
{1,1} 22
{1,2} 32
{2,0} 14
{2,1} 24
{2,2} 34
Besides the apply and redimension operators, we also use the between operator (see documentation). The between operator helps us select just the data positions of the attribute_no dimension and discard the last position, which stores the data-loading errors. Notice that, even though the example data is small, we do not enumerate the columns of the data as in the previous section; instead, we only use their count. So, the queries presented in this section can be used for use cases where the input data has a large number of columns.
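As a side note, the same positions can also be selected with the filter operator, whose expression may reference the attribute_no dimension directly; a sketch (between is generally preferable here, since it selects by coordinates rather than evaluating a predicate for every cell):
# iquery --afl
AFL% redimension(
       apply(
         filter(
           aio_input('/rec-1.txt', 'num_attributes=3', 'split_on_dimension=1'),
           attribute_no < 3),
         i, tuple_no,
         j, attribute_no,
         val, int64(a)),
       <val:int64>[i, j]);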
Large Table-like Data with Header Row
Let us take our example a step further and assume that our data contains a header row with reference values for each column. For example, the header could contain the column position on a continuous dimension:
# cat rec-2-1.txt
1.1 1.3 1.5
10 20 30
12 22 32
14 24 34
Viewed as a table:
1.1 | 1.3 | 1.5 |
---|---|---|
10 | 20 | 30 |
12 | 22 | 32 |
14 | 24 | 34 |
Moreover, different data files can contain different sets of columns. For example:
# cat rec-2-2.txt
1.1 1.2 1.3
16 26 36
18 28 38
Viewed as a table:
1.1 | 1.2 | 1.3 |
---|---|---|
16 | 26 | 36 |
18 | 28 | 38 |
The goal is to keep track of the column header value for each data value. So, in SciDB, the two examples would be represented in arrays like this:
Row\Col | 1 | 2 | 3 |
---|---|---|---|
1 | (1.1, 10) | (1.3, 20) | (1.5, 30) |
2 | (1.1, 12) | (1.3, 22) | (1.5, 32) |
3 | (1.1, 14) | (1.3, 24) | (1.5, 34) |
and:
Row\Col | 1 | 2 | 3 |
---|---|---|---|
1 | (1.1, 16) | (1.2, 26) | (1.3, 36) |
2 | (1.1, 18) | (1.2, 28) | (1.3, 38) |
As such, let us first create the array for storing this data:
# iquery --afl
AFL% create array rec<pos:float, val:int64>[row, col];
Query was executed successfully
Moreover, we assume that the number of columns is potentially large and avoid referring to or manipulating columns directly. We start by loading the entire file into a temporary array:
# iquery --afl
AFL% store(
aio_input('/rec-2-1.txt', 'num_attributes=3', 'split_on_dimension=1'),
rec_file);
{tuple_no,dst_instance_id,src_instance_id,attribute_no} a
{0,0,0,0} '1.1'
{0,0,0,1} '1.3'
{0,0,0,2} '1.5'
{0,0,0,3} null
{1,0,0,0} '10'
{1,0,0,1} '20'
{1,0,0,2} '30'
{1,0,0,3} null
{2,0,0,0} '12'
{2,0,0,1} '22'
{2,0,0,2} '32'
{2,0,0,3} null
{3,0,0,0} '14'
{3,0,0,1} '24'
{3,0,0,2} '34'
{3,0,0,3} null
Next, we extract only the header of the table and store it in another array:
# iquery --afl
AFL% store(
redimension(
apply(
between(
rec_file,
0, null, null, null,
0, null, null, 2),
pos, float(a),
col, attribute_no),
<pos:float>[col]),
rec_head);
{col} pos
{0} 1.1
{1} 1.3
{2} 1.5
Notice how we use the between operator to extract the values where tuple_no is 0, and we only extract the first three positions of the attribute_no dimension (null to 2). We do the same for the body of the table:
# iquery --afl
AFL% store(
redimension(
apply(
between(
rec_file,
1, null, null, null,
null, null, null, 2),
val, int64(a),
row, tuple_no,
col, attribute_no),
<val:int64>[row, col]),
rec_body);
{row,col} val
{1,0} 10
{1,1} 20
{1,2} 30
{2,0} 12
{2,1} 22
{2,2} 32
{3,0} 14
{3,1} 24
{3,2} 34
Finally, we obtain the desired array with the help of the cross_join operator (see documentation):
# iquery --afl
AFL% store(
redimension(
cross_join(rec_body, rec_head, rec_body.col, rec_head.col),
rec),
rec);
{row,col} pos,val
{1,0} 1.1,10
{1,1} 1.3,20
{1,2} 1.5,30
{2,0} 1.1,12
{2,1} 1.3,22
{2,2} 1.5,32
{3,0} 1.1,14
{3,1} 1.3,24
{3,2} 1.5,34
For optimal performance, we place the larger array first when calling the cross_join operator. Notice that the columns start at index 0 but the rows start at index 1. This is because the body of the table starts at tuple_no 1, the header occupying tuple_no 0. Adjusting the columns or the rows so that they start at the same index can be done using the apply operator. We leave this as an exercise to the reader.
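One possible sketch of such an adjustment, shifting the rows so that both dimensions start at 0 (the names row_0 and rec_0 are ours):
# iquery --afl
AFL% store(
       redimension(
         apply(rec, row_0, row - 1),
         <pos:float, val:int64>[row_0, col]),
       rec_0);
Once the final array is stored, the intermediate arrays rec_file, rec_head, and rec_body can be dropped with the remove operator.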
Buffer Size Caveat
In the previous queries, we assume that the tuple_no values returned by the aio_input operator are unique and dense. Let us have a look at another example where the data file is slightly bigger:
# cat rec-3.txt
10 20 30
11 21 31
12 22 32
13 23 33
14 24 34
15 25 35
We load this data using the same aio_input operator, but we specify a small value for the buffer_size parameter (see documentation):
# iquery --afl
AFL% aio_input('/rec-3.txt', 'num_attributes=3', 'buffer_size=10');
{tuple_no,dst_instance_id,src_instance_id} a0,a1,a2,error
{0,0,0} '10','20','30',null
{1,0,0} '11','21','31',null
{0,1,0} '12','22','32',null
{10,0,0} '13','23','33',null
{10,1,0} '14','24','34',null
{20,0,0} '15','25','35',null
Let us examine the tuple_no values. They are neither unique nor dense. To understand what happened, we need to understand how the buffer_size parameter works. buffer_size specifies the size (in bytes) of a buffer used to split the input data. Each buffer of data is distributed across the cluster (in a round-robin fashion) and loaded into an array chunk. If the chunk_size parameter (see documentation) is not explicitly specified, it is set to the value of buffer_size.
In our example, the buffer_size is set to 10 bytes, the chunk_size is implicitly set to 10, and we have two SciDB instances in our cluster. As a consequence, the first two lines in the file fill the first buffer and are loaded by the first SciDB instance, in its first array chunk. Notice the tuple_no set to 0 and 1, respectively, and the dst_instance_id set to 0. The third line in the file fills the second buffer and is loaded by the second SciDB instance, in its first array chunk. Next, the fourth line is loaded by the first SciDB instance, in its second array chunk. Notice the tuple_no set to 10 (because chunk_size is 10) and the dst_instance_id set to 0. The process continues with the fifth line loaded on the second instance, in its second chunk, and the sixth line loaded on the first instance, in its third chunk. The table below shows in which instance and chunk each line ends up:
Chunk\Instance | 0 | 1 |
---|---|---|
1st {0,... | '10',... '11',... | '12',... |
2nd {10,... | '13',... | '14',... |
3rd {20,... | '15',... | |
We use this data-loading logic to generate a unique and dense set of values associated with each record (corresponding to their original order in the input file): we construct three new attributes using the tuple_no and dst_instance_id dimensions and sort the array (see documentation) on these new attributes:
# iquery --afl
AFL% sort(
apply(
aio_input('/rec-3.txt', 'num_attributes=3', 'buffer_size=10'),
chunk_no, tuple_no / 10,
inst, dst_instance_id,
chunk_idx, tuple_no % 10),
chunk_no, inst, chunk_idx);
{n} a0,a1,a2,error,chunk_no,inst,chunk_idx
{0} '10','20','30',null,0,0,0
{1} '11','21','31',null,0,0,1
{2} '12','22','32',null,0,1,0
{3} '13','23','33',null,1,0,0
{4} '14','24','34',null,1,1,0
{5} '15','25','35',null,2,0,0
The first new attribute, chunk_no, keeps track of the chunk number to which a record belongs. inst keeps track of the instance on which a record was loaded. Finally, chunk_idx keeps track of the index of a record inside a specific chunk on a specific instance. The value 10 represents the implicit chunk_size used in our example. The result of the sort operator is a one-dimensional array in which the dimension n stores the unique and dense record indexes we need.
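From here, the dimension n can play the role that tuple_no played in the first section; a minimal sketch that rebuilds a one-dimensional array with numeric attributes from the sorted result:
# iquery --afl
AFL% redimension(
       apply(
         sort(
           apply(
             aio_input('/rec-3.txt', 'num_attributes=3', 'buffer_size=10'),
             chunk_no, tuple_no / 10,
             inst, dst_instance_id,
             chunk_idx, tuple_no % 10),
           chunk_no, inst, chunk_idx),
         x, int64(a0),
         y, int64(a1),
         val, int64(a2)),
       <x:int64, y:int64, val:int64>[n]);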
The default buffer_size is 8MB and the default chunk_size is 10,000,000. Depending on the input data and the values set for these parameters, the tuple_no dimension might end up containing unique and dense values and could be used directly, but this should be verified.
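One quick check (necessary, though not sufficient on its own) is to compare the number of loaded records with the largest tuple_no value; a sketch (the attribute name tn is ours):
# iquery --afl
AFL% aggregate(
       apply(
         aio_input('/rec-3.txt', 'num_attributes=3'),
         tn, tuple_no),
       count(*), max(tn));
If tuple_no is unique and dense, the count equals the maximum plus one; if the two disagree, the sort-based approach above should be used.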
The input data and the queries are available here.