To define a SciDB array, we need to specify its dimensions. For each dimension, we specify a name, a low value, a high value, a chunk length and a chunk overlap (see documentation). The chunk parameters are internal to SciDB and have a large effect on its performance. In this post, we look at a simple example where being careless about the chunk length gets us into trouble very fast.

When we are starting out with SciDB, we might be tempted to ignore the chunk length parameter when declaring array dimensions, either using the default values or specifying some large values. For example:

# iquery --afl
AFL% create array foo<x:int64> [i];
Query was executed successfully
AFL% show(foo);
{i} schema
{0} 'foo<x:int64> [i=0:*,1000000,0]'
AFL% create array bar<x:int64> [i, j];
Query was executed successfully
AFL% show(bar);
{i} schema
{0} 'bar<x:int64> [i=0:*,1000,0,j=0:*,1000,0]'

As we can see, the default logical chunk size is 1,000,000 cells, split evenly across dimensions: a chunk length of 1,000,000 for a one-dimensional array, 1,000 per dimension for a two-dimensional array, and so on. This is probably not a problem for most operators.
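Our reading of these defaults can be reproduced with a back-of-envelope sketch (our own Python, not SciDB code; we assume a target logical chunk size of roughly 1,000,000 cells split evenly across dimensions):

```python
# Approximate SciDB's default per-dimension chunk length under the
# assumption that a target of ~1,000,000 cells per chunk is split
# evenly across dimensions, i.e. each dimension gets roughly the
# d-th root of 1,000,000.
TARGET_CELLS = 1_000_000

def default_chunk_length(ndims: int) -> int:
    return round(TARGET_CELLS ** (1 / ndims))

print(default_chunk_length(1))  # 1000000, matches foo's schema
print(default_chunk_length(2))  # 1000, matches bar's schema
```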

A Few Joins

Let’s add two records to the foo array and cross-join it with itself twice, storing the result in a new array, taz:

# iquery --afl
AFL% store(
       redimension(
         build(<x:int64> [i=0:1,?,?], i),
         foo),
       foo);
{i} x
{0} 0
{1} 1
AFL% store(
       cross_join(
         cross_join(foo, foo),
         foo),
       taz);
{i,i_2,i_3} x,x_2,x_3
{0,0,0} 0,0,0
{0,0,1} 0,0,1
{0,1,0} 0,1,0
{0,1,1} 0,1,1
{1,0,0} 1,0,0
{1,0,1} 1,0,1
{1,1,0} 1,1,0
{1,1,1} 1,1,1

Choking the slice Operator

Now, if we try to slice (see documentation) the taz array by fixing one of its dimensions, we get into trouble:

# iquery --afl
AFL% slice(taz, i, 0); -- takes a very long time to finish

On our SciDB instance, the query did not complete after running for a few hours at 100% CPU usage. We had to restart the database in order to stop it. We assume the query would eventually have finished.

The taz array has only 8 records. The problem is not the number of records, but the chunk lengths. The original foo array has one dimension with chunk length 1,000,000, so the taz array ends up with three dimensions, each with chunk length 1,000,000. The slice operator might try to allocate memory for a two-dimensional chunk (since we slice along one of the dimensions) with chunk length 1,000,000 in each dimension. This is far too large, and a lot of memory swapping is likely to take place. All of this happens for just 8 records. Here is the schema of the taz array:

# iquery --afl
AFL% show(taz);
{i} schema
{0} 'taz<x:int64,x_2:int64,x_3:int64> [i=0:*,1000000,0,i_2=0:*,1000000,0,i_3=0:*,1000000,0]'
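A quick back-of-envelope estimate (our own sketch, assuming one 8-byte int64 attribute per cell and a dense chunk allocation) shows why this blows up:

```python
# Estimate the dense size of a single chunk of the 2-D result of
# slicing taz: both remaining dimensions keep chunk length
# 1,000,000, so one chunk spans 10^12 cells.
CHUNK_LENGTH = 1_000_000
BYTES_PER_CELL = 8  # one int64 attribute

cells_per_chunk = CHUNK_LENGTH ** 2              # 10^12 cells
bytes_per_chunk = cells_per_chunk * BYTES_PER_CELL

print(bytes_per_chunk / 2**40)  # ~7.3 (TiB for a single dense chunk)
```

Even if SciDB only materializes part of such a chunk, the allocation target is terabytes of address space for an array holding 8 records.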

So, starting from the default chunk length and a few joins, the slice operator can get us into trouble really fast, even if we only have a handful of records in the array. We recommend keeping an eye on the chunk length and its multiplicative effect across dimensions.
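One way to act on that recommendation is to work backwards from a memory budget. The helper below is hypothetical (the name chunk_length_for and the 64 MiB target are our own choices, and we assume dense chunks), but it captures the multiplicative effect:

```python
# Hypothetical helper: pick a per-dimension chunk length so that a
# dense chunk stays within a byte budget, taking the d-th root to
# account for the multiplicative effect across dimensions.
def chunk_length_for(ndims: int, bytes_per_cell: int,
                     target_chunk_bytes: int = 64 * 2**20) -> int:
    cells = target_chunk_bytes // bytes_per_cell
    return max(1, int(cells ** (1 / ndims)))

# A 3-D array of int64, targeting ~64 MiB dense chunks:
print(chunk_length_for(3, 8))  # 203 per dimension
```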

Alternatives to the slice Operator

If large chunk lengths across dimensions cannot be avoided, we recommend using between (see documentation) followed by redimension (see documentation) instead of slice. The same slicing operation we tried before can be achieved with:

# iquery --afl
AFL% redimension(
       between(
         taz,
         0, null, null,
         0, null, null),
       <x:int64,x_2:int64,x_3:int64> [i_2=0:*,1000000,0,i_3=0:*,1000000,0]);
{i_2,i_3} x,x_2,x_3
{0,0} 0,0,0
{0,1} 0,0,1
{1,0} 0,1,0
{1,1} 0,1,1

These example queries are available here.