TileDB and Armadillo


#1

Are there any plans to integrate with Armadillo?

At the moment, using TileDB with Armadillo would seem to require a copy of the data. Armadillo can construct a matrix or cube from existing memory (http://arma.sourceforge.net/docs.html#adv_constructors_cube), but this is limited because changing the size of the Armadillo matrix would corrupt the TileDB memory. I guess there is also a way to use the Query API (https://docs.tiledb.io/en/stable/c++-api.html#query) to construct a TileDB array from an Armadillo matrix. Again, if the dimensions of the matrix or cube were then modified, this would seem to require creating a new TileDB record rather than modifying the existing one.

Any thoughts on this topic would be most welcome.

Peter.


#2

Hi Pete,

It seems like what you are asking about is a “global view” representation of a TileDB array using the C++ Armadillo array / matrix API (please correct me if this is not the case). In this way, all operations on Armadillo matrices would be reflected on disk (resize, reshape, applying an operation elementwise, etc.). This is not currently possible, and I don’t think the Armadillo API is flexible enough to make this happen. What TileDB could support would be to load / save Armadillo matrices / arrays from / to storage, but the entire array would have to be in memory. As you said, you would do this through the query mechanism. If Armadillo has an API to construct a matrix / array from a dense buffer (and could expose the buffer of an already constructed Armadillo matrix / array), then you could do this with zero copies (this is essentially how all the HL bindings to TileDB work).

For sparse arrays, Armadillo stores the matrices in CSC order. TileDB uses COO storage so there would be a conversion cost.
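To make the conversion cost concrete, here is a minimal standard-library sketch of turning COO triplets into CSC arrays. The `CscMatrix` struct and `coo_to_csc` function are hypothetical names for illustration, not part of TileDB or Armadillo, and the sketch assumes 0-based coordinates with no duplicates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Compressed sparse column layout (hypothetical helper type): values and row
// indices are stored column by column, and col_ptr[j]..col_ptr[j + 1]
// delimits the entries of column j.
struct CscMatrix {
  std::vector<double> values;
  std::vector<std::size_t> row_idx;
  std::vector<std::size_t> col_ptr;  // n_cols + 1 entries
};

// Convert 0-based COO triplets (rows[k], cols[k], vals[k]) to CSC.
// Assumes no duplicate coordinates and cols[k] < n_cols.
CscMatrix coo_to_csc(std::size_t n_cols,
                     const std::vector<std::size_t>& rows,
                     const std::vector<std::size_t>& cols,
                     const std::vector<double>& vals) {
  const std::size_t nnz = vals.size();
  assert(rows.size() == nnz && cols.size() == nnz);

  // Sort a permutation by (column, row) so each column's rows are ascending.
  std::vector<std::size_t> order(nnz);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return cols[a] != cols[b] ? cols[a] < cols[b] : rows[a] < rows[b];
  });

  CscMatrix out;
  out.values.reserve(nnz);
  out.row_idx.reserve(nnz);
  out.col_ptr.assign(n_cols + 1, 0);
  for (std::size_t k : order) {
    out.values.push_back(vals[k]);
    out.row_idx.push_back(rows[k]);
    ++out.col_ptr[cols[k] + 1];  // count entries per column
  }
  // A prefix sum turns per-column counts into column start offsets.
  std::partial_sum(out.col_ptr.begin(), out.col_ptr.end(),
                   out.col_ptr.begin());
  return out;
}
```

The sort dominates the cost, and the output arrays are new allocations, so this path is not zero-copy. Armadillo also documents sparse-matrix constructors that take coordinate lists directly, which may make a hand-written conversion unnecessary.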

Jake


#3

Hi Jake,

Thanks for your prompt response.

If Armadillo has an API to construct an matrix / array from a dense buffer (and could expose the buffer of an already constructed armadillo matrix / array) then you could do this with zero copies (this is essentially how all the HL bindings to TileDB work).

Zero copy is essential for me for large matrix sizes, where I may not have enough memory for multiple copies of the matrix. From my reading of the API, once submit() (previously called finalize? https://docs.tiledb.io/en/latest/c++-api.html#_CPPv3N6tiledb5Query8finalizeEv) is called, any Armadillo matrix that I have constructed from a zero-copy memory buffer, or used to construct a TileDB fragment, would then have to be destroyed to avoid corruption. Which part would own the memory? For example, if I construct a large TileDB matrix from large fragments created with Armadillo and passed to TileDB by memory buffer, how is the buffer ownership handled? If I call submit() and then clear the Armadillo matrix (perhaps in a different thread, or after submit_async() but before completion), thus modifying memory now “owned” by TileDB, this would corrupt the fragment. Perhaps I need to dig deeper into TileDB and use the same bindings that the HL APIs use rather than the TileDB C++ API? In short, it’s not clear to me how the memory buffer ownership works.

Second topic: if I don’t know how big my TileDB matrix is at the start, or want to reshape or resize the TileDB matrix, it is not obvious to me how I can do this with the query API. In short, the very good examples that you provide with the code show me how to create, load and save, but not how to modify (i.e. resize) a TileDB matrix after construction.

Thanks in advance for any assistance / pointers you can provide.

Peter.


#4

There should be no memory corruption as long as the buffer outlives the query (i.e. it stays valid until the query status is COMPLETE). All buffers given to TileDB have this property, in that there is no ownership handoff of memory. This makes it nicer to use from HL languages / APIs, as the user can allocate, say, a native Numpy array or Armadillo matrix, provide TileDB a pointer to the underlying buffer, load / save the array / matrix data, and continue using the array or reshaping / modifying the buffer (in memory) after the TileDB query is completed.

We only touch upon memory management issues in our docs. I’ll discuss with the team and hopefully we can provide better examples / make this more clear in the documentation. Unfortunately the Sphinx render of the C++ API doxygen comments kind of obscures the fact that you can pass a pointer to an underlying buffer to the API and not just a Vec container type (there are some bugs with the C++ / Sphinx doc rendering).
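The borrowing contract described here can be modelled with a toy class. This is only a sketch of the lifetime rule: `ToyWriteQuery` and its members are hypothetical stand-ins written for this illustration and are not the TileDB API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a write query (hypothetical, NOT the TileDB API).
// It borrows the caller's buffer and never takes ownership of it.
class ToyWriteQuery {
 public:
  // Only the pointer and the element count are recorded; nothing is copied.
  void set_buffer(const double* data, std::size_t n) {
    data_ = data;
    n_ = n;
  }

  // Read the borrowed buffer and persist its contents into `storage`.
  // The buffer must stay valid and unmodified until this returns.
  void submit(std::vector<double>& storage) {
    storage.assign(data_, data_ + n_);
    complete_ = true;
  }

  bool complete() const { return complete_; }

 private:
  const double* data_ = nullptr;
  std::size_t n_ = 0;
  bool complete_ = false;
};
```

Once `complete()` is true in this model, the caller can resize, overwrite, or free the buffer without affecting what was stored. Modifying it before that point is the case that could corrupt the write.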

For the second topic: we have largely not addressed this in the core TileDB library yet but it is on our roadmap (reshaping / resizing arrays). It is complicated by the fact that TileDB arrays are append only / immutable, so to do this multiple arrays would have to share the same underlying fragment data (if possible).


#5

I have an I/O-bound application that makes extensive use of Armadillo, and I would like to play with TileDB to see if I can improve the performance.

The requirement that Jake mentions for it to be operative

could be satisfied with one of the methods

that Armadillo offers to interface to other libraries. Both offer a pointer to the memory of an already created matrix.

That seems to be just what I need. However, does “entire array” refer to the Armadillo matrix or to the TileDB array? That is, could I save / load an entire Armadillo matrix that is just a subarray of the TileDB array?

How should I do it? Something like

// Create array 

Context ctx;
Domain domain(ctx);
domain.add_dimension(Dimension::create<int>(ctx, "rows", {{1, 30}}, 30))
  .add_dimension(Dimension::create<int>(ctx, "cols", {{1, 30}}, 30));
ArraySchema schema(ctx, TILEDB_DENSE);
schema.set_domain(domain).set_order({{TILEDB_COL_MAJOR, TILEDB_COL_MAJOR}});
schema.add_attribute(Attribute::create<double>(ctx, "a"));
Array::create(array_name, schema);

// Save

mat A(30,30, fill::zeros);
Context ctx;
Array array(ctx, array_name, TILEDB_WRITE);
Query query(ctx, array);
query.set_layout(TILEDB_COL_MAJOR)
     .set_buffer("a", A.memptr());
query.submit();
array.close();

// Load 

mat B(10,10);
const std::vector<int> subarray = {1, 10, 1, 10};
Context ctx;
Array array(ctx, array_name, TILEDB_READ);
Query query(ctx, array);
query.set_subarray(subarray)
     .set_layout(TILEDB_COL_MAJOR)
     .set_buffer("a", B.memptr());
query.submit();  // run the read; without this, B is never filled
array.close();

#6

I used the std::vector with the set_buffer api because I could not find a direct memory access api.
template <typename Vec>
void set_buffer(const std::string& attr, Vec& buf) {
  static_assert(
      std::is_fundamental<typename Vec::value_type>::value,
      "Template type must be a vector of a fundamental type.");

It is possible to wrap the Armadillo memory address in a vector; see https://stackoverflow.com/questions/7278347/c-pointer-array-to-vector. A direct-memory set_buffer API might exist now; I have not looked at the latest version. Anyway, it all works quite well. I even have a set of “Armadillo style” templates that give me access to the TileDB data. Of course n_elem, n_rows and n_cols do not quite cut it for tiled matrices, so I have a whole set of tile-related members. I also explored TILEDB_COL_MAJOR vs. TILEDB_GLOBAL_ORDER: clearly TILEDB_GLOBAL_ORDER is better for performance, because TILEDB_COL_MAJOR invokes the tiling algorithms. I have not yet done a performance comparison of saving data to disk via TileDB vs. saving via Armadillo. I suspect that Armadillo has less overhead and so will save faster than TileDB, so if it’s just performance you need, you might find it better to stick with Armadillo.

Once you get your mind around the memory ordering of the tiles, you will begin to understand how to save an Armadillo submatrix correctly. So I can save / load an Armadillo matrix into a set of tiles in TileDB using the set_subarray() API. As soon as the Armadillo matrix you load / save is larger than a tile, you have to use TILEDB_COL_MAJOR, and so you will experience the delays introduced by tiling. This type of delay is no different from the delays you get if you use Armadillo’s reshape or submat.

Yes. With my std::vector wrapper I use Vec tiles(data.memptr(), data.n_elem);
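As a rough illustration of that memory ordering, the sketch below compares the linear offset of a cell in plain column-major order with its offset in a tiled global order. It assumes 0-based coordinates, column-major order for both tiles and cells (as in the schema posted earlier), and a domain that divides evenly into tiles. `TiledShape` and both functions are hypothetical helpers, not TileDB calls.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a dense 2-D array and its tile extents.
struct TiledShape {
  std::size_t n_rows, n_cols;        // full array size
  std::size_t tile_rows, tile_cols;  // tile extents
};

// Offset of cell (r, c) when the whole array is one column-major block,
// which is how an Armadillo matrix lays out its memory.
inline std::size_t col_major_offset(const TiledShape& s, std::size_t r,
                                    std::size_t c) {
  return c * s.n_rows + r;
}

// Offset of cell (r, c) when cells are grouped tile by tile: tiles are
// visited in column-major order, and cells inside each tile are also
// column-major. Assumes n_rows % tile_rows == 0 and n_cols % tile_cols == 0.
inline std::size_t global_order_offset(const TiledShape& s, std::size_t r,
                                       std::size_t c) {
  const std::size_t tiles_down = s.n_rows / s.tile_rows;
  const std::size_t tile_index =
      (c / s.tile_cols) * tiles_down + (r / s.tile_rows);
  const std::size_t cell_index =
      (c % s.tile_cols) * s.tile_rows + (r % s.tile_rows);
  return tile_index * (s.tile_rows * s.tile_cols) + cell_index;
}
```

When a single tile covers the whole array, the two offsets coincide, so a column-major buffer is already in global order. With smaller tiles they diverge, which is where the reordering work comes from.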

Yes.

Have fun!
Peter


#7

Hi Pete,

But the pointer to vector conversion that you link to is said to involve a copy. Not desirable in my case.
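The difference between copying into a std::vector and merely pointing at existing memory can be sketched with the standard library alone. `BufferView`, `copy_into_vector` and `view_of` are hypothetical names for this illustration; the raw pointer stands in for what a matrix's memptr() would return.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: it only records where the data lives.
template <typename T>
struct BufferView {
  T* data;
  std::size_t size;
};

// Building a std::vector from a pointer range allocates new storage and
// copies every element into it.
inline std::vector<double> copy_into_vector(const double* p, std::size_t n) {
  return std::vector<double>(p, p + n);
}

// Wrapping the same range in a view copies nothing; the view aliases the
// original memory and must not outlive it.
inline BufferView<double> view_of(double* p, std::size_t n) {
  return BufferView<double>{p, n};
}
```

A later write to the source is visible through the view but not through the vector, which is the observable sign that the vector holds its own copy.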

Just now I found an overloaded set_buffer that takes a pointer instead of the buffer itself.

Calls to set_buffer would be like

query.set_layout(TILEDB_COL_MAJOR)
     .set_buffer("a", A.memptr(), A.n_elem);

I have created a proof of concept gist here.

Maybe … With Armadillo it is very nice that HDF5 is tightly integrated, so it is very easy for the user to do their stuff, but you lose control over how I/O is done. In a case like mine, where maximum performance is needed, other options should be considered. If I have to do I/O on my own, there are two probably equivalent solutions for my problem set (dense arrays): HDF5 or TileDB. TileDB might require less writing effort.
There are two possible benefits that might come out of this change:

  • Applying compression filters to reduce read/write.
  • Thread-safe read/write. In my application I/O happens inside OpenMP parallel regions, which forces me to use critical sections (destroying the parallelization along the way). I wonder whether TileDB, with its concurrent write operations, would offer any advantage over HDF5.