TileDB and Armadillo


#1

Are there any plans to integrate with Armadillo?

At the moment, using TileDB and Armadillo together would seem to require a copy of the data. Armadillo can construct a matrix or cube from existing memory (http://arma.sourceforge.net/docs.html#adv_constructors_cube), but this is limited because changing the size of the Armadillo matrix would corrupt the TileDB memory. I guess there is also a way to use the Query API (https://docs.tiledb.io/en/stable/c++-api.html#query) to construct a TileDB array from an Armadillo matrix. Again, if the dimensions of the matrix or cube were then modified, this would seem to require a new TileDB record to be created rather than modifying the existing record.

Any thoughts on this topic would be most welcome.

Peter.


#2

Hi Pete,

It seems like what you are asking about is a “global view” representation of a TileDB array using the C++ Armadillo array / matrix API (please correct me if this is not the case). In this way, all operations on Armadillo matrices would be reflected on disk (resize, reshape, applying an operation elementwise, etc.). This is not currently possible, and I don’t think the Armadillo API is flexible enough to make this happen. What TileDB could support is loading / saving Armadillo matrices / arrays from / to storage, but the entire array would have to be in memory. As you said, you would do this through the query mechanism. If Armadillo has an API to construct a matrix / array from a dense buffer (and can expose the buffer of an already constructed Armadillo matrix / array), then you could do this with zero copies (this is essentially how all the high-level (HL) bindings to TileDB work).

For sparse arrays, Armadillo stores the matrices in compressed sparse column (CSC) order. TileDB uses coordinate (COO) storage, so there would be a conversion cost.
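To give a feel for that conversion cost, here is a minimal sketch of expanding CSC arrays into COO triplets. It uses plain `std::vector` inputs rather than either library, and the `CooMatrix` struct and `csc_to_coo` function are hypothetical names for illustration only. The three inputs are assumed to correspond to the column-pointer, row-index, and value arrays that a CSC matrix carries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical COO container: one (row, col) coordinate pair per stored value.
struct CooMatrix {
  std::vector<std::uint64_t> rows;
  std::vector<std::uint64_t> cols;
  std::vector<double> values;
};

// Expand CSC arrays into COO triplets.
// col_ptrs has n_cols + 1 entries; column j owns the entries in the
// half-open range [col_ptrs[j], col_ptrs[j + 1]).
CooMatrix csc_to_coo(const std::vector<std::uint64_t>& col_ptrs,
                     const std::vector<std::uint64_t>& row_indices,
                     const std::vector<double>& values) {
  assert(!col_ptrs.empty());
  assert(row_indices.size() == values.size());
  assert(col_ptrs.back() == values.size());

  CooMatrix coo;
  coo.rows.reserve(values.size());
  coo.cols.reserve(values.size());
  coo.values.reserve(values.size());

  const std::size_t n_cols = col_ptrs.size() - 1;
  for (std::size_t j = 0; j < n_cols; ++j) {
    for (std::uint64_t k = col_ptrs[j]; k < col_ptrs[j + 1]; ++k) {
      coo.rows.push_back(row_indices[k]);  // row index is stored explicitly in CSC
      coo.cols.push_back(j);               // column index is implied by col_ptrs
      coo.values.push_back(values[k]);
    }
  }
  return coo;
}
```

The conversion is a single pass over the non-zeros, but it materialises an explicit column coordinate for every value, so the extra memory grows with the number of non-zero entries.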

Jake


#3

Hi Jake,

Thanks for your prompt response.

“If Armadillo has an API to construct a matrix / array from a dense buffer (and could expose the buffer of an already constructed Armadillo matrix / array) then you could do this with zero copies (this is essentially how all the HL bindings to TileDB work).”

Zero copy is essential for me for large matrix sizes, where I may not have enough memory for multiple copies of the matrix. From my reading of the API, after submit() (previously called finalize? https://docs.tiledb.io/en/latest/c++-api.html#_CPPv3N6tiledb5Query8finalizeEv) is called, any Armadillo matrix that I have constructed from a zero-copy memory buffer, or used to construct a TileDB fragment, would then have to be destroyed to avoid corruption. Which part would own the memory? For example, if I construct a large TileDB matrix from large fragments created with Armadillo and passed to TileDB by memory buffer, how is the buffer ownership handled? If I call submit() and then clear the Armadillo matrix (perhaps in a different thread, or after submit_async() but before completion), thus modifying the memory now “owned” by TileDB, then this would corrupt the fragment. Perhaps I need to dig deeper into TileDB and use the same bindings that the HL APIs use rather than the TileDB C++ API? In short, it’s not clear to me how the memory buffer ownership works.

Second topic: if I don’t know how big my TileDB matrix is at the start, or want to reshape or resize the TileDB matrix, it is not obvious to me how I can do this with the Query API. In short, the very good examples that you provide with the code show me how to create, load, and save, but not how to modify (i.e. resize) a TileDB matrix after construction.

Thanks in advance for any assistance / pointers you can provide.

Peter.


#4

There should be no memory corruption as long as the buffer outlives the query (that is, it stays valid until the query status is COMPLETE). All buffers given to TileDB have this property: there is no ownership handoff of memory. This makes it nicer to use from HL languages / APIs, as the user can allocate, say, a native NumPy array or Armadillo matrix, provide TileDB a pointer to the underlying buffer, load / save the array / matrix data, and continue using the array or reshaping / modifying the buffer (in memory) after the TileDB query is completed.
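The lifetime rule can be illustrated with a toy model. The sketch below is not the TileDB API: `BorrowingWriteQuery` is a hypothetical stand-in, written with only the standard library, and the `storage` vector stands in for a fragment on disk. It assumes, for illustration, that the query keeps nothing from the caller's buffer once it has completed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a write query (hypothetical, NOT the TileDB API).
// It borrows the caller's buffer and never frees or resizes it.
class BorrowingWriteQuery {
 public:
  enum class Status { Uninitialized, Complete };

  // Record a non-owning view of the caller's data.
  void set_buffer(const double* data, std::size_t n_elements) {
    data_ = data;
    n_elements_ = n_elements;
  }

  // Copy the borrowed data into `storage`, which stands in for a fragment.
  // The caller's buffer must stay alive and unmodified until this returns.
  Status submit(std::vector<double>& storage) {
    assert(data_ != nullptr || n_elements_ == 0);
    storage.assign(data_, data_ + n_elements_);
    data_ = nullptr;  // assumption of this model: nothing is retained afterwards
    n_elements_ = 0;
    status_ = Status::Complete;
    return status_;
  }

  Status status() const { return status_; }

 private:
  const double* data_ = nullptr;
  std::size_t n_elements_ = 0;
  Status status_ = Status::Uninitialized;
};
```

In this model, the caller is free to clear, resize, or destroy its buffer once `submit()` has returned `Complete`, and the stored data is unaffected. With Armadillo, the pointer and element count would presumably come from the matrix's `memptr()` and `n_elem`; an asynchronous submit would extend the period during which the buffer must remain untouched until completion is reported.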

We only touch upon memory management issues in our docs. I’ll discuss with the team, and hopefully we can provide better examples and make this clearer in the documentation. Unfortunately, the Sphinx render of the C++ API Doxygen comments somewhat obscures the fact that you can pass a pointer to an underlying buffer to the API and not just a vector container type (there are some bugs with the C++ / Sphinx doc rendering).

For the second topic: we have largely not addressed reshaping / resizing arrays in the core TileDB library yet, but it is on our roadmap. It is complicated by the fact that TileDB arrays are append-only / immutable, so to do this, multiple arrays would have to share the same underlying fragment data (if possible).