Hash of DenseArrays?


#1

Hello All,

First off, I just would like to acknowledge how awesome this project is! There’s some very cool technology here, and I’m excited to see the project develop in the future!

As a bit of background, I’m doing a deep dive with TileDB-py right now, primarily dealing with single-attribute DenseArrays. My application issues multiple writes to the same array (both complete reassignment of every cell, and sub-array assignment at arbitrary indices which I am not able to predict in advance). Reads are issued for arbitrary (and unpredictable) subarrays as they existed at particular timesteps.

I’m wondering if there is any way to have TileDB record some hash (sha1, md5, etc.) of the data stored in a DenseArray after a new fragment has been written? To clarify, I am referring to the hash of the full consolidated/superimposed data set (for every cell) which would be generated after writing some fragment.

Is there any built-in utility that would let me achieve this? If not, do you have any suggestions for achieving this more sanely than following each write with an immediate read of the full array into memory so the data can be hashed manually?
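For concreteness, the naive approach I’d like to avoid looks roughly like this (a NumPy array here stands in for the result of reading the full DenseArray back into memory; `data` and its shape are just placeholders):

```python
import hashlib

import numpy as np

# Placeholder for the full contents of the DenseArray, as if read back
# into memory immediately after every write.
data = np.arange(12, dtype=np.int64).reshape(3, 4)

# Hash a canonical byte representation of the entire array.
digest = hashlib.sha1(data.tobytes()).hexdigest()
print(digest)
```

This works, but the full read (and full re-hash) after every write is exactly the cost I’m hoping to sidestep.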

Let me know, and thank you in advance!
Rick Izzo


#2

Hi there,

Thanks for reaching out and for your interest in TileDB!

Adding support for verifying the consolidated array contents via hashes is an excellent suggestion. Unfortunately, as you have already imagined, it is quite challenging, especially because the updates may overwrite previously stored cells.

We have brainstormed a bit internally about this. Here is a solution that you can implement relatively easily and, if it works satisfactorily, we will consider pushing it into TileDB. Please note that it adds extra cost to write operations:

  • Associate a hash XOR accumulator with the array. This accumulator will always be equal to H = h(c_1) XOR h(c_2) XOR ... XOR h(c_n), where c_i is a cell value, n is the number of up-to-date cells your array stores at any point in time (n varies as your updates change how many cells the array contains), and h(c_i) is the hash of c_i (md5, sha1, etc.)

  • Upon writing a new set of cell values to the array, say c_1', c_2', ..., “add” them to your accumulator by simply computing H = H XOR h(c_1') XOR h(c_2') XOR ... .

  • If you never have cell overwrites, the above suffices. However, if your updates overwrite some old cell values, you need to identify and “subtract” the old values from your accumulator. You can do this by issuing the same write subarray as a read subarray, prior to your writes, in order to get the current “view” of the portion of the array that is about to be overwritten. Having those values at hand, say c_1'', c_2'', ..., “subtract” them from your accumulator by computing H = H XOR h(c_1'') XOR h(c_2'') XOR ... .

You can verify that, with the above process, at any given point in time (regardless of how many writes you have performed, how many TileDB fragments exist, etc.), your accumulator H is always the XOR of the hashes of the up-to-date cell values in your array (i.e., as if the array were consolidated). Given this accumulator and the collision-resistance properties of your hash, you can always verify your array contents.
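To make the steps concrete, here is a rough sketch of the overwrite case. A plain NumPy array stands in for your TileDB array, sha1 plays the role of h, and digests are XORed as integers; none of this is TileDB API, just an illustration of the bookkeeping:

```python
import hashlib
from functools import reduce

import numpy as np

def h(cell):
    # Hash of a single cell value, interpreted as an integer so that
    # "XOR" is plain integer XOR. A fixed-width canonical encoding of
    # the cell value is assumed here.
    digest = hashlib.sha1(int(cell).to_bytes(8, "little", signed=True)).digest()
    return int.from_bytes(digest, "big")

def xor_hashes(cells):
    # XOR of the hashes of all given cells; 0 is the identity element.
    return reduce(lambda acc, c: acc ^ h(c), cells, 0)

# `arr` stands in for your DenseArray; H is the accumulator.
arr = np.arange(10, dtype=np.int64)
H = xor_hashes(arr)

# Overwriting a subarray: "subtract" the old view, then "add" the new values.
sub = slice(2, 5)
old = arr[sub].copy()            # read the subarray BEFORE overwriting it
new = np.array([100, 101, 102], dtype=np.int64)
H ^= xor_hashes(old)             # subtract the hashes of the old cells
H ^= xor_hashes(new)             # add the hashes of the new cells
arr[sub] = new                   # perform the write

# H now matches the XOR over the consolidated contents.
assert H == xor_hashes(arr)
```

Because XOR is its own inverse, "subtracting" a cell is the same operation as "adding" it, which is what makes the read-before-write step sufficient.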

I hope this helps.

Cheers,
Stavros


#3

Hi Stavros,

Thank you so much for the incredible response! We’ve mocked up the procedure as shown below (in a very naive implementation), and everything seems to work as expected. I’m implementing it into our application now, and will update here to report back on the results. Thank you so much for your help!

Best,
Rick

import hashlib

def xor(aa, bb):
    '''XOR two equal-length byte strings together.'''
    return bytes(a ^ b for a, b in zip(aa, bb))

def arr_hash(x):
    # Note: bytes(x) on a plain int returns x zero-valued bytes, so distinct
    # non-negative ints hash differently here (via length); real cells should
    # be hashed via a canonical byte encoding (e.g. numpy's .tobytes()).
    bx = bytes(x)
    return hashlib.sha1(bx).digest()

def create_h_accum(cells):
    '''Create a hash associated with array cell values via hash XOR accumulation.'''
    h = None
    for cell in cells:
        h_cell = arr_hash(cell)
        h = h_cell if h is None else xor(h, h_cell)
    return h

def mutate_h_accum(h, cells):
    '''Add or subtract cell values to/from a current hash XOR accumulator.'''
    for cell in cells:
        h = xor(h, arr_hash(cell))
    return h

vals = list(range(10))
h_vals = create_h_accum(vals)

print(f'vals: {vals}')
print(f'h_vals: {h_vals}')

vals: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
h_vals: b'J\xd0\xa9\x11\x8b\xab\xa1\x9e\x88\xe4\xd9 _}\xda;\xb8\xf6Sj'

new_vals = list(range(10, 14))
h_new_vals = mutate_h_accum(h_vals, new_vals)

print(f'new_vals: {new_vals}')
print(f'h_new_vals: {h_new_vals}')

h_is_same = h_new_vals == create_h_accum(list(range(14)))
print(f'\nmutated/added hash is the same: {h_is_same}')

new_vals: [10, 11, 12, 13]
h_new_vals: b'\x1a\xa3U\xbah\xf6\xb4\xf1\xc2,\xca?\x92e\x15\x92\xe6\xfb\xa1\xa2'

mutated/added hash is the same: True

remove_vals = [0, 1]
h_removed_vals = mutate_h_accum(h_new_vals, remove_vals)

print(f'remove_vals: {remove_vals}')
print(f'h_removed_vals: {h_removed_vals}')

h_is_same = h_removed_vals == create_h_accum(list(range(2, 14)))
print(f'\nmutated/removed hash is the same: {h_is_same}')

remove_vals: [0, 1]
h_removed_vals: b'\x9b3\xca\xc9\x86R\x06\xc3\xa2\xccT\x07E\x0bN\xf4\xa4\x81\xde\xe4'

mutated/removed hash is the same: True