OK, I have the following setup: 5,000 samples and 600k loci. My tile configuration was previously 1 x 600k, which means one sample is contained in one tile spanning 600k cells, for a total of 5,000 tiles (one tile per sample). I write by sample and read across samples.
Given this configuration, I can write very fast (which I was). However, my read time suffered, because to read the information for a single locus I have to jump across 5,000 tiles, one per sample.
The array with this configuration took 25 GB of disk space, which is fine.
Now I want to improve the read times, and at a minimum make sure I don't have to read across 5,000 tiles just to fetch one locus. To do this, I changed my tile dimensions from (1 x 600k) to (10 x 600k), and what I am seeing is that the size of the array on disk is substantially higher. What I am failing to understand is that it's the same amount of data, just organized differently. Why is there such a huge difference in disk space? Or am I doing something wrong?
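For reference, here is the tile arithmetic behind the read-path reasoning above, as a minimal pure-Python sketch (no specific array library assumed; the array and tile shapes are taken from the numbers in this post). It shows how many tiles a single-locus read must touch under each configuration:

```python
import math

SAMPLES, LOCI = 5_000, 600_000  # array shape from the post

def tiles_touched_per_locus_read(tile_rows: int) -> int:
    """Number of tiles intersected by one full column
    (one locus read across all samples), given the tile
    extent along the sample dimension."""
    return math.ceil(SAMPLES / tile_rows)

# Old layout, tiles of shape (1, 600k):
#   a single-locus read crosses one tile per sample.
print(tiles_touched_per_locus_read(1))   # 5000

# New layout, tiles of shape (10, 600k):
#   the same read crosses 10x fewer tiles.
print(tiles_touched_per_locus_read(10))  # 500
```

So the (10 x 600k) tiling does cut the tiles-per-read by 10x as intended; the open question is only why the on-disk size grew.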