Hi, we are using the CU-3 variant. We have a question regarding dataset size. Here is a brief introduction to how we are using ARTIQ:

  1. We have an experiment script that runs a for loop, scanning through an experimental parameter.
  2. On each iteration of the loop, we take a picture of our setup (646 × 482 px) and append it to a dataset with append_to_dataset() (see the sketch after this list).
  3. At the end of the script, ARTIQ saves the dataset in the HDF5 file as per normal.
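
To make this concrete, here is a rough sketch of what the experiment does (ParameterScan and grab_frame are placeholder names, not our real code; the real camera readout is more involved):

    from artiq.experiment import *
    import numpy as np

    class ParameterScan(EnvExperiment):
        def build(self):
            pass

        def run(self):
            self.set_dataset('images', [], broadcast=False)
            for value in np.linspace(0.0, 1.0, 120):     # scan the experimental parameter
                frame = self.grab_frame(value)           # placeholder for the camera readout
                self.append_to_dataset('images', frame)  # one 482 x 646 frame per iteration
            # on normal exit, artiq_master archives all datasets to the results HDF5 file

        def grab_frame(self, value):
            # placeholder: a dummy frame standing in for a real camera image
            return np.zeros((482, 646), dtype=int)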

We noticed that if we take more than about 100 pictures, we cannot open the resulting HDF5 file: it fails with OSError: Unable to open file (bad object header version number). If we stay below 100 pictures, this error does not occur. It does not look like a camera or image problem either, because we use the dashboard applet to show the last image taken, and all the images it displays look fine.

At 120 pictures of the aforementioned size, the HDF5 file is 148 MB on disk. We initially thought this could be a RAM issue, but the problem persists even after we upgraded the RAM on the computer running artiq_master from 16 GB to 64 GB.

We also wondered whether the elapsed time between the start and end of the experiment mattered, but we tried running a very long experiment (padded with time.sleep()) without saving images, and its HDF5 file opens fine.

Is there a limit to the size of the dataset, or does anyone know what could have gone wrong with the saving of the HDF5 file?

Please help, thank you!

25 days later

Is there perhaps a watchdog (we are on ARTIQ 6) on the writing of the HDF5 file? If so, how can I remove the watchdog or extend the time limit before it times out?

    ngkiaboon Is there perhaps a watchdog (we are on ARTIQ 6) on the writing of the HDF5 file?

    No there isn't.

    Are you able to create such large HDF5 files by using h5py in the Python REPL?

      sb10q Yes, I can create 200 MB HDF5 files (one dataset with np.zeros(26000000)) with no problem; no explicit compression used:

      import h5py
      import numpy as np

      with h5py.File('blah.h5', 'w') as f:
          f.create_dataset('blah', data=np.zeros(26000000))

      This is done on the same computer that runs artiq_master.

      I tried messing around a bit more with saving files through jobs submitted to artiq_master (dummy jobs that only save one large array into a dataset). The limit seems to be around np.zeros(25900000), corresponding to about 101 MB on disk, but the threshold moves around: with the exact same arguments, np.zeros(25900000) sometimes saves a good HDF5 file and sometimes produces one that fails with the bad object header version number error. One trend I noticed is that bad HDF5 files are usually slightly smaller than good ones (by 1 KB to a few tens of KB), although some bad files have exactly the same size as good ones.
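
      As a sanity check on those numbers: assuming dtype=int ends up as 32-bit integers on this machine (4 bytes per element), 25.9 million elements is roughly the observed file size:

      import numpy as np

      a = np.zeros(25900000, dtype=np.int32)  # assumption: dtype=int is 32-bit here
      print(a.nbytes / 1e6)                   # ~103.6 MB, close to the ~101 MB on disk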

      Posts on the HDF Group forum seem to suggest that something could be going wrong during the file writing process. At this point I am not sure whether it is an h5py problem, a hardware problem (e.g. not enough memory because some of it is used by other background processes, despite our upgraded RAM), or a timeout-related issue.
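
      For what it's worth, the minimal probe I use on a suspect file is just opening it and touching a dataset (the file name is a placeholder; ARTIQ stores datasets under the 'datasets' group in its results files):

      import h5py

      with h5py.File('results.h5', 'r') as f:  # raises OSError on a corrupted file
          print(f['datasets/images'].shape)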

      Regarding my suspicion about timeouts: sometimes when bad HDF5 files are written, I see log entries in my artiq_dashboard along the lines of worker refuses to die, or a job running in parallel (interrupted through scheduler.pause()) terminates with core device connection closed unexpectedly. This does not happen for every bad HDF5 file (most of those are near the threshold), but it happens very consistently if I write a file far above the threshold (though still well below 200 MB on disk). So it seems to me that something is watching the tasks and killing them when their time is up.

      You mentioned that there is no watchdog for the writing of the HDF5, but are there other watchdogs watching relevant processes (in asyncio maybe) that may affect the saving of the HDF5 file?

        I am going to try this on another computer. The code that I submitted to artiq_master is the following:

        from artiq.experiment import *
        import numpy as np

        class Dummy(EnvExperiment):
            """Bare minimum to check for `bad object header version number`"""
            def build(self):
                self.setattr_argument('size', NumberValue(default=26500000, type='int', ndecimals=0, step=1), tooltip="Number of elements")

            def prepare(self):
                pass

            def run(self):
                self.set_dataset('images', [], broadcast=False)
                frame = np.zeros(self.size)
                self.append_to_dataset('images', np.array(np.flip(frame), dtype=int))

        The bad object header version number error starts appearing at around 26.5 million elements.

        Here's a screenshot of our ARTIQ versions:

        I wonder if it is possible to recreate the error on your side? I shall try it on other computers on my side in the meantime.

        Update: I tried running artiq_master on a different computer and submitting the Dummy job to it, and the problem persists. My current workaround is to set dtype='uint8' instead of dtype=int (my images are arrays of integers in [0, 256)), but I foresee needing larger arrays in the future.
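
        For anyone else hitting this, the workaround is just a one-line dtype change in the Dummy code above (valid for us because every pixel value is an integer in [0, 256)):

        # in Dummy.run(): append the frame as uint8 instead of the platform int
        self.append_to_dataset('images', np.array(np.flip(frame), dtype=np.uint8))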

        ngkiaboon You mentioned that there is no watchdog for the writing of the HDF5, but are there other watchdogs watching relevant processes (in asyncio maybe) that may affect the saving of the HDF5 file?

        asyncio isn't involved in the experiment itself; experiments are run in isolated processes based on worker_impl.py.

        If you suspect the master might be killing the worker process during the file save operation, try running the experiment independently with artiq_run.

          sb10q If you suspect the master might be killing the worker process during the file save operation, try running the experiment independently with artiq_run.

          I tried running with artiq_run, and it ran perfectly fine; the HDF5 files were saved with no problems at all even when I set the array size to, say, 50 million, much larger than the ~26.5 million limit (with int32). Running the exact same job through artiq_dashboard gives the same bad object header version number error when I open the HDF5 file.

          So this seems to confirm my suspicion that the master is somehow killing the worker process. Let me take a look at worker_impl.py, but I can't promise I will get far, so any help will be appreciated.

          3 months later

          I am also having this issue when attempting to save datasets containing many images. Did you make headway on this?

            jpagett Nope, we just set the datatype to 'uint8' to reduce the size of each image, which works for us for now; it would be great if someone could get to the bottom of this, though. Good to know we are not the only ones with this problem.

              ngkiaboon I suspected that the data writing process might be getting cut short somewhere. I believe I've fixed the issue by making a small change to the main function in worker_impl.py. I am on ARTIQ-7.

              Prior to the change:

              elif action == "analyze":
                  try:
                      exp_inst.analyze()
                      put_completed()
                  finally:
                      # browser's analyze shouldn't write results,
                      # since it doesn't run the experiment and cannot have rid
                      if rid is not None:
                          write_results()

              I switched put_completed() and the write_results() block:

              elif action == "analyze":
                  try:
                      exp_inst.analyze()
                      # browser's analyze shouldn't write results,
                      # since it doesn't run the experiment and cannot have rid
                      if rid is not None:
                          write_results()
                  finally:
                      put_completed()

              I can now save large datasets, in excess of 2.5 GB. The end of the experiment now waits a bit, presumably because completion is only reported once write_results() has finished, so the master no longer tears down the worker while the HDF5 file is still being written.