@sb10q Yes, I can create 200 MB HDF5 files (one dataset with `np.zeros(26000000)`) with no problem; no explicit compression used:
```python
import h5py
import numpy as np

with h5py.File('blah.h5', 'w') as f:
    f.create_dataset('blah', data=np.zeros(26000000))
```
This is done on the same computer that runs `artiq_master`.
I tried messing around a little more with saving files through jobs submitted to `artiq_master` (I created dummy jobs that only save a large array into a dataset, roughly as sketched below). The limit seems to be around `np.zeros(25900000)`, corresponding to about 101 MB on disk, but the threshold moves around. Sometimes I get a good HDF5 file with `np.zeros(25900000)`, but sometimes I get an HDF5 file that raises a `bad object header version number` error, all runs with the exact same arguments. A trend I noticed is that the bad HDF5 files are slightly smaller than the good ones (by 1 to a few tens of KB), but there can also be bad HDF5 files with the same file size as the good ones; again, all with the same arguments.
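For reference, the dummy jobs look roughly like this (a minimal sketch; the class name and dataset key are placeholders, and I am relying on `set_dataset` archiving to the results file by default):

```python
import numpy as np
from artiq.experiment import EnvExperiment


class DummySave(EnvExperiment):
    """Host-only dummy job: does nothing except archive one large dataset."""

    def build(self):
        pass

    def run(self):
        # Around the array size where the resulting HDF5 file starts coming out corrupted.
        self.set_dataset("blah", np.zeros(25900000))
```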
Posts on the HDF Group forum seem to suggest that something goes wrong during the file writing process. At this point I am not sure whether it is an `h5py` problem, a hardware problem (e.g. not enough memory because some of it is used by other processes running in the background, despite our upgraded RAM), or a timeout-related issue.
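For completeness, this is roughly how I check the results files afterwards (sketch only; the file name is a placeholder, and `datasets/blah` is how I recall the master laying out the file):

```python
import h5py

# Minimal read-back check; a corrupted file raises
# OSError: "bad object header version number" on open or when reading the dataset.
with h5py.File('results_file.h5', 'r') as f:
    print(f['datasets/blah'].shape)
```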
Regarding my suspicion about timeouts: sometimes when bad HDF5 files are written, I see log entries on my `artiq_dashboard` along the lines of `worker refuses to die`, or a job running in parallel (interrupted through `scheduler.pause()`, roughly following the pause idiom sketched below) terminates with `core device connection closed unexpectedly`. This does not happen for every bad HDF5 file (most of those are near the threshold), but it happens very reliably if I try to write a file far above the threshold (while still well below 200 MB on disk). So it looks to me like something is watching the tasks and killing them whenever their time is up.
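For context, the parallel job is structured roughly like this (a sketch only; the step body here is a stand-in, the real one runs kernels on the core device):

```python
import time
from artiq.experiment import EnvExperiment


class LongRunningJob(EnvExperiment):
    """Rough shape of the parallel job that gets paused by higher-priority runs."""

    def build(self):
        self.setattr_device("scheduler")

    def run(self):
        for _ in range(1000):
            time.sleep(0.1)               # stand-in for one measurement step
            if self.scheduler.check_pause():
                self.scheduler.pause()    # yield to the higher-priority experiment
```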
You mentioned that there is no watchdog for the writing of the HDF5 file, but are there other watchdogs on related processes (in asyncio maybe) that could affect the saving of the HDF5 file?