How do users choose the folder to save data in?

fanmingyu212

By default, ARTIQ saves data in the "results" subfolder of the working directory from which artiq_master is started. Is there an argument that can be used to change the data saving directory?

We edited the artiq_master.py file in our ARTIQ fork to save data in arbitrary locations, but I am wondering if vanilla ARTIQ supports or should support this.

sb10q

Why do you need this? Is changing the current directory before running the master not good enough? The general idea is the current directory is the "home folder" for your ARTIQ instance, with the device and dataset DBs and the results all in one place.

fanmingyu212

sb10q Is changing the current directory before running the master not good enough?

The most important reason for us is that we want to save data in a folder structure that groups data by names of different control computers, e.g.,

data
- computer_1_data
  - artiq_data
  - other_data (for example, data manually recorded)
- computer_2_data
  - artiq_data
  - other_data
- ...

The root "data" folder is shared between different computers with tools such as NAS or Google Drive. ARTIQ data are saved in the corresponding "artiq_data" folder of each of the computers. We can in principle run artiq_master from the "artiq_data" directory, but that creates one additional folder that we don't need in our data saving structure, and also stores the device_db.py and other log files on the NAS which we might not want.

On the other hand, since artiq_master supports user-defined paths for device_db, dataset_db, repository, and log files, shouldn't it support saving data files in a user-defined directory too?
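
For reference, the workaround of combining a chosen working directory with explicit configuration paths could look roughly like the sketch below. It is not ARTIQ code: `build_master_command`, `start_master`, and the file names are hypothetical, and the flag names (`--device-db`, `--dataset-db`, `--repository`, `--log-file`) are assumed from the options mentioned above and should be checked against `artiq_master --help` for the installed version. Results would still end up in a "results" subfolder of the working directory.

```python
import subprocess
from pathlib import Path


def build_master_command(config_dir):
    """Build an artiq_master command line whose config files live in config_dir.

    Flag and file names are assumptions; verify them with `artiq_master --help`.
    """
    config_dir = Path(config_dir)
    return [
        "artiq_master",
        "--device-db", str(config_dir / "device_db.py"),
        "--dataset-db", str(config_dir / "dataset_db.mdb"),
        "--repository", str(config_dir / "repository"),
        "--log-file", str(config_dir / "master.log"),
    ]


def start_master(config_dir, data_dir):
    """Start the master with data_dir as its working directory (hypothetical helper)."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    # The master writes its "results" subfolder relative to the working directory.
    return subprocess.Popen(build_master_command(config_dir), cwd=data_dir)
```

This keeps device_db.py and the log file off the shared drive, but it does not remove the extra "results" folder level.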

fanmingyu212

@sb10q I just want to follow up on this issue. Do you think it is worth discussing in the artiq repo? I can also prepare a PR that adds an argument to artiq_master for saving data to another location.

sb10q

Yes, send a PR.

rmattish

Is there any update with regards to this functionality? It would be really great to be able to specify a directory to save data to. Also if we could add a file name prefix for better organization, that would be nice. Organizing data by timestamps alone isn't really all that helpful.

pyquest

Hi all. I am also interested in this functionality. Is there any update?

From the source code at https://m-labs.hk/artiq/manual/_modules/artiq/language/environment.html#HasEnvironment.set_dataset the dataset save code is completely hidden. Where can I find it?

Also, doesn't the artiq master only run in the artiq master directory?

dpn

I don't think there have been any developments, but also choosing output directories is rather contrary to the workflow that really works well in a lot of laboratories. The ARTIQ result files are all saved in a single output directory (well, subdivided by date), which acts as one central location for archival, with the RID (run id) acting as a unique identifier. This directory can be mirrored (using programs like lsyncd) to network shares as required, and analysis scripts, etc. can in turn pull from this central repository.

By keeping the files in one place, with a single canonical naming scheme (the run id), data provenance is easy to ensure – even when coming back to some old results years later, all the old data will be easy to locate. Further search indexes, extra metadata, etc. can be built on top of this. Of course, this isn't the only potential design, but it really has worked well here.
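
As a rough illustration of locating old data by RID, the sketch below searches a results tree for the file belonging to a given run. `find_result_by_rid` is a hypothetical helper, and the file name pattern (a zero-padded 9-digit RID followed by `-<ExperimentName>.h5`) is an assumption that should be checked against the files in your own results directory.

```python
from pathlib import Path


def find_result_by_rid(results_root, rid):
    """Return all HDF5 files under results_root whose name starts with the RID.

    Assumes names of the form "<9-digit zero-padded RID>-<Experiment>.h5".
    """
    pattern = f"{rid:09d}-*.h5"
    # rglob descends through the date (and hour) subdirectories.
    return sorted(Path(results_root).rglob(pattern))
```

Because the RID is unique, the search does not need to know the date on which the experiment ran.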

dpn

As for the implementation, see artiq.master.worker_db (and the top-level worker process code in artiq.master.worker_impl).

pyquest

dpn

Apologies for the code posted below. I do not understand how to format it properly in the code brackets here. I keep trying and failing...

I think we get the philosophy. But that workflow can be suboptimal depending on how sophisticated the measurements are. For instance, let's imagine a control script with the ability to run 200 different nested for loops. This is completely unmanageable/doesn't scale with the current nested loop approach (unless I am missing something). It is also a real challenge when data is manually recorded. Where does that get put? So the extremes of high level automation and manual data entry don't work great with the current workflow.

I'll give an example for a high level of sophistication since the manual case is obvious.

A much cleaner way to handle loops (and change their order) would be to use numpy's meshgrid and a function like this to create the nested loops and make the order easy to alter.

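```python
import numpy as np

def loops(*arrs):
    """Create nested for loops for an arbitrary number of arguments;
    the first argument is the outermost loop and the last the innermost."""
    # indexing="ij" keeps the axes in argument order
    grid = np.meshgrid(*arrs, indexing="ij")
    return np.stack(grid, axis=-1).reshape(-1, len(arrs))

loop_array = loops(x, y, z)
```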
Then, to track these loops (i.e. generate metadata), we might put the array into a dataframe with column names:

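```python
import pandas as pd

# column_names: one label per scanned parameter; direc: output directory prefix
scan_df = pd.DataFrame(loop_array, columns=column_names)
scan_df.to_csv(direc + 'scan_params.csv')
```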
Now the 200 nested loops of the script are replaced with a single for loop. Each "state" of the experimental parameters is recorded as a row of the loops array.
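
The same flattening can be sketched with only the standard library, which also keeps the parameter names attached to each point. This is an illustrative alternative rather than the approach used above: `scan_points` and the example axes are hypothetical names.

```python
import itertools


def scan_points(**axes):
    """Yield one dict per grid point; the first keyword is the outermost loop."""
    names = list(axes)
    # itertools.product advances the last iterable fastest, like nested for loops.
    for values in itertools.product(*axes.values()):
        yield dict(zip(names, values))


points = list(scan_points(x=[0, 1], y=[10, 20], z=[5]))
for index, point in enumerate(points):
    # A real experiment would apply point["x"], point["y"], ... and take data here.
    pass
```

Reordering the keyword arguments reorders the loops, and the list of dicts doubles as the scan metadata.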

It would notionally be nice to put the loop metadata into the H5 file. However, my understanding is that the H5 file is written after a scan and that the datasets in the environment don't support dataframes. Hence, we would like the ability to alter the default directory structure. That is, it would be better to make a folder for each RID, save the H5 file in that folder by default, and then add other files as needed in that directory.
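
To make the proposal concrete, a one-folder-per-RID layout could be computed along the lines of the sketch below. Nothing here exists in ARTIQ: `rid_directory` is a hypothetical helper, and the date and hour subdirectories and the zero-padded file name are assumptions modelled on the current results layout.

```python
import time
from pathlib import Path


def rid_directory(results_root, rid, exp_name, start_time=None):
    """Return (directory, h5_path) for a hypothetical one-folder-per-RID layout."""
    # start_time is seconds since the epoch; None means "now".
    start = time.localtime(start_time)
    directory = (Path(results_root)
                 / time.strftime("%Y-%m-%d", start)
                 / time.strftime("%H", start)
                 / f"{rid:09d}")
    return directory, directory / f"{rid:09d}-{exp_name}.h5"
```

Side files such as scan_params.csv would then be written next to the H5 file in the returned directory.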

For similar reasons, I think it would be an upgrade to the core language to give the scan object NoScan an explicit .sequence value. This would make the syntax identical to the other scan objects and would make it easier to generate the sequence for the scans described above. To me it is not a "NoScan"; it is a "Repeat Scan" that repeats x times, or possibly x*y times if in a nested loop.

```python
import numpy as np
from artiq.language.scan import ScanObject


class RepeatScan(ScanObject):
    def __init__(self, value, repetitions=1):
        self.value = value
        self.repetitions = repetitions
        # Explicit sequence, matching the other scan objects
        self.sequence = np.ones(self.repetitions) * self.value

    def _gen(self):
        for i in range(self.repetitions):
            yield self.value

    def __iter__(self):
        return self._gen()

    def __len__(self):
        return self.repetitions

    def describe(self):
        return {
            "ty": "BScan2",
            "value": self.value,
            "repetitions": self.repetitions
        }
```

dpn

pyquest I'm not sure I understand the point about 200 nested for loops; surely this is never a sensible solution? In either case, we have https://github.com/OxfordIonTrapGroup/ndscan for this, which allows you to select any number of parameters (out of thousands) to scan and/or override, without any explicit loops.

ndscan just saves the full metadata (which parameters are overridden/scanned, etc.) to the HDF5 file; no extra data to the side necessary.

We don't typically keep manually recorded values in a systematic fashion beyond a lab book. In my experience, it is best to include as much as possible inside the HDF5 file (you could e.g. add a dummy argument not actually used by the experiment if you often work with manually generated data series). Having a single file to archive (which also includes the source revision hash, etc.) is quite valuable. Of course, you could also have a manual data store on the side, whether in the same directory, or just a "database" of some kind (e.g. a table, or directory of files) indexed by RID.
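
A side store indexed by RID can be as simple as a CSV table. The following is a minimal sketch using only the standard library; the function names, the column names, and the idea of keeping free-text notes are all hypothetical choices for illustration.

```python
import csv
from pathlib import Path

FIELDS = ["rid", "note"]


def append_manual_record(table_path, rid, note):
    """Append one manually recorded note for the given RID to a CSV table."""
    table_path = Path(table_path)
    is_new = not table_path.exists()
    with table_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            # Write the header only when the table is first created.
            writer.writeheader()
        writer.writerow({"rid": rid, "note": note})


def records_for_rid(table_path, rid):
    """Return all notes stored for the given RID, in insertion order."""
    with Path(table_path).open(newline="") as f:
        return [row["note"] for row in csv.DictReader(f)
                if int(row["rid"]) == rid]
```

Since the RID also appears in the HDF5 file name, the table and the result files can be joined later without any extra bookkeeping.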

None of this would be an argument against having an option to switch to a result path format with a directory for each RID per se; the only consideration that comes to mind is the overhead from the added complexity. I just mean to point out that one can work around the limitations of HDF5, and having a single file as an atomic unit representing each experiment turns out to be very convenient.