Hello,
We have run into a problem where the kernel of a background experiment (an experiment scheduled with low priority, so it runs only when no other experiments are running) stops every 1 to 3 days. The background experiment reads PMT counts and sets DDS parameters. The error message is attached below. From the message, it seems to be related to RPC calls. The experiment makes ~40 RPC calls per second, ~10 of which are async calls, so it makes on the order of 300 000 RPC calls before raising the error.
root:Terminating with exception (KeyError: 40817)
Traceback (most recent call last):
  File "C:\Users\scientist\code\artiq\artiq\master\worker_impl.py", line 300, in main
    exp_inst.run()
  File "C:\Users\scientist\code\jax\tools\experiments\pmt.py", line 52, in run
    self.run_kernel()
  File "C:\Users\scientist\code\artiq\artiq\language\core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\Users\scientist\code\artiq\artiq\coredevice\core.py", line 137, in run
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "C:\Users\scientist\code\artiq\artiq\coredevice\comm_kernel.py", line 642, in serve
    self._serve_rpc(embedding_map)
  File "C:\Users\scientist\code\artiq\artiq\coredevice\comm_kernel.py", line 542, in _serve_rpc
    args, kwargs = self._receive_rpc_args(embedding_map)
  File "C:\Users\scientist\code\artiq\artiq\coredevice\comm_kernel.py", line 394, in _receive_rpc_args
    value = self._receive_rpc_value(embedding_map)
  File "C:\Users\scientist\code\artiq\artiq\coredevice\comm_kernel.py", line 387, in _receive_rpc_value
    return receivers.get(tag)(self, embedding_map)
  File "C:\Users\scientist\code\artiq\artiq\coredevice\comm_kernel.py", line 148, in <lambda>
    embedding_map.retrieve_object(kernel._read_int32()),
  File "C:\Users\scientist\code\artiq\artiq\compiler\embedding.py", line 118, in retrieve_object
    return self.object_forward_map[obj_key]
KeyError: 40817
There seems to be nothing in the core log about this error at the INFO level. After the error is raised, we also seem to lose control of all Sinara devices (e.g. reading TTL counts, setting TTL/DDS states) until we restart the Sinara hardware. The experiment that produces the error is too long to attach here.
Before working on a minimal experiment to reproduce this error (it is hard to test, as it occurs only once every few days), I am wondering whether anyone has seen such KeyErrors in the kernel before, or has any insight into the error message above.
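For context, our reading of the last two frames of the traceback is that the host reads an integer object key from the RPC stream and looks it up in a table of host objects. The sketch below is a simplified stand-in written only for illustration: `ToyEmbeddingMap` and `describe_lookup` are hypothetical names, not ARTIQ code, and we have not confirmed that this is the actual cause. It only shows that a key the host never handed out (for example, one coming from a corrupted or misaligned read) would fail in the same way.

```python
class ToyEmbeddingMap:
    """Hypothetical, simplified stand-in for a host-side object table.

    This is NOT ARTIQ's implementation; it only mimics the idea of a
    forward map (key -> object) that is indexed with keys received over RPC.
    """

    def __init__(self):
        self.object_forward_map = {}   # key -> host object
        self.object_reverse_map = {}   # id(host object) -> key
        self.object_current_key = 0

    def store_object(self, obj):
        """Register a host object and return its integer key."""
        obj_id = id(obj)
        if obj_id in self.object_reverse_map:
            return self.object_reverse_map[obj_id]
        self.object_current_key += 1
        self.object_forward_map[self.object_current_key] = obj
        self.object_reverse_map[obj_id] = self.object_current_key
        return self.object_current_key

    def retrieve_object(self, obj_key):
        """Look up a previously stored object; raises KeyError if unknown."""
        return self.object_forward_map[obj_key]


def describe_lookup(embedding_map, obj_key):
    """Return a short description of a lookup instead of raising KeyError."""
    try:
        obj = embedding_map.retrieve_object(obj_key)
    except KeyError:
        known = sorted(embedding_map.object_forward_map)
        return "unknown key {} (known keys: {})".format(obj_key, known)
    return "key {} -> {!r}".format(obj_key, obj)


if __name__ == "__main__":
    toy_map = ToyEmbeddingMap()
    experiment = object()              # placeholder for an embedded host object
    key = toy_map.store_object(experiment)
    print(describe_lookup(toy_map, key))     # a key the host handed out
    print(describe_lookup(toy_map, 40817))   # a key the host never handed out
```

If this reading is right, the value 40817 itself would be the interesting part, since our experiment should only ever embed a handful of host objects.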
We use gateware version 7.7817.4bfd010f.beta, ARTIQ version (https://github.com/m-labs/artiq/tree/21b07dc6673454b74ac9116cad59df65e6d9a467), and Python 3.8 on Windows 10.
Thank you!
Mingyu