@rjo In the process of debugging this problem I have run into further related errors besides WorkerWatchdogTimeout, so I will try to summarise my findings so far. The premise is that something is causing our computer to slow down significantly over time. We diagnose this by watching the thread count in the CPU performance tab, which increases continuously (e.g. from 2600 to 4400 in 6 hours). In the Resource Monitor, the process with the most threads, and whose thread count keeps increasing, is "NT Kernel & System". We are not yet sure whether this is due to an ARTIQ experiment or to some other software running on the computer, and for this month we are not in a position to debug it by leaving certain pieces of software switched off. Another possibly useful piece of information: when ARTIQ is launched, the CPU starts running at 110% of its base frequency and stays constantly overclocked.
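To put a number on the thread growth, the count can be noted at intervals and a rate fitted to the samples. The sketch below is only an illustration: `thread_growth_per_hour` is a hypothetical helper, not part of ARTIQ or of our setup, and the two samples are simply the figures quoted above (2600 threads rising to 4400 over 6 hours).

```python
def thread_growth_per_hour(samples):
    """Least-squares slope of thread count versus time, in threads per hour.

    `samples` is a list of (elapsed_hours, thread_count) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span a non-zero time interval")
    cov = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    return cov / var_t


# Figures from the observation above: 2600 threads at start, 4400 after 6 hours.
rate = thread_growth_per_hour([(0.0, 2600), (6.0, 4400)])
print(rate)  # 300.0
```

Logging more than two points would show whether the growth is steady or tied to particular events such as an experiment being scheduled.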
For some context on how we are currently using ARTIQ to run experiments with our ion trap: we run a lock experiment that continuously keeps us locked to an atomic transition. In a separate pipeline we have an ion monitoring experiment which checks every 5 minutes whether the level of fluorescence is above a set threshold. If it detects that the fluorescence has gone below the threshold, we request the lock experiment to end and then launch an ion recovery experiment with higher priority than the monitoring one. Upon recovery, a new lock experiment is launched and the monitoring is resumed. All the errors described below prevent the ion recovery experiment from launching correctly.
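The decision made at each monitoring check can be sketched roughly as follows. This is not our actual experiment code: `monitor_step`, the pipeline name, the priority value and the expid contents are all hypothetical, and `FakeScheduler` is a stand-in so the snippet runs without ARTIQ. The only assumption about ARTIQ itself is that the scheduler device provides `request_termination(rid)` and `submit(pipeline_name, expid, priority)`, with `submit` returning the new RID.

```python
def monitor_step(scheduler, fluorescence, threshold, lock_rid,
                 recovery_expid, pipeline_name="main", recovery_priority=10):
    """One monitoring check.

    Returns the RID of the submitted recovery run, or None if no action was needed.
    """
    if fluorescence >= threshold:
        return None  # ion still bright enough, nothing to do
    # Ask the lock experiment to end gracefully ...
    scheduler.request_termination(lock_rid)
    # ... then queue the recovery with a higher priority than the monitor.
    return scheduler.submit(pipeline_name, recovery_expid, recovery_priority)


class FakeScheduler:
    """Stand-in for the ARTIQ scheduler device; it only records the calls made."""

    def __init__(self):
        self.calls = []
        self._next_rid = 100

    def request_termination(self, rid):
        self.calls.append(("request_termination", rid))

    def submit(self, pipeline_name, expid, priority):
        self.calls.append(("submit", pipeline_name, priority))
        self._next_rid += 1
        return self._next_rid


sched = FakeScheduler()
# Hypothetical expid; a real one would describe the recovery experiment to run.
rid = monitor_step(sched, fluorescence=5.0, threshold=20.0, lock_rid=7,
                   recovery_expid={"class_name": "IonRecovery"})
print(rid, sched.calls)
```

Note that this version submits the recovery immediately after requesting termination, which is exactly the situation that turned out to be fragile once the PC had slowed down.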
- WorkerWatchdogTimeout: in my many attempts to get rid of this error, I found a solution, based on a former colleague's code, that resynchronises the core and PC clocks. The timestamp synchroniser class is shown below:
import time

from artiq.experiment import HasEnvironment, host_only


class TimestampSingleton(HasEnvironment):
    _sing = None  # type: __Singleton
    _rid = None

    class __Singleton:
        def __init__(self):
            # Both dictionaries are keyed by RID.
            self.start_times_mu = {}    # RTIO counter value, in machine units
            self.start_timestamps = {}  # PC wall-clock time from time.time()

    def build(self):
        self.setattr_device("core")

    def __init__(self, mgr, rid):
        super().__init__(mgr)
        TimestampSingleton._sing = TimestampSingleton.__Singleton()
        self._rid = rid

    @host_only
    def sync(self):
        # Record the core device counter and the PC time for this RID.
        TimestampSingleton._sing.start_times_mu[self._rid] = self.core.get_rtio_counter_mu()
        TimestampSingleton._sing.start_timestamps[self._rid] = time.time()
Then, in the ion monitoring experiment, the following code is added to the build() function:
if hasattr(self.scheduler, "rid"):
    __class__.timestamp_singleton = TimestampSingleton(self, self.scheduler.rid)
and finally, just before the line of code that schedules the ion recovery experiment, the timestamps are synchronised by calling:
__class__.timestamp_singleton.sync()
As this is not my own code, I don't fully understand whether it is expected to fix the WorkerWatchdogTimeout, but it seems to. The points below cover other errors I have been experiencing due to the slowdown of the PC; hopefully this is helpful to anyone having similar issues.
- ConnectionAbortedError [WinError 10053]: This error appeared when trying to launch the recovery experiment too soon after requesting the lock experiment to terminate (i.e. with no time delay between the two actions). Once again, it only happens when the PC has been running for some time: because of the slowdown, the lock experiment takes a particularly long time to terminate, so the recovery was launched while the lock experiment was still terminating. This issue was fixed by simply adding a 15 second delay between requesting the termination of the lock and launching the recovery.
- SystemExit: This error appeared in the automatically scheduled recovery experiment, crashing it before it could even start. It always pointed to the self.setattr_device("core") line in the build() function of the recovery experiment. Again, it only happens when the PC has been running for some time. It seems to have been fixed by always building the core device before any other device, i.e. placing the self.setattr_device("core") line at the start of the build() function, or just after all the arguments have been defined but before any other devices are built. I still don't understand why this error doesn't appear when the PC is fresh from a restart, or why this fix actually works.
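As a possible alternative to the fixed 15 second delay used for the ConnectionAbortedError above, the monitor could wait until the lock run has actually gone before submitting the recovery. The sketch below is untested on our system. `wait_for_run_to_finish` is a hypothetical helper, and it assumes that the scheduler's `get_status()` returns a dictionary keyed by RID and that a run is dropped from it once it has finished terminating.

```python
import time


def wait_for_run_to_finish(get_status, rid, timeout=60.0, poll_interval=1.0,
                           sleep=time.sleep, clock=time.monotonic):
    """Block until `rid` is no longer listed by `get_status()`.

    Returns True if the run disappeared within `timeout` seconds, False otherwise.
    `sleep` and `clock` are injectable so the function can be tested without waiting.
    """
    deadline = clock() + timeout
    while rid in get_status():
        if clock() >= deadline:
            return False  # still present after the timeout; caller decides what to do
        sleep(poll_interval)
    return True
```

Inside the monitoring experiment this would be called as something like `wait_for_run_to_finish(self.scheduler.get_status, lock_rid)` between `request_termination` and `submit`. Unlike a fixed delay, it adapts to however slow the PC has become, and the timeout still gives an upper bound on the wait.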
If anyone has any idea of what could be causing such a PC slowdown which leads to this variety of ARTIQ errors, please let me know!