WorkerWatchdogTimeout after ARTIQ has been running for a while

aletofful · Jan 24, 2023

It typically happens that after ARTIQ has been running for about a day, even with only a few simple monitoring experiments running in the background, submitting a new experiment produces this error:

artiq.master.worker:worker exception details
Traceback (most recent call last):
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\worker.py", line 252, in _worker_action
    completed = await self._handle_worker_requests()
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\worker.py", line 238, in _handle_worker_requests
    await self._send(reply)
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\worker.py", line 169, in _send
    raise WorkerTimeout(
artiq.master.worker.WorkerTimeout: Timeout sending data to worker (RID 27875)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\scheduler.py", line 268, in _do
    completed = await run.run()
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\scheduler.py", line 34, in worker_method
    return await m(*args, **kwargs)
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\worker.py", line 278, in run
    completed = await self._worker_action({"action": "run"})
  File "C:\Users\ssr1\.conda\envs\artiq-677\lib\site-packages\artiq\master\worker.py", line 254, in _worker_action
    raise WorkerWatchdogTimeout
artiq.master.worker.WorkerWatchdogTimeout

This always happens when I launch a new experiment in the morning after no new experiments have been launched overnight, so I am not sure whether the problem is related to ARTIQ/experiments that have been running for a long time, or the fact that no new experiments have been launched in a long period of time.

Resubmitting the experiment a few times sometimes resolves the issue, and rebooting ARTIQ always does. Neither workaround is acceptable for us, now or going forward. Could anyone explain why this issue arises?

UPDATE: I set up a background experiment that schedules a simple experiment (logging the numbers from 1 to 100) every hour. It was launched at 14:00. During the day other experiments were submitted regularly until around 18:00, and ARTIQ was working fine. After that, the hourly experiments worked until 21:00, when the timeout error occurred. So it seems that the timeout still occurs even if ARTIQ is kept "active" by having experiments submitted regularly.

rjo · Feb 9, 2023

Maybe a firewall or virus scanner is killing some TCP connections among the processes. Could you try deactivating that?
I don't think this has been observed elsewhere.

aletofful · Feb 9, 2023

@rjo The firewall is disabled, and the antivirus (McAfee) event log shows no scans overnight.

I am currently investigating the possibility that a desynchronisation of the core and PC clocks is causing the timeout to occur. I will post an update when I have confirmed a fix for this.

rjo · Feb 9, 2023

That sounds intriguing. If you can show the relevant parts of that hourly experiment maybe we can build a hypothesis.

aletofful · Mar 13, 2023

@rjo While debugging this problem I have run into further, related errors besides the WorkerWatchdogTimeout, so I will try to summarise my findings so far. The premise is that something is causing our computer to slow down significantly over time. We diagnosed this by monitoring the thread count in the CPU performance tab, which increases continuously (e.g. from 2600 to 4400 in 6 hours). In the Resource Monitor, the process with the most threads, and whose thread count keeps increasing, is "NT Kernel & System". We are not yet sure whether this is due to an ARTIQ experiment or to some other software running on the computer, and this month we are not in a position to debug it by stopping individual pieces of software. Another possibly useful piece of information is that when ARTIQ is launched, the CPU starts running at 110% of its base frequency and remains constantly overclocked.
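As a rough check on the figures above, the growth rate can be estimated from logged samples. The sketch below fits a least-squares slope to (time, thread count) pairs. The function name `growth_per_hour` and the sample format are illustrative only and do not come from any tool mentioned in this thread.

```python
def growth_per_hour(samples):
    """Least-squares slope of thread count against time, in threads per hour.

    samples: iterable of (hours_since_start, thread_count) pairs.
    """
    pts = list(samples)
    if len(pts) < 2:
        raise ValueError("need at least two samples")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_c = sum(c for _, c in pts) / n
    # Slope = covariance(time, count) / variance(time)
    num = sum((t - mean_t) * (c - mean_c) for t, c in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    if den == 0:
        raise ValueError("samples must cover more than one point in time")
    return num / den


# The two readings quoted above: 2600 threads at the start, 4400 after 6 hours.
rate = growth_per_hour([(0.0, 2600), (6.0, 4400)])
print(rate)  # 300.0
```

With more than two samples, a roughly constant slope would point to a steady leak rather than a one-off jump.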

For some context on how we currently use ARTIQ to run experiments with our ion trap: we run a lock experiment that continuously keeps us locked to an atomic transition. In a separate pipeline, an ion monitoring experiment checks every 5 minutes whether the fluorescence level is above a set threshold. If it detects that the level has dropped below the threshold, we request the lock experiment to end and then launch an ion recovery experiment with higher priority than the monitoring one. Upon recovery, a new lock experiment is launched and the monitoring resumes. All the errors described below prevent the ion recovery experiment from launching correctly.

WorkerWatchdogTimeout: In my many attempts to get rid of this error, I found a solution based on a former colleague's code that resynchronises the core and PC clocks. The timestamp synchroniser class is shown below:

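import time

from artiq.experiment import HasEnvironment, host_only


class TimestampSingleton(HasEnvironment):
    # Shared store, reachable from every instance through the class.
    _sing = None  # type: __Singleton
    _rid = None

    class __Singleton:
        def __init__(self):
            # Both dicts are keyed by RID.
            self.start_times_mu = {}    # RTIO counter value, in machine units
            self.start_timestamps = {}  # PC wall-clock time from time.time()

    def build(self):
        self.setattr_device("core")

    def __init__(self, mgr, rid):
        super().__init__(mgr)
        # Note: a new store is created on every instantiation, replacing
        # whatever an earlier instance recorded.
        TimestampSingleton._sing = TimestampSingleton.__Singleton()
        self._rid = rid

    @host_only
    def sync(self):
        # Record the core device's RTIO counter and the PC clock back to back.
        TimestampSingleton._sing.start_times_mu[self._rid] = self.core.get_rtio_counter_mu()
        TimestampSingleton._sing.start_timestamps[self._rid] = time.time()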

Then, in the ion monitoring experiment, the following code is added to the build() function:

if hasattr(self.scheduler, "rid"):
    __class__.timestamp_singleton = TimestampSingleton(self, self.scheduler.rid)

and finally, just before the line of code that schedules the ion recovery experiment, the timestamps are synchronised by calling:

__class__.timestamp_singleton.sync()

As this is not my own code, I don't fully understand whether it should be expected to fix the WorkerWatchdogTimeout, but it seems to. The points below cover other errors I have experienced due to the PC slowdown; hopefully this helps anyone having similar issues.

ConnectionAbortedError [WinError 10053]: This error appeared when the recovery experiment was launched too soon after requesting the lock experiment to terminate (i.e. with no delay between the two actions). Once again, it only happens when the PC has been running for some time: because of the slowdown, the lock experiment takes a particularly long time to terminate, so the recovery was launched while the lock experiment was still shutting down. Adding a 15 second delay between requesting termination of the lock and launching the recovery fixed this.
SystemExit: This error appeared in the automatically scheduled recovery experiment, crashing it before it could even start. It always pointed to the self.setattr_device("core") line in the build() function of the recovery experiment. Again, it only happens when the PC has been running for some time. It seems to have been fixed by always building the core device before any other device, i.e. placing the self.setattr_device("core") line at the start of build(), or just after all the arguments are defined but before any other devices are built. I still don't understand why this error doesn't appear when the PC is fresh from a restart, or why this fix works.
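The fixed 15 second delay described under ConnectionAbortedError could, in principle, be replaced by polling until the lock experiment has actually left the schedule. The sketch below is one possible approach, not the code used in this thread. `wait_until_gone` and `FakeScheduler` are hypothetical names, and the sketch assumes a `get_status()` callable that returns a mapping keyed by RID, which is how I understand the ARTIQ scheduler device's `get_status()` to behave.

```python
import time


def wait_until_gone(get_status, rid, timeout=60.0, poll=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll get_status() until rid no longer appears in it.

    Returns True if the run disappeared before the timeout, False otherwise.
    """
    deadline = clock() + timeout
    while rid in get_status():
        if clock() >= deadline:
            return False
        sleep(poll)
    return True


class FakeScheduler:
    """Stand-in for demonstration: reports RID 1 for the first three polls."""

    def __init__(self):
        self.polls = 0

    def get_status(self):
        self.polls += 1
        return {1: {"status": "deleting"}} if self.polls <= 3 else {}


fake = FakeScheduler()
ok = wait_until_gone(fake.get_status, rid=1, timeout=5.0, poll=0.01)
print(ok, fake.polls)  # True 4
```

In an experiment, one might pass `self.scheduler.get_status` here after requesting termination of the lock experiment, and submit the recovery only once the function returns True. Unlike a fixed delay, this adapts to however long termination takes on a slowed-down PC, and the timeout keeps it from waiting forever.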

If anyone has any idea of what could be causing such a PC slowdown which leads to this variety of ARTIQ errors, please let me know!

sb10q · Mar 16, 2023

aletofful wrote: "when ARTIQ is launched"

Which part (master, dashboard, ...)?

aletofful · Mar 16, 2023

@sb10q This is when launching an ARTIQ session consisting of a master, controller and dashboard.

sb10q · Mar 20, 2023

That's not a very helpful answer, please narrow it down.

aletofful · Mar 31, 2023

@sb10q Sorry for the late reply. I've established that after launching the ARTIQ dashboard, the CPU frequency goes to 110% while it loads the repository, but goes back down once the list of experiments is loaded. We then found that another piece of software, not related to ARTIQ, keeps the CPU frequency at 110% for as long as it is running. We haven't yet confirmed whether this software is the cause of the PC slowdown leading to these ARTIQ errors, but next week we should get the chance to look into the issue in more depth.