We have been having a problem where at seemingly random intervals, our Kasli stops responding for some number of minutes, then restarts.
At first this occurred every 10-20 minutes, but yesterday the frequency increased to every couple of minutes. The frequency has since decreased again. It seems roughly correlated with how often we are submitting experiments (often to adjust DDS outputs, analog outputs that we leave on), but this is not a scientific statement -- it is difficult to pin down an issue that occurs on these timescales.
During the period the Kasli is not responding, the “Error” light is not lit. No error messages are logged to UART (log level set to TRACE) at any point.
Right now we are running the following simple idle kernel as a placeholder.
    from artiq.experiment import *


    class IdleKernel(EnvExperiment):
        def build(self):
            self.setattr_device("core")

        @kernel
        def run(self):
            self.core.reset()
            start_time = now_mu() + self.core.seconds_to_mu(20*s)
            while self.core.get_rtio_counter_mu() < start_time:
                pass
The UART log of the idle kernel finishing and starting again has been our indicator of whether the Kasli has stopped responding.
When Kasli stops responding, there is no new UART output, and attempts to submit more experiments (via artiq_run or the dashboard) time out. Some time later (roughly a minute, sometimes up to a couple of minutes), the MiSoC bootloader startup is logged again to the UART output, and Kasli then responds as normal.
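To put numbers on how often and for how long this happens, a host-side script could scan a timestamped copy of the UART log for silent gaps. The following is only a minimal sketch that we have not validated on our setup. `find_outages`, `BOOT_MARKER`, and the 30 s threshold are hypothetical names and values chosen for illustration, and the marker string is an assumption about what the bootloader banner contains. It also assumes each UART line was timestamped on the host as it was read from the serial port, which is not shown here.

```python
from datetime import datetime, timedelta

# Assumption: a substring of the bootloader banner as it appears on our UART.
# Adjust this to match the actual log output.
BOOT_MARKER = "MiSoC Bootloader"


def find_outages(entries, max_gap=timedelta(seconds=30), boot_marker=BOOT_MARKER):
    """Find silent gaps in a timestamped UART log.

    entries: iterable of (datetime, str) pairs in chronological order.
    max_gap: the idle kernel restarts roughly every 20 s, so a longer
             silence is treated as "not responding" (threshold is a guess).

    Returns a list of (last_seen, resumed, rebooted) tuples, where rebooted
    is True if the first line after the gap contains boot_marker.
    """
    outages = []
    prev_time = None
    for stamp, line in entries:
        if prev_time is not None and stamp - prev_time > max_gap:
            outages.append((prev_time, stamp, boot_marker in line))
        prev_time = stamp
    return outages


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 12, 0, 0)
    sample = [
        (t0, "idle kernel started"),
        (t0 + timedelta(seconds=20), "idle kernel finished"),
        (t0 + timedelta(seconds=110), "MiSoC Bootloader"),
        (t0 + timedelta(seconds=115), "idle kernel started"),
    ]
    for last_seen, resumed, rebooted in find_outages(sample):
        print(last_seen, resumed, resumed - last_seen, rebooted)
```

Comparing the resulting outage times against the experiment submission times from the scheduler would show whether the correlation with submission rate is real.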
All DDS channels turn off when Kasli stops responding. I assume other channel types also turn off, but we often output continuously on the DDS channels to drive AOMs, so these are the ones we notice. This has proven very frustrating during alignment, since we must either power-cycle Kasli or wait for it to respond again before our startup kernel turns the DDS channels back on.
Please advise what information is needed to further diagnose the issue.