We run a custom system of three Kasli-SoCs (the master is connected directly to both satellites). The gateware is built from the latest artiq-zynq release-8 branch. Cards can be controlled with minimal examples (e.g. the LEDs of the Urukuls on a satellite do turn on and off when sw is set), but anything beyond that, including artiq_sinara_tester, fails with
ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
The minimal experiment causing the issue is simply:
    from artiq.experiment import *


    class TimeoutExample(EnvExperiment):
        def build(self):
            self.setattr_device("core")

        @kernel
        def run(self):
            print("beginning of the end")
            self.core.reset()
            delay(1.0 * s)
            self.core.wait_until_mu(now_mu())
            print("end of the beginning")
Lowering the delay makes the problem appear less frequently, until it disappears completely. Removing the wait_until_mu call also makes the error messages disappear.
The trace on the master looks something like this:
[ 2790.725019s] TRACE(dyld::reloc): resolved symbol "rpc_send_async"
[ 2790.731102s] TRACE(dyld::reloc): resolved symbol "__artiq_personality"
[ 2790.737619s] TRACE(dyld::reloc): resolved symbol "rpc_send"
[ 2790.743183s] TRACE(dyld::reloc): resolved symbol "rpc_recv"
[ 2790.748738s] TRACE(dyld::reloc): resolved symbol "rtio_init"
[ 2790.754381s] TRACE(dyld::reloc): resolved symbol "rtio_get_counter"
[ 2790.760638s] TRACE(dyld::reloc): resolved symbol "at_mu"
[ 2790.765933s] TRACE(dyld::reloc): resolved symbol "delay_mu"
[ 2790.771488s] TRACE(dyld::reloc): resolved symbol "now_mu"
[ 2790.776871s] DEBUG(ksupport::kernel::core1): kernel loaded
[ 2790.782513s] INFO(ksupport::kernel::core1): kernel starting
[ 2790.788152s] TRACE(ksupport::eh_artiq): reset exception buffer
[ 2790.793977s] TRACE(ksupport::rpc): send<2>(String)->None
[ 2790.799522s] TRACE(runtime::rpc_async): recv ...->None
[ 2791.124464s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[ 2791.332460s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
[ 2791.741451s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[ 2791.804749s] TRACE(ksupport::rpc): send<2>(String)->None
[ 2791.810497s] TRACE(runtime::rpc_async): recv ...->None
[ 2791.815624s] TRACE(ksupport::rpc): send<1>(None)->None
[ 2791.820746s] INFO(ksupport::kernel::core1): kernel finished
[ 2791.830853s] INFO(runtime::comms): peer closed connection
[ 2791.948455s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
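In case it helps to correlate the timeouts with the kernel timeline, here is a minimal sketch that pulls the DRTIO timeout timestamps out of a master log and computes the gap between consecutive timeouts per destination. It is plain Python and not part of ARTIQ; the names timeout_times and timeout_intervals are our own, and the regular expression only assumes the log line format shown above. For the excerpt above it gives roughly 0.62 s between consecutive timeouts on each destination.

```python
import re
from collections import defaultdict

# Matches master log lines of the form shown in the trace, e.g.
# [ 2791.124464s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)s\]\s+ERROR\(runtime::rtio_mgt::drtio\):\s+"
    r"\[DEST#(?P<dest>\d+)\] communication failed \(timed out\)"
)


def timeout_times(log_text):
    """Return {destination: [timestamps in seconds]} for DRTIO timeout errors."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            times[int(match.group("dest"))].append(float(match.group("t")))
    return dict(times)


def timeout_intervals(log_text):
    """Return {destination: [gaps in seconds between consecutive timeouts]}."""
    return {
        dest: [round(later - earlier, 6) for earlier, later in zip(ts, ts[1:])]
        for dest, ts in timeout_times(log_text).items()
    }


# Lines copied from the master trace; non-matching lines are ignored.
SAMPLE = """\
[ 2790.799522s] TRACE(runtime::rpc_async): recv ...->None
[ 2791.124464s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[ 2791.332460s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
[ 2791.741451s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[ 2791.948455s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
"""

if __name__ == "__main__":
    for dest, gaps in sorted(timeout_intervals(SAMPLE).items()):
        print(f"DEST#{dest}: gaps between timeouts (s): {gaps}")
```

Running the same functions over a longer capture with different delay values would show whether the spacing between timeouts stays constant or follows the delay.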
The UART on the satellite (which we could not quickly set to TRACE level, because artiq_coremgmt lacks support for satellites) shows the following whenever the master reports a failure:
WARN(satman): received unexpected aux packet: EchoReply
Any hints on the potential cause or a workaround?