"[DEST#N] communication failed (timed out)" with ARTIQ8-Zynq?

evilmav · Aug 21, 2024

We run a custom system of three Kasli-SoCs (the master is connected directly to both satellites). The gateware is built from the latest artiq-zynq release-8 branch. Cards can be controlled with minimal examples (e.g. the Urukul LEDs on a satellite turn on and off when sw is set), but the system fails with

ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)

beyond that, including with artiq_sinara_tester.

The minimal experiment causing the issue is simply:

from artiq.experiment import *


class TimeoutExample(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    @kernel
    def run(self):
        print("beginning of the end")
        self.core.reset()
        delay(1.0 * s)
        self.core.wait_until_mu(now_mu())
        print("end of the beginning")

Lowering the delay makes the problem appear less frequently, until it disappears completely. Removing the wait_until_mu call also makes the errors disappear from the log.

The trace on the master looks something like this:

[  2790.725019s] TRACE(dyld::reloc): resolved symbol "rpc_send_async"
[  2790.731102s] TRACE(dyld::reloc): resolved symbol "__artiq_personality"
[  2790.737619s] TRACE(dyld::reloc): resolved symbol "rpc_send"
[  2790.743183s] TRACE(dyld::reloc): resolved symbol "rpc_recv"
[  2790.748738s] TRACE(dyld::reloc): resolved symbol "rtio_init"
[  2790.754381s] TRACE(dyld::reloc): resolved symbol "rtio_get_counter"
[  2790.760638s] TRACE(dyld::reloc): resolved symbol "at_mu"
[  2790.765933s] TRACE(dyld::reloc): resolved symbol "delay_mu"
[  2790.771488s] TRACE(dyld::reloc): resolved symbol "now_mu"
[  2790.776871s] DEBUG(ksupport::kernel::core1): kernel loaded
[  2790.782513s]  INFO(ksupport::kernel::core1): kernel starting
[  2790.788152s] TRACE(ksupport::eh_artiq): reset exception buffer
[  2790.793977s] TRACE(ksupport::rpc): send<2>(String)->None
[  2790.799522s] TRACE(runtime::rpc_async): recv ...->None
[  2791.124464s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[  2791.332460s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
[  2791.741451s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[  2791.804749s] TRACE(ksupport::rpc): send<2>(String)->None
[  2791.810497s] TRACE(runtime::rpc_async): recv ...->None
[  2791.815624s] TRACE(ksupport::rpc): send<1>(None)->None
[  2791.820746s]  INFO(ksupport::kernel::core1): kernel finished
[  2791.830853s]  INFO(runtime::comms): peer closed connection
[  2791.948455s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)

The UART on the satellite (which I cannot quickly set to TRACE level, since artiq_coremgmt lacks support for satellites) shows the following whenever the master reports a failure:

  WARN(satman): received unexpected aux packet: EchoReply

Any hints on the potential cause or workaround?

sb10q · Aug 22, 2024

Do you have the same firmware version on all boards?

evilmav · Aug 22, 2024

All gatewares have been built from 367061aab8a939cae0b7683b80f43f1d3d87c0f7. Unfortunately, the UART output does not seem to contain the version string on boot, so I cannot verify which version each board thinks it is running...

EDIT: Reflashed all, definitely same gateware. No joy, same issue. Switched the copper SFP-SFP cable we use to an optical link to see if this is the cause, no joy either.

aonkes · Nov 13, 2024

We also ran into the same problem. Unlike @evilmav, we use a Kasli-SoC as master and a Kasli v2 as satellite. The DRTIO link is also set up with the transceivers recommended in the Sinara wiki on GitHub. We use the most recent version from artiq-zynq, ARTIQ v8.8954+4235309 (for building the gateware for both crates and for the software on the host).

We used artiq_sinara_tester, and it seems that the error messages (see evilmav's post) only appear when a TTL device is involved. When the test routines for the DIO cards or the Urukuls are executed (the Urukuls' RF switches appear as TTL devices in the device_db), we see these errors. If we only execute the test routine for the Fastinos, there are no errors.

We also switched the roles of the crates, so the Kasli v2 is now the master and the Kasli SoC is the satellite. Then we don't see the communication errors at all when we run the sinara tester.

evilmav · Nov 13, 2024

aonkes

Does my example cause errors for you too? I'm under the impression that it is not so much a matter of TTL or Fastino as of the idle times...

choelzl · Nov 18, 2024

We have the same problem with ARTIQ 8.8973+80ae6f5.
We also use the recommended fiber links. The same setup worked yesterday, before we replaced the Kasli 2.0 master with the Kasli-SoC; with the Kasli-SoC as master it now fails.

On the satellite I get this:

[   967.304396s]  WARN(satman): aux packet error (protocol error: unknown packet 0x9a)
[  1204.295618s]  WARN(satman): aux packet error (packet CRC failed)
[  1297.778924s]  WARN(satman): received unexpected aux packet
[  1375.372610s]  WARN(satman): received unexpected aux packet
[  1581.811100s]  WARN(satman): received unexpected aux packet
[  1699.203397s]  WARN(satman): aux packet error (packet CRC failed)
[  1885.340720s]  WARN(satman): received unexpected aux packet
[  1991.075698s]  WARN(satman): aux packet error (protocol error: unknown packet 0xe2)
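
For anyone comparing satellite logs, it can help to tally the warnings by kind. The following is a minimal Python sketch, not part of ARTIQ: the names `classify` and `tally_warnings` and the category labels are made up for illustration, and it assumes the UART line format shown above. The sample input is the eight warning lines from this post. As a tentative reading, CRC failures and unknown packet IDs would point at corrupted bytes, whereas "received unexpected aux packet" looks like a well-formed packet arriving when none was expected.

```python
import re
from collections import Counter

# Matches the message part of a satman warning line (format assumed from this thread).
WARN_RE = re.compile(r"WARN\(satman\): (?P<msg>.*)")


def classify(msg):
    """Map a satman warning message to a coarse, made-up category label."""
    if msg.startswith("aux packet error (packet CRC failed"):
        return "crc_failed"
    if msg.startswith("aux packet error (protocol error: unknown packet"):
        return "unknown_packet"
    if msg.startswith("received unexpected aux packet"):
        return "unexpected_packet"
    return "other"


def tally_warnings(log_text):
    """Count satman warnings per category in a block of UART log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARN_RE.search(line)
        if match:
            counts[classify(match.group("msg"))] += 1
    return counts


# Sample input: the satellite warnings quoted in this post.
SAMPLE = """\
[   967.304396s]  WARN(satman): aux packet error (protocol error: unknown packet 0x9a)
[  1204.295618s]  WARN(satman): aux packet error (packet CRC failed)
[  1297.778924s]  WARN(satman): received unexpected aux packet
[  1375.372610s]  WARN(satman): received unexpected aux packet
[  1581.811100s]  WARN(satman): received unexpected aux packet
[  1699.203397s]  WARN(satman): aux packet error (packet CRC failed)
[  1885.340720s]  WARN(satman): received unexpected aux packet
[  1991.075698s]  WARN(satman): aux packet error (protocol error: unknown packet 0xe2)
"""

counts = tally_warnings(SAMPLE)
for kind, n in counts.most_common():
    print(f"{kind}: {n}")
```

On this sample, the tally is four unexpected packets, two CRC failures and two unknown packet IDs.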

choelzl · Nov 20, 2024

Did any of you get this fixed?
Any input on how to fix or debug it?

choelzl · Nov 21, 2024

After a while, I get the following error on the Kasli 2.0 satellite:

panic at /nix/store/1n2jcd6aqajnh7236w6cvqfk1chhhvw5-python3.11-artiq-8.8973+80ae6f5/lib/python3.11/site-packages/artiq/firmware/libproto_artiq/drtioaux_proto.rs:291:40: range end index 65535 out of range for slice of length 1010

and the error LED turns on. The master Kasli then starts to spam

ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)

every 409 ms. Is there any monitoring or anything else polling at this interval?
The DRTIO link itself seems to work; at least, we have not yet found a signal that is not replayed by the satellite in this state.
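
One way to check the spacing of these timeouts is to pull the timestamps out of the master log and take the differences. The following is a minimal Python sketch, not part of ARTIQ: the helper names `timeout_times` and `intervals` are made up for illustration, and it assumes the log line format shown in this thread. The sample input is the four timeout lines from the master trace in the original post.

```python
import re
from collections import defaultdict

# Matches DRTIO timeout lines as they appear in the master log in this thread.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<t>\d+\.\d+)s\]\s+ERROR\(runtime::rtio_mgt::drtio\): "
    r"\[DEST#(?P<dest>\d+)\] communication failed \(timed out\)"
)


def timeout_times(log_text):
    """Return {destination number: [timestamps in seconds]} for timeout errors."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            times[int(match.group("dest"))].append(float(match.group("t")))
    return dict(times)


def intervals(times):
    """Differences between consecutive timestamps, rounded to microseconds."""
    return [round(b - a, 6) for a, b in zip(times, times[1:])]


# Sample input: the timeout lines from the master trace in the original post.
SAMPLE = """\
[  2791.124464s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[  2791.332460s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
[  2791.741451s] ERROR(runtime::rtio_mgt::drtio): [DEST#1] communication failed (timed out)
[  2791.948455s] ERROR(runtime::rtio_mgt::drtio): [DEST#2] communication failed (timed out)
"""

per_dest = timeout_times(SAMPLE)
all_times = sorted(t for ts in per_dest.values() for t in ts)

for dest, ts in sorted(per_dest.items()):
    print(f"DEST#{dest} gaps (s): {intervals(ts)}")
print(f"gaps between any two consecutive timeouts (s): {intervals(all_times)}")
```

On that short sample, consecutive timeouts are roughly 0.21 s, 0.41 s and 0.21 s apart, and each destination repeats after about 0.62 s. A longer capture would be needed to tell whether a fixed period such as 409 ms really holds.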

evilmav · Nov 25, 2024

Someone has crossposted the issue to the M-Labs Gitea, but there has been no activity for 3 months: https://git.m-labs.hk/M-Labs/artiq-zynq/issues/322

sb10q · Nov 26, 2024

Fixing this sort of bug isn't easy for anyone.

choelzl · Nov 26, 2024

I fully agree.
I see there is a pull request in the git repository; I will try it out as soon as I have time.

srenblad · Nov 26, 2024

choelzl With the patch applied, I was unable to reproduce the error with the code from the original post. That being said, ping us or reach out to the helpdesk if you still run into problems.

choelzl · Nov 27, 2024

All in all there are fewer errors, but we still get some. The recurring errors (every 409 ms) from the moninj driver are also still there. I am starting to suspect a hardware fault in the SoC. We had sent it to QUARTIQ because of similar issues before, and have now got it back with a note that it is fixed. Probably it is not...

evilmav · Nov 28, 2024

srenblad

We've updated the gateware to the current artiq-zynq release-8 state (incorporating the "drtio: add InjectionRequest to expects_response" commit), and the issue is still present with the example code...

(If I could make a Christmas wish: any chance the coremgmt satellite support could be backported to release-8? This would immensely help testing this stuff =) )

evilmav · Dec 18, 2024

The issue is solved after https://git.m-labs.hk/M-Labs/artiq-zynq/pulls/345

choelzl · 9 Jan

Also fixed for us, perfect, thank you!