The problem:

artiq_coremgmt -s 0 log succeeds, but artiq_coremgmt -s 1 log fails with OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent) and the following lines appear in the master's UART log:

[    72.712499s]  INFO(runtime::mgmt): received connection
[    72.726959s] ERROR(runtime::mgmt::remote_coremgmt): aux packet error (link down)
[    72.734340s]  WARN(runtime::mgmt): connection terminated: DrtioError(LinkDown)

Moreover, neither the master's nor the satellite's log contains any mention of attempting to establish a link with each other.

For a full picture of what's going on, see the full UART logs further below.

This behavior is 100% reproducible no matter:

  • which fiber/cable/transceivers we use to connect SFP0 (master) to SFP0 (satellite),
  • which power supply we use for either the master or the satellite,
  • which SD card we use for either the master or the satellite,
  • which of the 2 Kasli SoC boards is master and which is satellite.

Edit from later

We transferred the SD cards from these two Kasli SoCs to two other Kasli SoCs and there the DRTIO links were established automatically and artiq_coremgmt -s 1 log succeeds. Could the former two Kasli SoCs have a hardware problem?

Our system:

We have two Kasli SoC v1.1.1 boards flashed with artiq-zynq commit 63157588bb8ed214bc9f24ade2d09b4ddefada03 from 2025_01_15:

  • master with 12x DIO_SMA peripherals
    (We have tested both with and without actual peripherals plugged into its EEM sockets.)
  • satellite with 2x Fastino peripherals
    (We have tested both with and without actual peripherals plugged into its EEM sockets.)
  • SFP0 (master) is connected to SFP0 (satellite).
  • We communicate with the master via Ethernet using artiq commit d1177836f2ab7248befecd5403041b2ca2e36746 from 2025_02_07.

Full UART log of master

Contains responses to 1 successful artiq_coremgmt -s 0 log followed by 3 failed artiq_coremgmt -s 1 log:

[     0.019996s]  INFO(szl): Simple Zynq Loader starting...
[     0.025201s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 2000000000 Hz
[     0.007041s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz
[     0.016260s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1066666666 Hz
[     0.023531s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 533333328/355555552
[     0.030752s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10062892 Hz (divisors=2*53)
[     0.042001s] DEBUG(libboard_zynq::sdio): Reset SDIO!
[     0.046948s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 400000
[     0.053896s]  INFO(szl): Card inserted. Mounting file system.
[     0.073372s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 25000000
[     0.080427s] DEBUG(libboard_zynq::sdio::sd_card): Getting bus width
[     0.086942s] DEBUG(libboard_zynq::sdio::sd_card): 4 bit support
[     0.092839s] DEBUG(libboard_zynq::sdio::sd_card): Changing bus width
[     0.101001s] DEBUG(libboard_zynq::sdio): Set block size to 512
[     0.107352s] DEBUG(libconfig::sd_reader): Partition ID: C
[     0.113805s]  INFO(szl): Loading gateware
[     0.118308s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[     0.125112s] DEBUG(libconfig::bootgen): Unencrypted length = EB1B0
[     0.131274s] DEBUG(libconfig::bootgen): Partition start address: B360
[     0.562538s] DEBUG(libboard_zynq::devc): Invalidate DCache for bitstream buffer
[     0.576735s] DEBUG(libboard_zynq::devc): Init preload FPGA
[     0.582200s] DEBUG(libboard_zynq::devc): Toggling PROG_B
[     0.617489s] DEBUG(libboard_zynq::devc): Waiting for done
[     0.622867s] DEBUG(libboard_zynq::devc): Init postload FPGA
[     0.628343s]  INFO(szl): Loading runtime
[     0.632546s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[     0.639350s] DEBUG(libconfig::bootgen): Unencrypted length = AD9D
[     0.645427s] DEBUG(libconfig::bootgen): Unencrypted length = 43A2C
[     0.651588s] DEBUG(libconfig::bootgen): Partition start address: F6510
[     0.782958s]  INFO(szl): Preparing for runtime execution
[     0.788597s]  INFO(szl): executing payload
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005248s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020340s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.170589s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.179442s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.188643s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.579377s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     5.830667s]  INFO(libboard_artiq::si5324):   ...locked
[     5.885989s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     5.897953s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     5.933535s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-90-ef IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:90ef IPv6: no configured address
[     5.960253s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     5.970504s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     5.982697s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     5.990587s]  WARN(ksupport): error reading device map (Configuration key `device_map` not found), device names will not be available in RTIO error messages
[     6.006015s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[     6.014044s]  INFO(runtime::comms): no idle kernel found
[    10.505082s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    15.243863s]  INFO(runtime::mgmt): received connection
[    19.328225s]  INFO(runtime::mgmt): received connection
[    19.343072s] ERROR(runtime::mgmt::remote_coremgmt): aux packet error (link down)
[    19.350457s]  WARN(runtime::mgmt): connection terminated: DrtioError(LinkDown)
[    20.514414s]  INFO(runtime::mgmt): received connection
[    20.529064s] ERROR(runtime::mgmt::remote_coremgmt): aux packet error (link down)
[    20.536370s]  WARN(runtime::mgmt): connection terminated: DrtioError(LinkDown)
[    21.500834s]  INFO(runtime::mgmt): received connection
[    21.516064s] ERROR(runtime::mgmt::remote_coremgmt): aux packet error (link down)
[    21.523442s]  WARN(runtime::mgmt): connection terminated: DrtioError(LinkDown)

Full UART log of satellite

The following is simultaneous in time to the above log from the master:

[     0.019990s]  INFO(szl): Simple Zynq Loader starting...
[     0.025196s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 2000000000 Hz
[     0.007039s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz
[     0.016260s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1066666666 Hz
[     0.023609s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 533333328/355555552
[     0.030769s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10062892 Hz (divisors=2*53)
[     0.042000s] DEBUG(libboard_zynq::sdio): Reset SDIO!
[     0.046947s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 400000
[     0.053895s]  INFO(szl): Card inserted. Mounting file system.
[     0.073371s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 25000000
[     0.080504s] DEBUG(libboard_zynq::sdio::sd_card): Getting bus width
[     0.086958s] DEBUG(libboard_zynq::sdio::sd_card): 4 bit support
[     0.092856s] DEBUG(libboard_zynq::sdio::sd_card): Changing bus width
[     0.101001s] DEBUG(libboard_zynq::sdio): Set block size to 512
[     0.107346s] DEBUG(libconfig::sd_reader): Partition ID: C
[     0.113583s]  INFO(szl): Loading gateware
[     0.117863s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[     0.124658s] DEBUG(libconfig::bootgen): Unencrypted length = E7C59
[     0.130822s] DEBUG(libconfig::bootgen): Partition start address: B360
[     0.554995s] DEBUG(libboard_zynq::devc): Invalidate DCache for bitstream buffer
[     0.569170s] DEBUG(libboard_zynq::devc): Init preload FPGA
[     0.574639s] DEBUG(libboard_zynq::devc): Toggling PROG_B
[     0.609521s] DEBUG(libboard_zynq::devc): Waiting for done
[     0.614898s] DEBUG(libboard_zynq::devc): Init postload FPGA
[     0.620452s]  INFO(szl): Loading runtime
[     0.624645s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[     0.631439s] DEBUG(libconfig::bootgen): Unencrypted length = AD9D
[     0.637510s] DEBUG(libconfig::bootgen): Unencrypted length = 2F7A8
[     0.643671s] DEBUG(libconfig::bootgen): Partition start address: F2FC0
[     0.739063s]  INFO(szl): Preparing for runtime execution
[     0.744701s]  INFO(szl): executing payload
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_ao_kasli_soc_v1.1_artiq_zynq_master_6315758
[     0.020873s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534652s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.465227s]  INFO(libboard_artiq::si5324):   ...locked
[     2.570436s]  INFO(satman): Switching SYS clocks...
[     2.625298s]  INFO(satman): SYS CLK switched successfully
[     2.650379s]  INFO(satman): SED spreading disabled by default

Ok, I must be doing something wrong:

  • We used to run on 3 Sinara crates, each with a Kasli v2.0.1.
  • I put the original Kaslis back into their respective crates. I have not flashed them or done anything else to them. They are in exactly the state that used to work without problem. This is confirmed by the gateware identifications in the UART logs below.
  • Now, everything works on the master but the two satellites are not seen, i.e. I always get artiq.coredevice.exceptions.RTIODestinationUnreachable: RTIO destination unreachable, output, at 150084880248 mu, channel 65536 and artiq.coredevice.exceptions.RTIODestinationUnreachable: RTIO destination unreachable, output, at 10621998936 mu, channel 131072.
  • Even worse, the UART logs of all 3 Kaslis do not mention any attempt at establishing a DRTIO link among them. Simply nothing. See the UART logs at the very bottom.

device_db.py and code

Photos of Kaslis

Master:

{kasli-master.jpg}

Satellite 1:

{kasli-satellite-1.jpg}

Satellite 2:

{kasli-satellite-2.jpg}

Error messages

Trying to run smth on satellite 1:

$ artiq_run do_stuff.py                                                                                                                                                                                                                                                 [20:28:03]
INFO: DIO init() complete
WARNING:artiq.coredevice.comm_kernel:sequence error(s) reported during kernel execution
Core Device Traceback:
Traceback (most recent call first):
  File "ksupport/rtio.rs", line 74, column 14, in (Rust function)
    <unknown>
                 ^
  File "do_stuff.py", line 119, in artiq_run_do_stuff.TestSatellites.run(..., ...) (RA=+0x1dc)
    self.init()
  File "<artiq>/coredevice/fastino.py", line 94, in ... artiq.coredevice.fastino.Fastino.write<artiq.coredevice.fastino.Fastino>(...) (RA=+0x798)
    rtio_output(self.channel | addr, data)
  File "<artiq>/coredevice/fastino.py", line 208, in ... artiq.coredevice.fastino.Fastino.set_cfg<artiq.coredevice.fastino.Fastino>(...) (inlined)
    self.write(0x22, (reset << 0) | (afe_power_down << 1) |
  File "<artiq>/coredevice/fastino.py", line 72, in ... artiq.coredevice.fastino.Fastino.init<artiq.coredevice.fastino.Fastino>(...) (inlined)
    self.set_cfg(reset=0, afe_power_down=0, dac_clr=0, clr_err=1)
  File "do_stuff.py", line 67, in ... artiq_run_do_stuff.TestSatellites.init<artiq_run_do_stuff.TestSatellites>(...) (inlined)
    fastino.init()
artiq.coredevice.exceptions.RTIODestinationUnreachable(0): RTIO destination unreachable, output, at 150084880248 mu, channel 65536

End of Core Device Traceback

Traceback (most recent call last):
  File "/home/m15/.pyenv/versions/3.12.9/bin/artiq_run", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/m15/workspace/artiq/artiq/frontend/artiq_run.py", line 224, in main
    return run(with_file=True)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/m15/workspace/artiq/artiq/frontend/artiq_run.py", line 210, in run
    raise exn
  File "/home/m15/workspace/artiq/artiq/frontend/artiq_run.py", line 203, in run
    exp_inst.run()
  File "/home/m15/workspace/artiq/artiq/language/core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/m15/workspace/artiq/artiq/coredevice/core.py", line 143, in run
    self._run_compiled(kernel_library, embedding_map, symbolizer, demangler)
  File "/home/m15/workspace/artiq/artiq/coredevice/core.py", line 133, in _run_compiled
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "/home/m15/workspace/artiq/artiq/coredevice/comm_kernel.py", line 718, in serve
    self._serve_exception(embedding_map, symbolizer, demangler)
  File "/home/m15/workspace/artiq/artiq/coredevice/comm_kernel.py", line 700, in _serve_exception
    raise python_exn
artiq.coredevice.exceptions.RTIODestinationUnreachable: RTIO destination unreachable, output, at 150084880248 mu, channel 65536

Trying to run smth on satellite 2:

$ artiq_run do_stuff.py                                                                                                                                                                                                                                                 [20:32:20]
INFO: DIO init() complete
INFO: Fastino init() complete
WARNING:artiq.coredevice.comm_kernel:sequence error(s) reported during kernel execution
Core Device Traceback:
Traceback (most recent call first):
  File "ksupport/rtio.rs", line 74, column 14, in (Rust function)
    <unknown>
                 ^
  File "do_stuff.py", line 119, in artiq_run_do_stuff.TestSatellites.run(..., ...) (RA=+0x1dc)
    self.init()
  File "do_stuff.py", line 77, in ... artiq_run_do_stuff.TestSatellites.init<artiq_run_do_stuff.TestSatellites>(...) (RA=+0x824)
    urukul_cpld.init()
  File "<artiq>/coredevice/spi2.py", line 169, in ... artiq.coredevice.spi2.SPIMaster.set_config_mu<artiq.coredevice.spi2.SPIMaster>(...) (RA=+0xc68)
    rtio_output((self.channel << 8) | SPI_CONFIG_ADDR, flags |
  File "<artiq>/coredevice/urukul.py", line 229, in ... artiq.coredevice.urukul.CPLD.sta_read<artiq.coredevice.urukul.CPLD>(...) (inlined)
    self.bus.set_config_mu(SPI_CONFIG | spi.SPI_END | spi.SPI_INPUT, 24,
  File "<artiq>/coredevice/urukul.py", line 250, in ... artiq.coredevice.urukul.CPLD.init<artiq.coredevice.urukul.CPLD>(...) (inlined)
    proto_rev = urukul_sta_proto_rev(self.sta_read())
artiq.coredevice.exceptions.RTIODestinationUnreachable(0): RTIO destination unreachable, output, at 10621998936 mu, channel 131072

End of Core Device Traceback

Traceback (most recent call last):
  File "/home/m15/.pyenv/versions/3.12.9/bin/artiq_run", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/m15/workspace/artiq/artiq/frontend/artiq_run.py", line 224, in main
    return run(with_file=True)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/m15/workspace/artiq/artiq/frontend/artiq_run.py", line 210, in run
    raise exn
  File "/home/m15/workspace/artiq/artiq/frontend/artiq_run.py", line 203, in run
    exp_inst.run()
  File "/home/m15/workspace/artiq/artiq/language/core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/m15/workspace/artiq/artiq/coredevice/core.py", line 143, in run
    self._run_compiled(kernel_library, embedding_map, symbolizer, demangler)
  File "/home/m15/workspace/artiq/artiq/coredevice/core.py", line 133, in _run_compiled
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "/home/m15/workspace/artiq/artiq/coredevice/comm_kernel.py", line 718, in serve
    self._serve_exception(embedding_map, symbolizer, demangler)
  File "/home/m15/workspace/artiq/artiq/coredevice/comm_kernel.py", line 700, in _serve_exception
    raise python_exn
artiq.coredevice.exceptions.RTIODestinationUnreachable: RTIO destination unreachable, output, at 10621998936 mu, channel 131072

Complete UART logs

Master log, including one execution of the experiment:

MiSoC Bootloader
Copyright (c) 2017-2022 M-Labs Limited

Bootloader CRC passed
Gateware ident 7.8179.928ca50;mp01-dio-master
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111000000000000
Module 0:
00000000001111111111000000000000
Read leveling: 15+-4 14+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000015s]  INFO(runtime): ARTIQ runtime starting...
[     0.003939s]  INFO(runtime): software ident 7.8179.928ca50;mp01-dio-master
[     0.010825s]  INFO(runtime): gateware ident 7.8179.928ca50;mp01-dio-master
[     0.017880s]  INFO(runtime): log level set to INFO by default
[     0.023502s]  INFO(runtime): UART log level set to INFO by default
[     0.161111s]  INFO(runtime::rtio_clocking): using 10MHz reference to make 125MHz RTIO clock with PLL
[     0.424490s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     2.577604s]  INFO(board_artiq::si5324):   ...locked
[     2.608508s]  INFO(runtime): network addresses: MAC=e8-eb-1b-45-dd-76 IPv4=192.168.1.30 IPv6-LL=fe80::eaeb:1bff:fe45:dd76 IPv6=no configured address
[     2.622358s]  INFO(board_artiq::drtio_routing): could not read routing table from configuration, using default
[     2.631074s]  INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; }
[     2.644694s]  INFO(runtime::mgmt): management interface active
[     2.656751s]  INFO(runtime::session): accepting network sessions
[     2.669824s]  INFO(runtime::session): running startup kernel
[     2.674440s]  INFO(runtime::session): no startup kernel found
[     2.680107s]  INFO(runtime::session): no connection, starting idle kernel
[     2.687070s]  INFO(runtime::session): no idle kernel found
[     2.692353s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[     8.970754s]  INFO(runtime::mgmt): new connection from 192.168.1.1:43030
[    22.601428s]  INFO(runtime::session): new connection from 192.168.1.1:48484
[    22.667662s]  INFO(runtime::kern_hwreq): resetting RTIO
[    22.839679s] ERROR(runtime::rtio_mgt): RTIO sequence error involving channel 9
[    22.882397s]  INFO(runtime::session): no connection, starting idle kernel
[    22.888436s]  INFO(runtime::session): no idle kernel found

Satellite 1, including one execution of the experiment (which does not appear in the logs!):

MiSoC Bootloader
Copyright (c) 2017-2022 M-Labs Limited

Bootloader CRC passed
Gateware ident 7.8188.02dfb3e;mp01-analog-satellite
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000011111111110000000000000
Module 0:
00000000011111111111000000000000
Read leveling: 13+-5 14+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000003s]  INFO(satman): ARTIQ satellite manager starting...
[     0.005892s]  INFO(satman): software ident 7.8188.02dfb3e;mp01-analog-satellite
[     0.013213s]  INFO(satman): gateware ident 7.8188.02dfb3e;mp01-analog-satellite
[     0.421183s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     3.087710s]  INFO(board_artiq::si5324):   ...locked

The Kaslis and Kasli SoCs were all ok. The DRTIO links failed silently because of the SFP(+) hardware, possibly because it was not good enough in some metric that is tested internally. The DRTIO links got magically established after we tried 6 different sets of SFP(+) hardware. We need error messages for this! I created an Artiq issue: Bug report: DRTIO links fail silently; please add log messages! #2681

Related post: Which SFP speeds for different Kasli & Kasli SoC versions?

    11 days later