Morning, we are trying to run long-lifetime measurements that loop over a sequence of TTL pulses lasting a few hundred ms, then insert a delay of between 10 and 100 s before taking images with another set of TTL pulses.

The delay loop runs inside the kernel function, but once the kernel has been running for 6-8 minutes it crashes with an "RTIO Destination Unreachable" error. Because we were unsure whether simply adding a 100 s delay would cause a problem, we also tried an infinite loop that toggles a TTL every 100 ms. It fails with a similar error, again after 6 to 8 minutes.

Are there details anywhere on the cause of the unreachable error? We are trying to determine whether this is a network issue or something else. Apologies if we have missed something in the docs.

We are running the stable ARTIQ 6 release for both gateware and software, and using a Kasli v1.0 to drive TTLs on DIO_MCX output boards.

Likely the DRTIO link is going down intermittently. You should see corresponding messages in the master and satellite logs. How are you connecting the Kaslis together?

    sb10q Thanks Sebastian, that seems to be exactly the problem. The logs are full of "aux packet error (link went down)".

    The Kasli crates are configured as one master and one satellite, connected using https://www.fs.com/products/20184.html, https://www.fs.com/products/40436.html and https://www.fs.com/uk/products/29895.html (1000BASE-BX BiDi SFP 1490/1310 TX+RX modules and a 5 m single-mode fiber).

    These were based on recommendations from https://github.com/sinara-hw/meta/wiki/SFP

    I can try cleaning the fiber connectors. Is there another recommended way to resolve this?

    Those transceivers you are using are rated for 1.25Gbps maximum. DRTIO is 2.5Gbps or more.
    If you want to avoid this sort of issue, you may buy a complete DRTIO system from us instead of assembling it yourself. We would include suitable transceivers. I wasn't even aware of that wiki page...
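The rate mismatch described in this reply can be checked before ordering parts. Below is a minimal sketch in plain Python that compares a transceiver's rated maximum against the line rate the link needs. The 2.5 Gbps floor and the 1.25 Gbps rating are the figures quoted in this thread; the function name, the dictionary labels and the 10 Gbps example module are hypothetical and only serve as illustration.

```python
# Minimal sketch: does a transceiver's rating cover the link's line rate?
# The 2.5 Gbps floor is the figure given in this thread for DRTIO.
DRTIO_MIN_GBPS = 2.5


def supports_link(rated_max_gbps, required_gbps=DRTIO_MIN_GBPS):
    """Return True if the rated maximum covers the required line rate."""
    return rated_max_gbps >= required_gbps


# Labels are illustrative, not official part names.
transceivers = {
    "1000BASE-BX BiDi SFP": 1.25,       # rating quoted in this thread
    "example 10G SFP+ module": 10.0,    # hypothetical replacement
}

for name, rate in transceivers.items():
    verdict = "ok" if supports_link(rate) else "too slow"
    print(f"{name}: rated {rate} Gbps -> {verdict}")
```

A rating check like this only rules out parts that are clearly too slow. It says nothing about whether a given module is otherwise compatible with the Kasli SFP cages.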

    Also with just 5m of fiber the connectors would have to be really dirty to cause problems 🙂

    Thanks Sebastian for the quick response and for debugging this for us. I do appreciate that buying directly from you avoids these problems, but it doesn't seem unreasonable to expect the hardware to be reliably documented. I will update the wiki with replacement transceivers.

    20 days later

    Follow-up post: based on what is tried and tested at Oxford, I got https://www.fs.com/uk/products/36353.html and https://www.fs.com/uk/products/36351.html, which are rated to 10 Gbps. This has made a big improvement, but we are still seeing relatively frequent "aux packet error" messages.

    Is there any explicit initialisation/sync code that needs including in experiments to reduce the likelihood of these, given we have removed the earlier hardware limitation?

      jdp Is there any explicit initialisation/sync code that needs including in experiments

      No, aux packet errors should be independent of whatever the kernel is doing.
      I don't know what is wrong with your system.

        6 days later

        sb10q I have a stupid question - how do I read the logs on the satellite?

        I can get the master log using the artiq_coremgmt tool, and I can also monitor the UART port on the satellite. Is there an equivalent of artiq_coremgmt for reading the satellite logs, without taking a performance hit from raising the UART log level on the satellite or leaving it monitored all the time?

        Update: We have been monitoring the serial port on the satellite, and each time the master fails with aux packet errors, this coincides with the satellite Kasli restarting, with no warning or error message printed on the port beforehand. Do you have any ideas what might cause a satellite to restart?
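One way to make the correlation in this update checkable is to timestamp both captures on the host and compare them. The sketch below is plain Python and does not use any ARTIQ API. It assumes each captured line starts with an ISO 8601 timestamp added by the capturing host, and it uses placeholder message text for the satellite boot banner. The log format, the function names and the 5 second window are all assumptions for illustration, not real ARTIQ output.

```python
from datetime import datetime, timedelta


def parse_times(lines, marker):
    """Return timestamps of the lines that contain `marker`.

    Assumes a hypothetical format: ISO 8601 timestamp, a space, then the
    message, as a host-side capture script might write it.
    """
    times = []
    for line in lines:
        if marker in line:
            stamp, _, _ = line.partition(" ")
            times.append(datetime.fromisoformat(stamp))
    return times


def unmatched_errors(error_times, restart_times, window=timedelta(seconds=5)):
    """Return error timestamps with no restart within `window` on either side."""
    return [
        e for e in error_times
        if not any(abs(e - r) <= window for r in restart_times)
    ]


# Made-up sample data, not taken from a real system.
master_log = [
    "2021-03-01T10:06:41 ERROR aux packet error (link went down)",
    "2021-03-01T10:14:02 ERROR aux packet error (link went down)",
]
satellite_serial = [
    "2021-03-01T10:06:42 boot banner (placeholder text)",
    "2021-03-01T10:20:00 boot banner (placeholder text)",
]

errors = parse_times(master_log, "aux packet error")
restarts = parse_times(satellite_serial, "boot banner")
print(unmatched_errors(errors, restarts))
```

If the returned list is empty over a long run, every link error lines up with a satellite restart, which points at the satellite resetting rather than at the fiber link alone. Any entries that remain are link drops with no matching restart and would need a separate explanation.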

        17 days later

        sb10q Are you able to give any suggestions for what might cause the satellite to reboot without an error?

        12 days later

        No, I have never seen such a problem.