ARTIQ System crashing/restarting automatically

ThorstenGroh · Jun 29, 2023

For a few days now, our ARTIQ system (master + satellite acquired from M-Labs/QUARTIQ, running ARTIQ 7.8123.3038639, Kasli v2.0) has not been working reliably. Nothing was changed on the machine; it simply sat idle over a long weekend without any measurements. Now the system does not respond to any measurement scripts submitted from artiq_dashboard or artiq_run, which just fail with a TimeoutError:

root:Terminating with exception (TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond)

In this state the device no longer responds to artiq_coremgmt, so I can neither get the logs nor reboot the device remotely. Only a manual power cycle (unplugging and replugging the power brick of both satellite and master) brings the system back to life, after a wait of a few minutes.

However, the system now regularly crashes on its own without anything being done to it. Pinging the device continuously (ping 10.5.1.1 -t) shows that the system is online for a few minutes (anywhere from 3 to 45 minutes) and then crashes again. After being offline for a few minutes it boots up again. This cycle repeats on its own all day.
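For anyone who wants to quantify this up/down pattern rather than watch the ping window, here is a minimal sketch. It assumes you already collect (timestamp in seconds, reachable) samples from some polling loop of your own; the function names are hypothetical and not part of ARTIQ.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[bool, float, float]  # (reachable, start_s, end_s)


def reachability_intervals(samples: Iterable[Tuple[float, bool]]) -> List[Interval]:
    """Collapse time-ordered (timestamp_s, reachable) samples into runs of equal state."""
    intervals: List[Interval] = []
    state = None
    start = last = 0.0
    for t, up in samples:
        if state is None:
            # First sample opens the first run.
            state, start = up, t
        elif up != state:
            # State flipped: close the current run at this timestamp.
            intervals.append((state, start, t))
            state, start = up, t
        last = t
    if state is not None:
        # Close the final run at the last timestamp seen.
        intervals.append((state, start, last))
    return intervals


def downtime_fraction(intervals: List[Interval]) -> float:
    """Fraction of the observed time span during which the device was unreachable."""
    if not intervals:
        return 0.0
    total = intervals[-1][2] - intervals[0][1]
    if total <= 0:
        return 0.0
    down = sum(end - begin for up, begin, end in intervals if not up)
    return down / total
```

Feeding it one sample per ping gives the length of each uptime and downtime stretch, which makes it easier to see whether the uptime shrinks as the crate warms up.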

During the longer uptime periods the system replies to ping and I was able to talk to the device, run measurements on the TTL, DAC and DDS boards with reasonable output, and request logs via artiq_coremgmt log. The logs show that the system had only recently started up. Here is an example:

(artiq-7) PS C:\Users\quMercury\Software\qumercury> artiq_coremgmt log
[     0.000014s]  INFO(runtime): ARTIQ runtime starting...
[     0.003939s]  INFO(runtime): software ident 7.8123.3038639;bonn2master
[     0.010475s]  INFO(runtime): gateware ident 7.8123.3038639;bonn2master
[     0.017049s]  INFO(runtime): log level set to INFO by default
[     0.022776s]  INFO(runtime): UART log level set to INFO by default
[     0.139583s]  WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[     0.147687s]  WARN(runtime::rtio_clocking): si5324_ext_ref and ext_ref_frequency compile-time options are deprecated. Please use the rtio_clock coreconfig settings instead.
[     0.163095s]  INFO(runtime::rtio_clocking): using 10MHz reference to make 125MHz RTIO clock with PLL
[     0.427778s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     2.581250s]  INFO(board_artiq::si5324):   ...locked
[     2.586386s]  INFO(runtime::rtio_clocking::crg): Using internal RTIO clock
[     2.617617s]  INFO(runtime): network addresses: MAC=e8-eb-1b-46-37-7a IPv4=10.5.1.1 IPv6-LL=fe80::eaeb:1bff:fe46:377a IPv6=no configured address
[     2.632594s]  INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; }
[     2.657526s]  INFO(runtime::mgmt): management interface active
[     2.669668s]  INFO(runtime::session): accepting network sessions
[     2.696750s]  INFO(runtime::session): running startup kernel
[     2.701246s]  INFO(runtime::session): no startup kernel found
[     2.707041s]  INFO(runtime::session): no connection, starting idle kernel
[     2.713894s]  INFO(runtime::session): no idle kernel found
[     2.719274s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     7.965785s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 27 packets
[     8.051345s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[     8.057810s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[     8.064765s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[     8.071209s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[     8.278717s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] error(s) found (0x03):
[     8.284559s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] received packet of an unknown type
[     8.292742s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] received truncated packet
[   700.202730s]  INFO(runtime::mgmt): new connection from 10.5.0.1:57898
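Logs like the one above can be filtered for warnings and errors with a short script. This is only a sketch: the regular expression is derived from the line layout shown in this thread, not from any documented ARTIQ log format, and the names LogEntry, parse_log and problems are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Matches lines such as:
# [     8.278717s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] error(s) found (0x03):
LOG_LINE = re.compile(r"^\[\s*(\d+\.\d+)s\]\s+(\w+)\(([^)]*)\):\s*(.*)$")


@dataclass
class LogEntry:
    timestamp: float  # seconds since the runtime started
    level: str        # e.g. INFO, WARN, ERROR
    module: str       # e.g. runtime::rtio_mgt::drtio
    message: str


def parse_log(text: str) -> List[LogEntry]:
    """Parse `artiq_coremgmt log` output, skipping lines that do not match."""
    entries = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            entries.append(LogEntry(float(m.group(1)), m.group(2), m.group(3), m.group(4)))
    return entries


def problems(entries: List[LogEntry]) -> List[LogEntry]:
    """Keep only WARN and ERROR entries."""
    return [e for e in entries if e.level in ("WARN", "ERROR")]
```

The timestamp of the last entry is also a rough indication of how long ago the device booted, which is how the recent restart shows up in the log above.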

But more than 50% of the time the device is down, no longer reachable via ping or artiq_coremgmt, and I cannot access the hardware. The following symptoms could be observed during downtime:

No indication of any red status LED on Kasli or the other boards.
The 10 MHz reference input of the device, monitored with an RF coupler, shows a small (+0.2 dB) increase in amplitude that correlates with the system being down. We first suspected our Rb reference clock (there the amplitude increased by as much as +0.9 dB, slightly overdriving the 10 MHz ARTIQ input relative to the default +5.7 dBm level during uptime), but with an external signal generator (R&S SML02) at various input amplitudes (+3 dBm to +5.8 dBm) the signal amplitude still changes (by +0.2 dB) and the problem persists. My guess is that this is a symptom rather than the cause, and that it simply reflects the internal lock to the external 10 MHz reference not being active while the device is in the failed state. Could that be?
Maybe unrelated: If I try to run print(f"{self.core.get_rtio_destination_status(10) = }") in prepare during uptime the master fails with root:Terminating with exception (ConnectionResetError: Core device connection closed unexpectedly) and artiq_coremgmt log reports [ 1272.477421s] INFO(kernel): panic at ksupport/lib.rs:523:5: Exception(LoadFault) at PC 0x45060164, trap value 0x45061010. I think that worked before, but I don't know whether it is related.

We excluded the following potential problems:

No externally connected device is expected to cause this problem. The ARTIQ system has now been stripped down to only the network connection, the fiber link between satellite and master, the 10 MHz reference input and the power supply connection.
The power supply voltages are constant. I monitored them overnight and they stay around 12 V without any strange dropouts that could explain the problems.
Neither the network connection to the computer nor the computer itself is the problem. The behaviour is the same with a different computer, different network cables and different routers (as well as direct connections): the device is alternately reachable and unreachable.
The reference clock signal is stable in both power (except for the symptom described above) and frequency, and the same problem occurs with different reference sources.
The ground connection to the device does not exhibit strange voltage jumps. The device was moved to a separate room with different power outlets/circuit to exclude any background influences, both via the (power) cables and via radio frequency interference.

I am a little lost as to what to do.

Do you have any ideas what might be the cause for that problem?
Is there any more diagnostic that I can run via Ethernet or USB?
Can I diagnose the hardware further? Monitor voltages in the device?
Should I attempt to reflash the firmware? I guess that is not a good idea while the system is at risk of failing during the flashing process.

I would be very grateful for any help. Thanks!

rjo · Jun 29, 2023

At least the LoadFault is #1975, fixed in release-7 commit 75d75cc. Maybe there's more but let's upgrade ARTIQ and bitstream to current release-7 first.

ThorstenGroh · Jun 29, 2023

rjo
Okay, thanks. I will try that.

rjo · Jun 29, 2023

Sidenote:
For the clock input on Kasli-2 the absolute max input level is 9 dBm (i.e. damage possible above).
Recommended input level: -5 dBm.
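To put these levels into perspective, dBm values can be converted to power and voltage. The sketch below assumes a sine wave into a 50 Ω load; the impedance is an assumption made for illustration and is not stated in this thread, and the function names are hypothetical.

```python
import math


def dbm_to_mw(p_dbm: float) -> float:
    """Convert a power level in dBm to milliwatts (0 dBm = 1 mW)."""
    return 10 ** (p_dbm / 10)


def dbm_to_vpp(p_dbm: float, r_ohm: float = 50.0) -> float:
    """Peak-to-peak voltage of a sine wave of the given power into r_ohm.

    r_ohm defaults to 50 ohm, which is an assumption, not a Kasli specification.
    """
    p_watt = dbm_to_mw(p_dbm) / 1000.0
    v_rms = math.sqrt(p_watt * r_ohm)
    return 2 * math.sqrt(2) * v_rms


if __name__ == "__main__":
    # Levels mentioned in this thread: recommended, previously used, absolute maximum.
    for level in (-5.0, 5.7, 9.0):
        print(f"{level:+.1f} dBm = {dbm_to_mw(level):.2f} mW = {dbm_to_vpp(level):.2f} Vpp")
```

Under these assumptions the recommended -5 dBm is about 0.32 mW, while the 9 dBm absolute maximum is about 7.9 mW, so a +5.7 dBm drive sits only 3.3 dB below the damage threshold.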

ThorstenGroh · Jun 29, 2023

Okay, following a suggestion from Robert I checked the temperatures of the two FPGA boards via the artiq_flash -d /tmp start command, connecting to each board individually via USB (+ Zadig driver). While the satellite reports temperatures around 60 °C, the master Kasli reached temperatures of up to 120 °C, which is not good.

(artiq-7) PS C:\Users\quMercury\Software\qumercury> artiq_flash -d /tmp start
Open On-Chip Debugger 0.10.0-00013-gbb7bedad (2018-02-17-05:04)
Licensed under GNU GPL v2
For bug reports, read
        http://openocd.org/doc/doxygen/bugs.html
none separate
adapter speed: 25000 kHz
Info : ftdi: if you experience problems at higher adapter clocks, try the command "ftdi_tdo_sample_edge falling"
Info : clock speed 25000 kHz
Info : JTAG tap: xc7.tap tap/device found: 0x13631093 (mfg: 0x049 (Xilinx), part: 0x3631, ver: 0x1)
Info : gdb server disabled
TEMP 121.13 C
VCCINT 0.971 V
VCCAUX 1.781 V
VCCBRAM 0.980 V
VPVN 0.000 V
VREFP 0.000 V
VREFN 0.000 V
VCCPINT 0.000 V
VCCPAUX 0.000 V
VCCODDR 0.000 V
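The readings in this output can be extracted automatically, for example to log them periodically. The following is a sketch only: the parser relies on the "NAME value unit" layout seen above, and the 85 °C default threshold is an assumed example value, so check the FPGA datasheet for the limit that applies to your part. The function names are hypothetical.

```python
import re
from typing import Dict

# Matches reading lines such as "TEMP 121.13 C" or "VCCINT 0.971 V".
READING = re.compile(r"^(\w+)\s+(-?\d+(?:\.\d+)?)\s+(C|V)$")


def parse_readings(output: str) -> Dict[str, float]:
    """Extract temperature and voltage readings from `artiq_flash start` output."""
    readings = {}
    for line in output.splitlines():
        m = READING.match(line.strip())
        if m:
            readings[m.group(1)] = float(m.group(2))
    return readings


def is_overheating(readings: Dict[str, float], limit_c: float = 85.0) -> bool:
    """True if the TEMP reading exceeds limit_c (assumed threshold, not from ARTIQ)."""
    return readings.get("TEMP", float("-inf")) > limit_c
```

The output text could come from running the command through subprocess and capturing what it prints; which stream the readings end up on is something to verify on your own setup.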

Closer inspection of the small fan cooling the FPGA heatsink shows that it does not spin up properly even at these temperatures, while the fan in the satellite crate spins much faster in general. So either the fan or the controller might be broken. Any recommendations on how to check this? I will just order a replacement fan, which should be a good check.

For now I am assisting the cooling with an external fan, which at least keeps the temperature below 90 °C. I will keep an eye on this.

ThorstenGroh · Jun 29, 2023

rjo

I reduced the clock reference signal to -4.7 dBm as recommended. Is there a note in the documentation stating these levels? I only remember a limit of +6 dBm being noted somewhere.

ThorstenGroh · Jun 29, 2023

rjo

I updated the ARTIQ installation as well as the bitstream to version 7.8173.ff97675. This worked without problems. After rebooting the device via artiq_coremgmt reboot it is up again:

(artiq-7-8173) PS C:\Users\quMercury\Software\qumercury> artiq_coremgmt log
[     0.000016s]  INFO(runtime): ARTIQ runtime starting...
[     0.003941s]  INFO(runtime): software ident 7.8173.ff97675;bonn2master
[     0.010475s]  INFO(runtime): gateware ident 7.8173.ff97675;bonn2master
[     0.017051s]  INFO(runtime): log level set to INFO by default
[     0.022779s]  INFO(runtime): UART log level set to INFO by default
[     0.160300s]  WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[     0.168403s]  WARN(runtime::rtio_clocking): si5324_ext_ref and ext_ref_frequency compile-time options are deprecated. Please use the rtio_clock coreconfig settings instead.
[     0.183814s]  INFO(runtime::rtio_clocking): using 10MHz reference to make 125MHz RTIO clock with PLL
[     0.448489s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     2.593523s]  INFO(board_artiq::si5324):   ...locked
[     2.598657s]  INFO(runtime::rtio_clocking::crg): Using internal RTIO clock
[     2.629872s]  INFO(runtime): network addresses: MAC=e8-eb-1b-46-37-7a IPv4=10.5.1.1 IPv6-LL=fe80::eaeb:1bff:fe46:377a IPv6=no configured address
[     2.644845s]  INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; }
[     2.669886s]  INFO(runtime::mgmt): management interface active
[     2.681943s]  INFO(runtime::session): accepting network sessions
[     2.695022s]  INFO(runtime::session): running startup kernel
[     2.699495s]  INFO(runtime::session): no startup kernel found
[     2.705280s]  INFO(runtime::session): no connection, starting idle kernel
[     2.712143s]  INFO(runtime::session): no idle kernel found
[     2.717543s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[     3.129345s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     7.164792s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 20 packets
[     7.175952s]  INFO(runtime::mgmt): new connection from 10.5.0.1:62112

This, however, did not change anything about the fan performance. But with the external fan, no crashes have been observed in the last few hours. So I guess the problem originated in the fan or the fan controller failing.

Btw.: What do I have to do about the WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default. warning?

ThorstenGroh · Jul 3, 2023

So indeed the small fan cooling the FPGA package was no longer working properly. It spun up from time to time, but was mechanically stuck most of the time. When stuck, it heated up extremely fast, even when not attached to the FPGA, so it added further heat to the FPGA chip, which led to the crashes.

For now I replaced it with a much bigger Noctua NF-A4x20 fan, plugged into the same 12 V header as the previous fan and zip-tied to the next module carrier slide. This is working quite well, keeping the FPGA below 70 °C.

In the future I will probably add a small 3D-printed adapter to hold the fan on the Kasli board.

For other people having the same issues, there seems to be a lot of discussion in the Sinara GitHub issues on the reliability and potential replacement of the small 20x20 mm fan (Radian FI23) that comes preinstalled:
+5v fan output #88
Replacement fan info #102

Santhosh · Sep 12, 2023

Hello!

I have the exact same issue with my system (Bonn4).
All outputs/inputs of my system turned off at some point (without any errors, faults or notifications), and restarting the box kept them on for only a few minutes.

Sometimes new sequences couldn't be loaded, and the error was OSError: [Errno 113] No route to host
Sometimes the sequences ended abruptly with ConnectionResetError: [Errno 104] Connection reset by peer

When running $ artiq_flash start on my Ubuntu computer, I saw this (102 °C is not good, I guess):

I have two fans in my system (on the Kasli main board and on the Phasor). I opened the crate and found that neither of them is working.

Temporary fix:
I have inserted a small aluminum plate with a Noctua NF-A4x20 fan (I borrowed it from Thorsten above) blowing right onto the Kasli heatsink. This has brought the temperature down to ~77 °C. It's not ideal, but it works for now. I don't have a fan that fits into the tight spot where the Phasor is, so I am not using that card for the moment.

Can someone recommend me a more permanent solution?

Furthermore, I have attached an NTC temperature sensor to the Kasli heatsink to monitor the temperature externally. Is there a command to read the temperature of the board/chip from inside a sequence? I could then log it periodically to an external database.

Clear skies,
Santhosh

rjo · Sep 13, 2023

The fans die very rapidly if they themselves are subjected to overtemperature. That's often the case if there is no adequate convective cooling, e.g. if the bottom or the top of the crate is blocked and there is no forced cooling of the crate.
The permanent solution is to replace the fans and provide sufficient convective or forced cooling.
Contact me for replacement fans in your case.