For the past few days, our ARTIQ system (master + satellite acquired from M-Labs/QUARTIQ, running ARTIQ 7.8123.3038639, Kasli v2.0) has not been working reliably. Nothing was changed on the machine; after a long weekend without any measurements, the system no longer responds to measurement scripts submitted from artiq_dashboard or artiq_run. They just fail with a TimeoutError:
root:Terminating with exception (TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond)
In this state the device also no longer responds to artiq_coremgmt, so I cannot retrieve the logs or reboot the device remotely. Only a manual power cycle (unplugging and replugging the power brick of the satellite and master) brings the system back to life, after a few minutes of waiting.
However, the system now regularly crashes on its own without any interaction. Pinging the device continuously (ping 10.5.1.1 -t) shows that the system is online for a few minutes (anything from 3 min up to 45 min) and then crashes again. After being offline for a few minutes, it boots up again. This cycle repeats on its own all day.
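To put numbers on the up/down cycle instead of watching the ping window, something like the following could be left running on the PC. This is only a sketch: the port number 1380 is my assumption for the management port that artiq_coremgmt connects to, and the function names (is_reachable, transitions, durations, monitor) are made up for illustration.

```python
import socket
import time
from datetime import datetime

HOST = "10.5.1.1"  # core device address used in the ping command above
PORT = 1380        # assumption: management port used by artiq_coremgmt


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def transitions(samples):
    """Reduce (timestamp, state) samples to the points where the state changes."""
    changes = []
    last = None
    for stamp, state in samples:
        if state != last:
            changes.append((stamp, state))
            last = state
    return changes


def durations(changes):
    """Turn state changes into (state, seconds) intervals between consecutive changes."""
    return [
        (state, nxt - stamp)
        for (stamp, state), (nxt, _) in zip(changes, changes[1:])
    ]


def monitor(host=HOST, port=PORT, period=5.0):
    """Poll until interrupted and print a line whenever the device goes up or down."""
    last = None
    while True:
        state = is_reachable(host, port)
        if state != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {'UP' if state else 'DOWN'}")
            last = state
        time.sleep(period)
```

Calling monitor() and redirecting the output to a file would give a timestamped list of UP/DOWN events that can be compared against the reference amplitude or supply voltage recordings. A TCP probe also distinguishes "answers ping but the runtime is not accepting connections" from "completely offline", which ping alone cannot.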
During the longer uptime periods the system replies to ping, and I was able to talk to the device, execute measurements on the TTL, DAC and DDS boards with reasonable output, and request logs via artiq_coremgmt log, which show that the system had only recently started up. Here is an example:
(artiq-7) PS C:\Users\quMercury\Software\qumercury> artiq_coremgmt log
[ 0.000014s] INFO(runtime): ARTIQ runtime starting...
[ 0.003939s] INFO(runtime): software ident 7.8123.3038639;bonn2master
[ 0.010475s] INFO(runtime): gateware ident 7.8123.3038639;bonn2master
[ 0.017049s] INFO(runtime): log level set to INFO by default
[ 0.022776s] INFO(runtime): UART log level set to INFO by default
[ 0.139583s] WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[ 0.147687s] WARN(runtime::rtio_clocking): si5324_ext_ref and ext_ref_frequency compile-time options are deprecated. Please use the rtio_clock coreconfig settings instead.
[ 0.163095s] INFO(runtime::rtio_clocking): using 10MHz reference to make 125MHz RTIO clock with PLL
[ 0.427778s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 2.581250s] INFO(board_artiq::si5324): ...locked
[ 2.586386s] INFO(runtime::rtio_clocking::crg): Using internal RTIO clock
[ 2.617617s] INFO(runtime): network addresses: MAC=e8-eb-1b-46-37-7a IPv4=10.5.1.1 IPv6-LL=fe80::eaeb:1bff:fe46:377a IPv6=no configured address
[ 2.632594s] INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; }
[ 2.657526s] INFO(runtime::mgmt): management interface active
[ 2.669668s] INFO(runtime::session): accepting network sessions
[ 2.696750s] INFO(runtime::session): running startup kernel
[ 2.701246s] INFO(runtime::session): no startup kernel found
[ 2.707041s] INFO(runtime::session): no connection, starting idle kernel
[ 2.713894s] INFO(runtime::session): no idle kernel found
[ 2.719274s] INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[ 7.965785s] INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 27 packets
[ 8.051345s] INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[ 8.057810s] INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[ 8.064765s] INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[ 8.071209s] INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[ 8.278717s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] error(s) found (0x03):
[ 8.284559s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] received packet of an unknown type
[ 8.292742s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] received truncated packet
[ 700.202730s] INFO(runtime::mgmt): new connection from 10.5.0.1:57898
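Because the log can only be fetched during the short uptime windows, it may help to save every dump and filter it automatically. A minimal sketch, assuming every log line follows the "[ time s] LEVEL(module): message" layout seen above; the helper names parse_log and problems are hypothetical.

```python
import re

# Matches lines such as
# "[ 8.278717s] ERROR(runtime::rtio_mgt::drtio): [LINK#0] error(s) found (0x03):"
LINE_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)s\]\s+"   # seconds since boot
    r"(?P<level>[A-Z]+)"                # INFO, WARN, ERROR, ...
    r"\((?P<module>[^)]*)\):\s?"        # emitting module
    r"(?P<msg>.*)$"                     # free-form message
)


def parse_log(text):
    """Parse log text into a list of dicts; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append({
                "time": float(match["time"]),
                "level": match["level"],
                "module": match["module"],
                "msg": match["msg"],
            })
    return entries


def problems(entries, levels=("WARN", "ERROR")):
    """Return only the entries whose level is in the given set."""
    return [entry for entry in entries if entry["level"] in levels]
```

The text could come from a saved file or from subprocess.run(["artiq_coremgmt", "log"], capture_output=True, text=True).stdout. The timestamp of the last entry also gives the time since the last boot, which can be compared across dumps to see how long each uptime window lasted.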
More than 50% of the time, however, the device is down: it is no longer reachable via ping or artiq_coremgmt, and I cannot access the hardware. The following symptoms could be observed during downtime:
- No indication of any red status LED on Kasli or the other boards.
- The 10 MHz reference input of the device, monitored with an RF coupler, shows a small (+0.2 dB) increase in amplitude that correlates with the system being down. We first suspected our Rb reference clock (there the amplitude increased by as much as +0.9 dB, slightly overdriving the ARTIQ 10 MHz input relative to the default +5.7 dBm level during uptime). However, with an external signal generator (R&S SML02) at various input amplitudes (+3 dBm to +5.8 dBm) the signal amplitude still changes (by +0.2 dB) and the problem persists. My guess is that this is only a symptom and not the cause: the internal lock to the external 10 MHz reference is probably not active while the device is in the failed state. Could that be?
- Maybe unrelated: if I try to run print(f"{self.core.get_rtio_destination_status(10) = }") in prepare during uptime, the master fails with root:Terminating with exception (ConnectionResetError: Core device connection closed unexpectedly) and artiq_coremgmt log reports [ 1272.477421s] INFO(kernel): panic at ksupport/lib.rs:523:5: Exception(LoadFault) at PC 0x45060164, trap value 0x45061010. I think this worked before, but I don't know whether it is related.
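For reference, the dB figures quoted for the 10 MHz input can be converted to linear ratios and voltages as follows. This is a small sketch that assumes a 50 Ω system, which I have not verified for the reference input; the function names are only illustrative.

```python
import math


def db_to_power_ratio(db):
    """Convert a level change in dB to a linear power ratio."""
    return 10 ** (db / 10)


def db_to_voltage_ratio(db):
    """Convert a level change in dB to a linear voltage (amplitude) ratio."""
    return 10 ** (db / 20)


def dbm_to_vpp(dbm, impedance=50.0):
    """Peak-to-peak voltage of a sine wave of the given power in dBm.

    The 50 ohm default is an assumption about the input impedance.
    """
    power_w = 1e-3 * 10 ** (dbm / 10)         # dBm is referenced to 1 mW
    v_rms = math.sqrt(power_w * impedance)    # P = V_rms^2 / R
    return 2 * math.sqrt(2) * v_rms           # sine wave: V_pp = 2*sqrt(2)*V_rms


for step in (0.2, 0.9):
    print(f"+{step} dB: power x{db_to_power_ratio(step):.3f}, "
          f"voltage x{db_to_voltage_ratio(step):.3f}")
print(f"+5.7 dBm into 50 ohm: {dbm_to_vpp(5.7):.2f} V peak-to-peak")
```

A +0.2 dB step is roughly a 5% increase in power, so the change seen on the coupler is small compared with the +3 dBm to +5.8 dBm range tried with the signal generator.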
We have ruled out the following potential causes:
- No externally connected device should be causing the problem: the ARTIQ system has been stripped down to only the network connection, the fiber link between satellite and master, the 10 MHz reference input and the power supply connection.
- The power supply voltages are constant. I monitored them overnight; they stay around 12 V without any dropouts that could explain the problem.
- Neither the network connection to the computer nor the computer itself is the problem. The behaviour is reproduced with a different computer, different network cables and routers (as well as direct connections): the device is alternately reachable and unreachable.
- The reference clock signal is stable in both power (apart from the symptom described above) and frequency; the same problem occurs with different reference sources.
- The ground connection of the device does not show unusual voltage jumps. The device was moved to a separate room on a different power outlet/circuit to exclude outside influences, both conducted via the (power) cables and radiated as radio frequency interference.
I am a little lost as to what to do.
- Do you have any idea what might be causing this problem?
- Are there any further diagnostics that I can run via Ethernet or USB?
- Can I diagnose the hardware further? Monitor voltages in the device?
- Should I attempt to reflash the firmware? I suspect that is not a good idea while the system is at risk of failing during the flashing process.
I would be very grateful for any help. Thanks!