One of the three Kasli crates in our experiments has started to crash and become unresponsive, and we need help diagnosing the cause.
The crash occurs with an experimental sequence that writes to DMA roughly once per minute and plays back the DMA sequence once per second on an AD9910 Urukul DDS. Typically we run one of these experiments for an hour, process the data for a few minutes, and then submit a new iteration of the same experiment. Until the crashes started, we could run this sequence for up to eight hours without any issues. The crashes started a week ago, and they occur once the sequence has been running for approximately two hours. No significant software or hardware changes were made to the system in that period.
To diagnose the issue, we started logging the UART output with PuTTY. Below are what I think are the relevant excerpts from the log files. The error message was identical the first two times it was logged, but the most recent occurrence reported a different error.
Here is the error that occurred multiple times:
`[ 20863.851353s] INFO(runtime::session): new connection from 192.168.1.44:56007
[ 20863.927180s] INFO(runtime::kern_hwreq): resetting RTIO
[ 20941.497216s] INFO(runtime::session): no connection, starting idle kernel
[ 20941.503100s] INFO(runtime::session): no idle kernel found
[ 20942.332304s] INFO(runtime::session): new connection from 192.168.1.44:58218
[ 20942.408708s] INFO(runtime::kern_hwreq): resetting RTIO
@ 0x40032a7c
+0000: 44004800 bc030000 10000004 15000000
+0010: 07ffb6f0 846101c0 846102a0 bc030006
+0020: 1000009d 15000000 bc230004 1000004d
+0030: 15000000 8c6102a4 bc230003 10000049
@ 0x40167f3c
+0000: 9ca001d1 d4012810 9ca00069 d401280c
+0010: 9ca0001d 00000001 00000001 401542f0
+0020: 00000001 1860ffff a863bc08 e0632000
+0030: d4011800 1860ffff a863bc3c e0632000
panic at runtime/main.rs:290:13: exception Alignment at PC 0x40032a7c, EA 0x40167f3c
backtrace for software version 6.7607.a80c35a6;Kasli_VenusV3_Urukulv1.5Mirny:
0x4004d61c
0x4001feb0
0x4004ced4
0x400010d0
halting.
use artiq_coremgmt config write -s panic_reset 1
to restart instead `
And here is the most recent error, which has so far occurred only once:
`[ 78498.089981s] INFO(runtime::session): new connection from 192.168.1.44:54025
[ 78498.164992s] INFO(runtime::kern_hwreq): resetting RTIO
@ 0x401521fc
+0000: 00000001 401c4244 00000008 00000006
+0010: e0d9f5b4 7dfa2583 feedfeed 00001008
+0020: 40153228 136cc767 09771d9a 1150d031
+0030: a5460bdc b233358b eaf10d6e 85c475f5
@ 0x401686ec
+0000: 401521f8 40403004 4015e490 40036098
+0010: 401c4744 00000001 00000001 401542f0
+0020: 00000001 40035c5c 00000316 00000000
+0030: 00000000 00001860 00000000 401521f8
panic at runtime/main.rs:290:13: exception IllegalInsn at PC 0x401521fc, EA 0x401686ec
backtrace for software version 6.7607.a80c35a6;Kasli_VenusV3_Urukulv1.5Mirny:
0x4004d61c
0x4001feb0
0x4004ced4
0x400010d0
halting.
use artiq_coremgmt config write -s panic_reset 1
to restart instead `
Could you please advise how I can interpret the output and diagnose the cause of the crashes?
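In case it helps with comparing the two panics, here is a minimal sketch of how I would pull the exception type, PC, EA and backtrace addresses out of the UART log. It is not an ARTIQ tool: `parse_panics` and the regular expressions are my own assumptions, based only on the format of the two excerpts above.

```python
import re

# Assumed format, taken from the excerpts above:
# "panic at <file:line:col>: exception <Kind> at PC 0x..., EA 0x..."
PANIC_RE = re.compile(
    r"panic at (?P<location>\S+): exception (?P<kind>\w+) "
    r"at PC (?P<pc>0x[0-9a-fA-F]+), EA (?P<ea>0x[0-9a-fA-F]+)"
)
# A backtrace entry is assumed to be a line holding a single hex address.
ADDR_RE = re.compile(r"^0x[0-9a-fA-F]+$")


def parse_panics(log_text):
    """Return one dict per panic found in the UART log text (hypothetical helper)."""
    panics = []
    current = None
    in_backtrace = False
    for raw in log_text.splitlines():
        line = raw.strip()
        match = PANIC_RE.search(line)
        if match:
            # Start a new record for this panic.
            current = dict(match.groupdict(), backtrace=[])
            panics.append(current)
            in_backtrace = False
        elif current is not None and line.startswith("backtrace for"):
            in_backtrace = True
        elif in_backtrace and ADDR_RE.match(line):
            current["backtrace"].append(line)
        else:
            # Any other line (e.g. "halting.") ends the backtrace.
            in_backtrace = False
    return panics


if __name__ == "__main__":
    sample = "\n".join([
        "panic at runtime/main.rs:290:13: exception Alignment at PC 0x40032a7c, EA 0x40167f3c",
        "backtrace for software version 6.7607.a80c35a6;Kasli_VenusV3_Urukulv1.5Mirny:",
        "0x4004d61c",
        "0x4001feb0",
        "0x4004ced4",
        "0x400010d0",
        "halting.",
    ])
    for panic in parse_panics(sample):
        print(panic["kind"], panic["pc"], panic["ea"], panic["backtrace"])
```

Comparing the two excerpts this way shows that both panics report the same four backtrace addresses, even though the exception type, PC and EA differ.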
Operating System: Windows 10 Enterprise 64 bit
artiq version: ARTIQ v6.7605.65f0951f
Version of the gateware and runtime loaded in the core device:
software ident 6.7607.a80c35a6;Kasli_VenusV3_Urukulv1.5Mirny
gateware ident 6.7607.a80c35a6;Kasli_VenusV3_Urukulv1.5Mirny
Hardware involved: Urukul AD9910, Mirny
I tried to upload the full logs, device_db and conda environment list, but I got the error 'Uploading files of this type is not allowed' for every file type I tried.