I have recently been seeing the following error occur a few minutes after we start an experiment:

Traceback (most recent call last):
File "C:\Users\yblab\Anaconda3\envs\artiq6\Scripts\artiq_run-script.py", line 9, in <module>
sys.exit(main())
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\frontend\artiq_run.py", line 224, in main
return run(with_file=True)
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\frontend\artiq_run.py", line 210, in run
raise exn
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\frontend\artiq_run.py", line 203, in run
exp_inst.run()
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\language\core.py", line 54, in run_on_core
return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\core.py", line 137, in run
self.comm.serve(embedding_map, symbolizer, demangler)
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 651, in serve
self._read_header()
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 243, in _read_header
sync_byte = self._read(1)[0]
File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 229, in _read
new_buffer = self.socket.recv(8192, flag)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

For context, our EnvExperiment's run() consists of the instructions for a 1 s experimental cycle inside an indefinite while loop. After each cycle, RPC calls to the host post/get information to/from a different socket on the host, and there is a realtime break to recover slack after the host/core communication. A previous researcher on my project wrote a GUI-to-ARTIQ-sequence code generator (run by the same process that listens for the RPCs on the other host socket); after setting up the cycle in the GUI, the experiment is started via artiq_run on the generated code file. A stripped-down sketch of this structure is below.
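Roughly, the generated file looks like the following sketch (names are placeholders: `post_results` stands in for our actual RPC helpers, and `one_cycle` for the generated pulse sequence):

```python
from artiq.experiment import *


class CycleExperiment(EnvExperiment):
    """Simplified stand-in for the generated experiment file."""

    def build(self):
        self.setattr_device("core")

    def post_results(self, data) -> TNone:
        # RPC: runs on the host and forwards the cycle's results over a
        # separate socket to the GUI/sequence-generator process.
        pass

    @kernel
    def one_cycle(self):
        # Stand-in for the real ~1 s pulse sequence generated from the GUI.
        delay(1*s)

    @kernel
    def run(self):
        self.core.reset()
        while True:
            self.one_cycle()
            self.post_results([0.0])     # blocking RPC to the host
            self.core.break_realtime()   # recover slack after the RPC
```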

We've seen this error before in our experiment, and it seemed to be correlated with how much data we ask the RPC calls to forward to the other socket. We had previously avoided (but not solved) the issue simply by transferring fewer values this way; with that workaround we normally ran for many thousands of cycles without any issue. Starting yesterday, the problem has become more persistent and the workaround no longer helps. I have also noticed that about 10 s elapse between the last successful experimental cycle and the exception being raised; I don't know whether this is a keepalive timeout.

What's the best way to fix this? Is there an easy way to keep the connection alive in a way that lets me keep using our legacy code with minimal adjustment, or should I think about more fundamental changes to the way we schedule experimental cycles and sync information with the host (perhaps through the native ARTIQ management system)?

UART Logs

For completeness, I also restarted our ARTIQ crate and looked at the UART log on startup:
```
MiSoC Bootloader
Copyright (c) 2017-2020 M-Labs Limited

Bootloader CRC passed
Gateware ident 6.7553.2f5ea67b.beta;nist2
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111110000000000
Module 0:
00000000000111111111110000000000
Read leveling: 16+-5 16+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[ 0.000009s] INFO(runtime): ARTIQ runtime starting...
[ 0.003932s] INFO(runtime): software ident 6.7553.2f5ea67b.beta;nist2
[ 0.010475s] INFO(runtime): gateware ident 6.7553.2f5ea67b.beta;nist2
[ 0.017051s] INFO(runtime): log level set to INFO by default
[ 0.022758s] INFO(runtime): UART log level set to INFO by default
[ 0.140290s] INFO(runtime::rtio_clocking): using internal RTIO clock (by default)
[ 0.417749s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 3.691771s] INFO(board_artiq::si5324): ...locked
[ 3.721156s] INFO(runtime): network addresses: [addresses here]
[ 3.735268s] INFO(runtime::mgmt): management interface active
[ 3.764383s] INFO(runtime::session): accepting network sessions
[ 3.780055s] INFO(runtime::session): running startup kernel
[ 3.826324s] INFO(runtime::kern_hwreq): resetting RTIO
[ 4.027443s] ERROR(runtime::session): exception in flash kernel
[ 4.032092s] ERROR(runtime::session): 0:ValueError: PLL lock timeout [0, 0, 0]
[ 4.039320s] ERROR(runtime::session): at C:\Users\yblab\AppData\Local\Continuum\anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\ad9910.py:460:24 in _Z35artiq.coredevice.ad9910.AD9910.initI30artiq.coredevice.ad9910.AD9910Ezz
[ 4.059954s] INFO(runtime::session): startup kernel finished
[ 4.066328s] INFO(runtime::session): no connection, starting idle kernel
[ 4.114211s] INFO(runtime::kern_hwreq): resetting RTIO
```
The error LED on the front panel seems to remain off even after the connection error, and no errors other than the connection error are raised when I try to run an experiment. Should I be worried about any of the lines flagged "error"?

Any advice would be appreciated! Thanks!

Tentatively fixed by setting the keepalive timeout and interval to smaller values in initialize_connection() in ARTIQ 6's artiq/coredevice/comm.py, from
`set_keepalive(sock, 10, 10, 3)`
to
`set_keepalive(sock, 0.25, 0.25, 3)`
I'll monitor whether the error reappears under these new parameters.
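For anyone curious what those numbers mean on a Windows host: as I understand it, the keepalive parameters end up going through SIO_KEEPALIVE_VALS, which takes the idle time and the probe interval in milliseconds. A rough illustration (not the ARTIQ source verbatim):

```python
import socket

def set_keepalive_sketch(sock, after_idle_s, interval_s):
    # Rough illustration for Windows only (not the ARTIQ source verbatim):
    # SIO_KEEPALIVE_VALS takes (enable, idle time in ms, probe interval in ms),
    # so the new values above amount to 250 ms idle and 250 ms between probes.
    sock.ioctl(socket.SIO_KEEPALIVE_VALS,
               (1, int(after_idle_s * 1000), int(interval_s * 1000)))
```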

24 days later

This problem seems to have popped up again, so I looked at the network traffic between the host and the core device.

The things that look different to me on the cycle that fails are (counting from the last duplicate acknowledgement before the failure):

  • ARTIQ doesn't send a keepalive after the second packet exchange
  • The third packet isn't the typical size, and it gets retransmitted twice with no acknowledgement from the host
  • The host sends a keepalive request
  • ARTIQ immediately terminates the connection after receiving the keepalive request

Any insight would be appreciated!

20 days later

This may happen due to overheating. Check that the fans are spinning at full speed; you could also try adding external fans for extra cooling.