I have recently been seeing the following error occur a few minutes after we start an experiment:
Traceback (most recent call last):
  File "C:\Users\yblab\Anaconda3\envs\artiq6\Scripts\artiq_run-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\frontend\artiq_run.py", line 224, in main
    return run(with_file=True)
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\frontend\artiq_run.py", line 210, in run
    raise exn
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\frontend\artiq_run.py", line 203, in run
    exp_inst.run()
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\language\core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\core.py", line 137, in run
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 651, in serve
    self._read_header()
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 243, in _read_header
    sync_byte = self._read(1)[0]
  File "C:\Users\yblab\Anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 229, in _read
    new_buffer = self.socket.recv(8192, flag)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
For context, the run() method of our EnvExperiment consists of the instructions for a 1 s experimental cycle inside an infinite while loop. After each cycle, RPCs to the host post information to, and get information from, a different socket on the host. There is a break in the real-time sequence to allow for this host/core communication. A previous researcher on my project wrote a GUI-to-ARTIQ-sequence code generator, which runs in the same host process that listens on that other socket for the RPC traffic. After the cycle is set up in the GUI, the experiment is started by running artiq_run on the generated code file.
We've seen this error before in our experiment, and it seemed to correlate with how much information the RPCs ask the host to send to the other socket. Until now we have avoided (but not solved) the issue simply by transferring fewer values this way; with that workaround, we normally see no problems for many thousands of cycles. Since yesterday, the problem has become more persistent and the workaround no longer helps. I have noticed that roughly 10 s elapse between the last successful experimental cycle and the exception being raised. I don't know whether this is due to a keepalive timeout.
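To pin down that gap more precisely than by eye, a small host-side timer could be called once per cycle from the RPC handler. The following is only a minimal sketch: `CycleTimer`, `mark` and `seconds_since_last` are hypothetical names, none of this is part of ARTIQ, and where the ConnectionResetError can actually be caught on the host depends on how our generated code is launched.

```python
import time


class CycleTimer:
    """Hypothetical host-side helper (not part of ARTIQ).

    Remembers when the most recent experimental cycle reported in,
    so the delay before a later connection error can be measured.
    """

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the helper can be tested without waiting.
        self._clock = clock
        self._last = None
        self.cycles = 0

    def mark(self):
        """Call once per cycle, e.g. at the top of the host-side RPC handler."""
        self._last = self._clock()
        self.cycles += 1

    def seconds_since_last(self):
        """Elapsed seconds since the last mark(), or None if no cycle was marked."""
        if self._last is None:
            return None
        return self._clock() - self._last
```

Logging `cycles` and `seconds_since_last()` at the point where the exception surfaces would show whether the delay is consistently the same, which is what I would expect from a fixed timeout.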
What's the best way to fix this? Is there an easy way to keep the connection alive in a way that lets me keep using our legacy code with minimal adjustment, or should I think about more fundamental changes to the way we schedule experimental cycles and sync information with the host (perhaps through the native ARTIQ management system)?
UART Logs
For completeness, I also restarted our ARTIQ crate and looked at the UART log on startup:
```
MiSoC Bootloader
Copyright (c) 2017-2020 M-Labs Limited
Bootloader CRC passed
Gateware ident 6.7553.2f5ea67b.beta;nist2
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111110000000000
Module 0:
00000000000111111111110000000000
Read leveling: 16+-5 16+-5 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000009s] INFO(runtime): ARTIQ runtime starting...
[ 0.003932s] INFO(runtime): software ident 6.7553.2f5ea67b.beta;nist2
[ 0.010475s] INFO(runtime): gateware ident 6.7553.2f5ea67b.beta;nist2
[ 0.017051s] INFO(runtime): log level set to INFO by default
[ 0.022758s] INFO(runtime): UART log level set to INFO by default
[ 0.140290s] INFO(runtime::rtio_clocking): using internal RTIO clock (by default)
[ 0.417749s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 3.691771s] INFO(board_artiq::si5324): ...locked
[ 3.721156s] INFO(runtime): network addresses: [addresses here]
[ 3.735268s] INFO(runtime::mgmt): management interface active
[ 3.764383s] INFO(runtime::session): accepting network sessions
[ 3.780055s] INFO(runtime::session): running startup kernel
[ 3.826324s] INFO(runtime::kern_hwreq): resetting RTIO
[ 4.027443s] ERROR(runtime::session): exception in flash kernel
[ 4.032092s] ERROR(runtime::session): 0:ValueError: PLL lock timeout [0, 0, 0]
[ 4.039320s] ERROR(runtime::session): at C:\Users\yblab\AppData\Local\Continuum\anaconda3\envs\artiq6\lib\site-packages\artiq\coredevice\ad9910.py:460:24 in _Z35artiq.coredevice.ad9910.AD9910.initI30artiq.coredevice.ad9910.AD9910Ezz
[ 4.059954s] INFO(runtime::session): startup kernel finished
[ 4.066328s] INFO(runtime::session): no connection, starting idle kernel
[ 4.114211s] INFO(runtime::kern_hwreq): resetting RTIO
```
The error LED on the front panel stays off even after the connection error, and no errors other than the connection error are raised when I try to run an experiment. Should I be worried about any of the lines logged at ERROR level above, i.e. the "PLL lock timeout" exception in the flash (startup) kernel?
Any advice would be appreciated! Thanks!