New Hardware Connection Error

wpc3 · Sep 13, 2022

Hi,

I am a grad student at the University of Illinois, working in Brian DeMarco's trapped ion group. We have been using ARTIQ for about six months while setting up our experiment, and a connection error has recently caused roughly half of our experiment runs to fail. The primary error is "TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" (full traceback below). It appears to arise spontaneously, blocks the connection to the crate for about a minute, and then clears. When we monitor the UART with PuTTY, the boot log is printed when the error "ends," and Wireshark shows no communication with the crate while the error is active.

While the error is active, any script we try to run throws the error above, regardless of its content. Our scripts involve simple DDS and TTL pulse sequences and setting DAC values, and we have also seen seemingly random DDS and TTL channels turn off mid-sequence, which we take as another sign of the same problem.

I have tried disconnecting everything from our crate and computer except the ethernet communication and USB-UART lines, changing the ethernet cable, and changing the computer's ethernet port without being able to identify the cause.

Error traceback:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\Scripts\artiq_run-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\frontend\artiq_run.py", line 224, in main
    return run(with_file=True)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\frontend\artiq_run.py", line 210, in run
    raise exn
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\frontend\artiq_run.py", line 203, in run
    exp_inst.run()
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\language\core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\core.py", line 132, in run
    self.comm.check_system_info()
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 342, in check_system_info
    self._write_empty(Request.SystemInfo)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 310, in _write_empty
    self._write_header(ty)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 302, in _write_header
    self.open()
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 188, in open
    self.socket = initialize_connection(self.host, self.port)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm.py", line 25, in initialize_connection
    sock = socket.create_connection((host, port))
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\socket.py", line 808, in create_connection
    raise err
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\socket.py", line 796, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
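
The call that fails at the bottom of the traceback is socket.create_connection, so the outage can be watched independently of ARTIQ with a plain TCP probe. The following is only a rough sketch: the names probe and watch are made up for illustration, the address is the one reported in the boot log further down this thread, and port 1381 is an assumption (it is, as far as I know, the default core device port). The probe does not speak the ARTIQ protocol, so the firmware may log a rejected connection, and connecting may disturb a running kernel; it is probably best used only while no experiment is active.

```python
import socket
import time

CRATE_HOST = "192.168.1.75"  # IPv4 address from the boot log in this thread
CRATE_PORT = 1381            # assumption: default ARTIQ core device port


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (such as WinError 10060) and refused connections.
        return False


def watch(host, port, interval=5.0, count=None):
    """Print a timestamped line each time the link changes state.

    Runs forever when count is None, otherwise stops after count probes.
    """
    last = None
    done = 0
    while count is None or done < count:
        ok = probe(host, port)
        if ok != last:
            state = "reachable" if ok else "UNREACHABLE"
            print(time.strftime("%H:%M:%S"), state)
            last = ok
        done += 1
        time.sleep(interval)
```

Calling watch(CRATE_HOST, CRATE_PORT) and leaving it running gives timestamps for the start and end of each outage, which can then be compared against the UART output and the Wireshark capture.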

Any help would be greatly appreciated!
Thank you,
Will

steine01 · Sep 13, 2022

Hi, yes, this is a known issue, but for me it always required a power cycle to get ARTIQ working again. For now, check out the corresponding issue: https://github.com/m-labs/artiq/issues/1759
I still have to investigate it more closely, but it looks like setting up a dedicated network for the Kasli helps (instead of using a switch in a public network). Also, for some reason, since I commented out line 142 in C:\Users\ybion443\.conda\envs\artiq-7-1\Lib\site-packages\artiq\firmware\runtime\main.rs (smoltcp::phy::EthernetTracer::new(net_device, net_trace_fn)) I haven't had this error again. This seems to be the offending part, but I do not yet see why commenting the line out has any effect without recompiling ARTIQ, so I have not updated the issue yet.

wpc3 · Sep 13, 2022

steine01 Thanks for the response. Unfortunately, commenting that line out did not fix the issue for us. We're also on a private network, with the ethernet cable running directly from the computer to the crate, so I will need to investigate our particular issue further. I am using ARTIQ-6, so I might want to try your trick on ARTIQ-7 before anything else.

lriesebos · Sep 15, 2022

Hi @wpc3, Leon here from Duke University. I saw your email to Ken, but will reply on this forum to keep the discussion public.

I read through the issue, and I do not think I have seen something like that before, but I'm happy to think along. I was wondering, did you eventually manage to get the UART output from the Kasli over the USB connection? If so, I would like to see those messages.

wpc3 · Sep 15, 2022

lriesebos Hi Leon,

Thanks so much. Monitoring the UART with PuTTY showed us when the crate began communicating again and could run scripts (typically for only a minute before the error returned). The output in that case was the typical boot log:

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2021 M-Labs Limited

Bootloader CRC passed
Gateware ident 6.7666.20dc923c;illinoismaster
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111110000000000
Module 0:
00000000000111111111111000000000
Read leveling: 16+-5 16+-6 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000009s]  INFO(runtime): ARTIQ runtime starting...
[     0.003929s]  INFO(runtime): software ident 6.7666.20dc923c;illinoismaster
[     0.010820s]  INFO(runtime): gateware ident 6.7666.20dc923c;illinoismaster
[     0.017756s]  INFO(runtime): log level set to INFO by default
[     0.023456s]  INFO(runtime): UART log level set to INFO by default
[     0.141013s]  INFO(runtime::rtio_clocking): using internal RTIO clock
[     0.417419s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     2.804801s]  INFO(board_artiq::si5324):   ...locked
[     2.835759s]  INFO(runtime): network addresses: MAC=04-91-62-f1-ec-ea IPv4=192.168.1.75 IPv6-LL=fe80::691:62ff:fef1:ecea IPv6=no configured address
[     2.850827s]  WARN(board_artiq::drtio_routing): could not read routing table from configuration, using default
[     2.859608s]  INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; }
[     2.888716s]  INFO(runtime::mgmt): management interface active
[     2.903047s]  INFO(runtime::session): accepting network sessions
[     2.919115s]  INFO(runtime::session): running startup kernel
[     2.923623s]  INFO(runtime::session): no startup kernel found
[     2.929817s]  INFO(runtime::session): no connection, starting idle kernel
[     2.936252s]  INFO(runtime::session): no idle kernel found
[     2.941652s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up

The only thing of note (in fact, the only difference at all) is that the Si5324 lock occurs about 5 seconds faster than when restarting with flash.

lriesebos · Sep 16, 2022

Do you get any message in the log when the error occurs?

And were you able to replicate this error with some other gateware, for example standalone gateware?

wpc3 · Sep 16, 2022

lriesebos When the error starts, I can't see any communication from the crate, either on the UART or in Wireshark. The boot log is the first thing the crate sends once the error "stops." I haven't yet had a way to try replicating the error on other gateware, but we do have another recently obtained crate which I am hoping to use for this soon (currently iterating on getting AFWS to work with this crate). Thanks
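
Since the boot log is the only output around each outage, one way to count how often the crate restarts is to scan a saved UART capture for the runtime's startup line. This is a minimal sketch, assuming the PuTTY session has been logged to a text file; find_reboots is a made-up helper name, and the pattern matching is based only on the log lines posted in this thread. It treats a line as a restart if it contains "ARTIQ runtime starting" or if the bracketed uptime is lower than on the previous timestamped line.

```python
import re

# Matches the bracketed uptime at the start of a runtime log line,
# for example "[     2.903047s]  INFO(runtime::session): ...".
STAMP = re.compile(r"^\[\s*(\d+\.\d+)s\]")
START_MARKER = "ARTIQ runtime starting"


def find_reboots(lines):
    """Return the indices of lines where the runtime appears to restart."""
    reboots = []
    prev = None
    for i, raw in enumerate(lines):
        line = raw.strip()
        m = STAMP.match(line)
        if not m:
            continue  # bootloader banner and other untimed lines
        uptime = float(m.group(1))
        # A restart shows up as the startup message, or as uptime going
        # backwards if the startup message itself was missed.
        if START_MARKER in line or (prev is not None and uptime < prev):
            reboots.append(i)
        prev = uptime
    return reboots
```

For a capture saved as, say, a hypothetical uart.log, find_reboots(open("uart.log").read().splitlines()) returns one index per restart, so the length of the result is the number of reboots seen.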

lriesebos · Sep 16, 2022

So basically you get this boot log, then if you wait until the error occurs no additional messages appear, and then the boot log again?

Does the reboot happen automatically or do you have to power cycle it?

wpc3 · Sep 21, 2022

The reboot would happen automatically and the error would persist across quick power cycles. The issue is now solved, actually: we noticed that the error took longer to appear if the crate had been powered down for a long time, coming back roughly 10 minutes after power-on. This was reproducible, so we assumed something was overheating; indeed, the fan on the Kasli board had stopped spinning. We removed the fan and used a small amount of WD-40 near the bearings, and after reinstalling the fan the error has since gone away. Really appreciate the help!
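
The cold-start observation above can be put into numbers by recording reachability samples from power-on and measuring how long the crate stays up before the first failure. The sketch below is illustrative only: the helper names and the sample values in the usage note are made up, and samples are assumed to be (seconds since power-on, reachable) pairs gathered by whatever probe is available.

```python
def outages(samples):
    """Collapse (time_s, reachable) samples into (start, end) outage intervals.

    An outage still in progress at the end of the data has end set to None.
    """
    result = []
    start = None
    for t, ok in samples:
        if not ok and start is None:
            start = t                    # first failed sample of an outage
        elif ok and start is not None:
            result.append((start, t))    # link is back, close the interval
            start = None
    if start is not None:
        result.append((start, None))
    return result


def time_to_first_failure(samples):
    """Seconds from the first sample to the first failed sample, or None."""
    for t, ok in samples:
        if not ok:
            return t - samples[0][0]
    return None
```

With made-up samples such as [(0, True), (300, True), (600, False), (660, True)], time_to_first_failure gives 600 and outages gives [(600, 660)]. A time to first failure that grows with the length of the preceding power-off period would be consistent with a thermal cause.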

sb10q · Sep 22, 2022

There are known issues with the Radian Heatsink fans installed on many Kaslis - for systems purchased from M-Labs under warranty, we have replacement kits with higher-quality fans which hopefully should last much longer.
Unfortunately, the WD-40 solution will probably only last a few days (with a proper lubricant, some months).

andrewvh4 · Sep 22, 2022

@sb10q Is the issue resolved for Kaslis purchased more recently? And if so, is there a certain date before which we should assume any purchased Kaslis have the poorer fan?