Hi,

I am a grad student at the University of Illinois, working in Brian DeMarco's trapped ion group. We have been using ARTIQ for about six months while setting up our experiment, and a connection error has recently caused roughly half of our experiment runs to fail. The primary error is "TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond" (full traceback below). It appears to arise spontaneously, block the connection to the crate for about a minute, and then clear. When we monitor the UART with PuTTY, the boot log is printed when the error "ends," and Wireshark shows no communication with the crate while the error is active.

While the error is active, it is thrown whenever we try to run a script, regardless of the script's content. Our scripts involve simple DDS and TTL pulse sequences and setting DAC values, and as another symptom we have seen seemingly random DDS and TTL channels turn off mid-sequence.

I have tried disconnecting everything from our crate and computer except the Ethernet and USB-UART lines, changing the Ethernet cable, and changing the computer's Ethernet port, but none of this has identified the cause.

Error traceback:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\Scripts\artiq_run-script.py", line 9, in <module>
    sys.exit(main())
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\frontend\artiq_run.py", line 224, in main
    return run(with_file=True)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\frontend\artiq_run.py", line 210, in run
    raise exn
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\frontend\artiq_run.py", line 203, in run
    exp_inst.run()
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\language\core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\core.py", line 132, in run
    self.comm.check_system_info()
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 342, in check_system_info
    self._write_empty(Request.SystemInfo)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 310, in _write_empty
    self._write_header(ty)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 302, in _write_header
    self.open()
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm_kernel.py", line 188, in open
    self.socket = initialize_connection(self.host, self.port)
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\site-packages\artiq\coredevice\comm.py", line 25, in initialize_connection
    sock = socket.create_connection((host, port))
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\socket.py", line 808, in create_connection
    raise err
  File "C:\ProgramData\Anaconda3\envs\artiq-6-illinois\lib\socket.py", line 796, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Any help would be greatly appreciated!
Thank you,
Will
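
For anyone trying to characterize an outage like this, here is a minimal sketch of a reachability logger. It repeats the same `socket.create_connection` call that fails in the traceback above, but with a short explicit timeout, and records only the moments when the result changes. The names `probe`, `transitions`, and `monitor` are made up for this example. The address 192.168.1.75 is the one reported in the boot log later in this thread, and port 1381 is assumed to be the core device's default port, so adjust both for your setup. I have not checked how the runtime treats an extra connection while a kernel is running, so it is probably safest to run this only while the crate is otherwise idle.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Try one TCP connection and classify the outcome.

    Returns "open", "refused", or "no_response" (timeouts and other
    OS-level failures such as WinError 10060 land in the last group).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "no_response"


def transitions(samples):
    """Reduce (timestamp, state) samples to the points where the state changes."""
    changes = []
    previous = None
    for timestamp, state in samples:
        if state != previous:
            changes.append((timestamp, state))
            previous = state
    return changes


def monitor(host, port, interval=1.0, duration=60.0):
    """Probe repeatedly for `duration` seconds and return the state changes."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.time(), probe(host, port)))
        time.sleep(interval)
    return transitions(samples)


# Example use (assumed address and port, see the note above):
#   for timestamp, state in monitor("192.168.1.75", 1381, duration=600.0):
#       print(time.strftime("%H:%M:%S", time.localtime(timestamp)), state)
```

Comparing the printed change times against power-up time and against the UART capture gives a rough measure of how long the crate stays reachable and how long each outage lasts.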

Hi, yes, this is a known issue, but for me it always required a power cycle to get ARTIQ working again. For now, check out the corresponding issue: https://github.com/m-labs/artiq/issues/1759
I still have to investigate it more closely, but it looks like setting up a dedicated network for the Kasli helps (instead of using a switch on a public network). Also, for some reason, since I commented out line 142 in C:\Users\ybion443\.conda\envs\artiq-7-1\Lib\site-packages\artiq\firmware\runtime\main.rs (smoltcp::phy::EthernetTracer::new(net_device, net_trace_fn)) I haven't had this error again. This seems to be the offending part, but I do not yet see why commenting out that line has any effect without recompiling ARTIQ, so I have not updated the issue yet.


    steine01 Thanks for the response. Unfortunately, commenting out that line did not fix the issue for us. We're also already on a private network, with the Ethernet cable running directly from the computer to the crate, so I will need to investigate our particular issue further. I am using ARTIQ-6, so I might try your trick on ARTIQ-7 before anything else.

    Hi @wpc3, Leon here from Duke University. I saw your email to Ken, but I will reply on this forum to keep the discussion public.

    I read through the issue, and I do not think I have seen something like that before, but I'm happy to think along. Did you eventually manage to get the UART output from the Kasli over the USB connection? If so, I would like to see those messages.


      lriesebos Hi Leon,

      Thanks so much. Monitoring the UART with PuTTY showed us when the crate began communicating again and could run scripts (typically for only about a minute before the error returned). The output in that case was the typical boot log:

      __  __ _ ____         ____
      |  \/  (_) ___|  ___  / ___|
      | |\/| | \___ \ / _ \| |
      | |  | | |___) | (_) | |___
      |_|  |_|_|____/ \___/ \____|
      
      MiSoC Bootloader
      Copyright (c) 2017-2021 M-Labs Limited
      
      Bootloader CRC passed
      Gateware ident 6.7666.20dc923c;illinoismaster
      Initializing SDRAM...
      Read leveling scan:
      Module 1:
      00000000000111111111110000000000
      Module 0:
      00000000000111111111111000000000
      Read leveling: 16+-5 16+-6 done
      SDRAM initialized
      Memory test passed
      
      Booting from flash...
      Starting firmware.
      [     0.000009s]  INFO(runtime): ARTIQ runtime starting...
      [     0.003929s]  INFO(runtime): software ident 6.7666.20dc923c;illinoismaster
      [     0.010820s]  INFO(runtime): gateware ident 6.7666.20dc923c;illinoismaster
      [     0.017756s]  INFO(runtime): log level set to INFO by default
      [     0.023456s]  INFO(runtime): UART log level set to INFO by default
      [     0.141013s]  INFO(runtime::rtio_clocking): using internal RTIO clock
      [     0.417419s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
      [     2.804801s]  INFO(board_artiq::si5324):   ...locked
      [     2.835759s]  INFO(runtime): network addresses: MAC=04-91-62-f1-ec-ea IPv4=192.168.1.75 IPv6-LL=fe80::691:62ff:fef1:ecea IPv6=no configured address
      [     2.850827s]  WARN(board_artiq::drtio_routing): could not read routing table from configuration, using default
      [     2.859608s]  INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; }
      [     2.888716s]  INFO(runtime::mgmt): management interface active
      [     2.903047s]  INFO(runtime::session): accepting network sessions
      [     2.919115s]  INFO(runtime::session): running startup kernel
      [     2.923623s]  INFO(runtime::session): no startup kernel found
      [     2.929817s]  INFO(runtime::session): no connection, starting idle kernel
      [     2.936252s]  INFO(runtime::session): no idle kernel found
      [     2.941652s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up

      The only thing of note (in fact, the only difference at all) is that the Si5324 lock occurs about 5 seconds faster than when restarting with flash.
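
To compare the lock time between captures without reading it off by eye, a small sketch like the following could pull it out of a saved UART log. It relies only on the timestamp format and the two Si5324 messages visible in the boot log above. The function name `si5324_lock_seconds` is made up for this example.

```python
import re

# Matches the firmware timestamp prefix, e.g. "[     2.804801s]".
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)s\]")


def si5324_lock_seconds(lines):
    """Return seconds between "waiting for Si5324 lock..." and "...locked".

    Returns None if the pair of messages is not found in `lines`.
    """
    start = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # bootloader banner and other untimestamped lines
        t = float(match.group(1))
        if "waiting for Si5324 lock" in line:
            start = t
        elif "...locked" in line and start is not None:
            return t - start
    return None


# Example use with a saved capture (file name is hypothetical):
#   with open("putty.log", errors="replace") as f:
#       print(si5324_lock_seconds(f.read().splitlines()))
```

For the log pasted above, this works out to 2.804801 s minus 0.417419 s, or about 2.39 s.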

        Do you get any message in the log when the error occurs?

        And were you able to replicate this error with some other gateware, for example standalone gateware?


        lriesebos When the error starts, there isn't any communication from the crate that I can see on the UART or in Wireshark. The boot log is the first thing the crate sends once the error "stops." I haven't yet had a way to try replicating the error on other gateware, but we do have another recently obtained crate which I am hoping to use for this soon (I am currently iterating on getting AFWS to work with this crate). Thanks.

        So basically you get this boot log, then if you wait until the error occurs no additional messages appear, and then the boot log again?

        Does the reboot happen automatically or do you have to power cycle it?
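
One way to answer the "boot log, silence, boot log again" question from a long capture is to count how often the runtime start banner appears in it. A minimal sketch, assuming the UART output has been saved to a text file (for example through PuTTY's session logging) and that the banner text matches the "ARTIQ runtime starting..." line in the boot log above. The names `find_boots` and `count_reboots` are made up for this example.

```python
# Text of the first runtime message in the boot log shown earlier in the thread.
BOOT_MARKER = "ARTIQ runtime starting..."


def find_boots(lines):
    """Return the 0-based indices of lines that contain the runtime start banner."""
    return [i for i, line in enumerate(lines) if BOOT_MARKER in line]


def count_reboots(lines):
    """Count boots after the first one seen in the capture.

    Assumes the capture began before the first boot and that nobody
    power cycled the crate on purpose while it was being recorded.
    """
    return max(len(find_boots(lines)) - 1, 0)


# Example use with a saved capture (file name is hypothetical):
#   with open("putty.log", errors="replace") as f:
#       lines = f.read().splitlines()
#   print(find_boots(lines), count_reboots(lines))
```

More than one banner in a capture taken without touching the power would point to the board resetting on its own rather than only dropping the network link.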

        5 days later

        The reboot would happen automatically, and the error would persist across quick power cycles. The issue is now solved, actually: we noticed that if the crate had been powered down for a long time, the error took roughly 10 minutes after power-up to come back. This was reproducible, so we suspected something was overheating; indeed, the fan on the Kasli board had stopped spinning. We removed the fan, applied a small amount of WD-40 near the bearings, and reinstalled it, and the error has not returned since. Really appreciate the help!

        There are known issues with the Radian Heatsink fans installed on many Kaslis - for systems purchased from M-Labs under warranty, we have replacement kits with higher-quality fans which hopefully should last much longer.
        Unfortunately, the WD-40 solution will probably only last a few days (a proper lubricant would last some months).

        @sb10q Is the issue resolved for Kaslis purchased more recently? And if so, is there a certain date before which we should assume any purchased Kaslis have the poorer fan?