For the past few months, we have been seeing sporadic MonInj-related errors that result in outputs (i.e. TTL, DDS) not working. We are using ARTIQ 6 and a KC705 with custom gateware (a bunch of TTLs and SPI channels), and the error appears at dashboard startup. Every time the dashboard is (re)started, there is some probability that the error occurs, and we do not know what influences this probability. When the error occurs, device outputs are non-functional; when it does not, everything works fine. Restarting the master or rebooting the KC705 does not consistently fix the problem. The device had not been used for a while; the last time everything worked was November 2020, using ARTIQ 5 and gateware matching that ARTIQ version.

Based on our observations, we suspect that the MonInj error causes (a subset of) the device outputs to be overridden, which makes them non-functional. This might be caused by an error while setting up the MonInj connection, but that is pure speculation. We are confident that the network is not the cause.

Has anyone else seen this behavior? The error is shown below.

[nix-shell:~/red-chamber]$ artiq_dashboard 

(process:23064): Gtk-WARNING **: 16:41:29.782: Locale not supported by C library.
	Using the fallback 'C' locale.
Gtk-Message: 16:41:29.844: Failed to load module "canberra-gtk-module"
Gtk-Message: 16:41:29.845: Failed to load module "canberra-gtk-module"
qt.glx: qglx_findConfig: Failed to finding matching FBConfig for QSurfaceFormat(version 2.0, options QFlags<QSurfaceFormat::FormatOption>(), depthBufferSize -1, redBufferSize 1, greenBufferSize 1, blueBufferSize 1, alphaBufferSize -1, stencilBufferSize -1, samples -1, swapBehavior QSurfaceFormat::SingleBuffer, swapInterval 1, colorSpace QSurfaceFormat::DefaultColorSpace, profile  QSurfaceFormat::NoProfile)
No XVisualInfo for format QSurfaceFormat(version 2.0, options QFlags<QSurfaceFormat::FormatOption>(), depthBufferSize -1, redBufferSize 1, greenBufferSize 1, blueBufferSize 1, alphaBufferSize -1, stencilBufferSize -1, samples -1, swapBehavior QSurfaceFormat::SingleBuffer, swapInterval 1, colorSpace QSurfaceFormat::DefaultColorSpace, profile  QSurfaceFormat::NoProfile)
Falling back to using screens root_visual.
INFO:dashboard:root:ARTIQ dashboard 6.7602.ec4270fb connected to ::1
ERROR:dashboard:artiq.dashboard.moninj:lost connection to core device moninj
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/nix/store/35pv6fwf7xykj8dqym8p5rjp5jk0lxi6-python3-3.8.8-env/lib/python3.8/site-packages/artiq/dashboard/moninj.py", line 472, in stop
    await self.dm.close()
  File "/nix/store/35pv6fwf7xykj8dqym8p5rjp5jk0lxi6-python3-3.8.8-env/lib/python3.8/site-packages/artiq/dashboard/moninj.py", line 421, in close
    await asyncio.wait_for(self.core_connector_task, None)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
    return await fut
  File "/nix/store/35pv6fwf7xykj8dqym8p5rjp5jk0lxi6-python3-3.8.8-env/lib/python3.8/site-packages/artiq/dashboard/moninj.py", line 396, in core_connector
    await self.core_connection.close()
  File "/nix/store/35pv6fwf7xykj8dqym8p5rjp5jk0lxi6-python3-3.8.8-env/lib/python3.8/site-packages/artiq/coredevice/comm_moninj.py", line 54, in close
    await asyncio.wait_for(self._receive_task, None)
  File "/nix/store/qy5z9gcld7dljm4i5hj3z8a9l6p37y81-python3-3.8.8/lib/python3.8/asyncio/tasks.py", line 455, in wait_for
    return await fut
  File "/nix/store/35pv6fwf7xykj8dqym8p5rjp5jk0lxi6-python3-3.8.8-env/lib/python3.8/site-packages/artiq/coredevice/comm_moninj.py", line 91, in _receive_cr
    channel, override, value = struct.unpack(
struct.error: unpack requires a buffer of 6 bytes
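
The final `struct.error` is what Python raises when `struct.unpack` is handed fewer bytes than its format requires. The following is a minimal sketch of that failure mode, not the ARTIQ code: the format string `>lbb` (a 4-byte channel plus one byte each for override and value), the helper `decode_status`, and the sample bytes are all assumptions chosen only to give a 6-byte message like the one named in the traceback.

```python
import struct

# Assumed layout for illustration only: 4-byte channel, 1-byte override, 1-byte value.
FMT = ">lbb"


def decode_status(payload):
    """Return (channel, override, value), or None if the payload is truncated."""
    if len(payload) != struct.calcsize(FMT):
        return None
    return struct.unpack(FMT, payload)


full = struct.pack(FMT, 5, 1, 0)
truncated = full[:3]  # simulates a stream that ended in the middle of a message

print(decode_status(full))
print(decode_status(truncated))

try:
    # Unpacking the truncated payload directly raises struct.error.
    struct.unpack(FMT, truncated)
except struct.error as exc:
    print("struct.error:", exc)
```

Checking the length before unpacking turns the exception into a value the caller can handle, which may be useful when trying to tell a truncated message apart from a genuinely malformed one.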

a month later

@sb10q, I have tried again with the latest gateware and the latest ARTIQ 6 version, but the problem persists. In addition to the original error message posted above, I was occasionally able to get the error message firmware.runtime.moninj:moninj aborted: unexpected end of stream.
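
Both messages would be consistent with a connection that ends in the middle of a message, although that is a guess rather than a diagnosis. The sketch below uses only the standard `asyncio.StreamReader` to show the difference between `read(n)`, which may return fewer than `n` bytes, and `readexactly(n)`, which raises `asyncio.IncompleteReadError` when the stream ends early. The names `read_message` and `demo` and the 6-byte message size are hypothetical and not taken from ARTIQ.

```python
import asyncio


async def read_message(reader, size):
    """Read exactly `size` bytes, or return None if the stream ends first."""
    try:
        return await reader.readexactly(size)
    except asyncio.IncompleteReadError:
        return None


async def demo():
    # First reader: only 3 of the 6 expected bytes arrive before the peer closes.
    reader = asyncio.StreamReader()
    reader.feed_data(b"\x00\x00\x00")
    reader.feed_eof()
    short = await reader.read(6)  # read() may return fewer bytes than requested

    # Second reader: same truncated data, but read with readexactly().
    reader2 = asyncio.StreamReader()
    reader2.feed_data(b"\x00\x00\x00")
    reader2.feed_eof()
    exact = await read_message(reader2, 6)
    return short, exact


short, exact = asyncio.run(demo())
print(len(short), exact)
```

A short result from `read()` passed straight to a fixed-size `struct.unpack` would fail in the way shown in the traceback, whereas `readexactly()` reports the early end of stream explicitly.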

I took Wireshark captures (two with the error, one without), but I do not know enough about MonInj to make anything useful of them. Would you mind taking a look to see if you can find any clues? 10.236.50.91 is the KC705. Files can be downloaded here.

12 days later

@sb10q, let me know if you are interested in looking into these captures. If not, we will probably go through the MonInj code ourselves.

I recently updated to ARTIQ 7 (from ARTIQ 4) and believe I am seeing this as well. Sporadically, when running artiq_session, I get ERROR:dashboard:artiq.dashboard.moninj:lost connection to core device moninj. Attempting to click any of the buttons on the TTL panel then causes the dashboard to crash. I do not have a full error message to post because it has not happened recently, and I am unsure what triggers it.

P.S. This is with a Kasli as the core device.

5 days later

We think we have found the issue, but I am not able to resolve it myself. See https://github.com/m-labs/artiq/issues/1727. We reproduced it on both the KC705 and Kasli.

We are currently working on mitigations. The options we are exploring are:

  1. Removing the MonInj feature from the gateware (we have not yet tested what other side effects that would have).
  2. Filtering MonInj connections so that they do not reach the core device. As long as we do not connect to MonInj, the gateware seems to be stable.

Once I have any updates regarding the mitigations, I will add them to the issue, so this thread can be considered closed. Thanks for the feedback!

It's probably easier to disable the connection in the dashboard: it's just Python, and nothing needs to be recompiled or flashed.