ARTIQ version 5 is no longer supported and AFAIK this bug is likely fixed in newer major versions. Please upgrade.
quamash._QEventLoop:Event callback failed
Now running ARTIQ dashboard version 7.8123.3038639.
For some reason, when I run artiq_master and then the dashboard, the repo scanner throws "asyncio: pipe accept failed" errors for most files. The behavior is not always the same. The error is raised from an asyncio file that handles Windows events asynchronously (windows_events.py), which doesn't shed any light on the actual source of the error, at least to me. Below is what the logger reports after failing to load the "arguments_demo.py" file, which is just a demonstration of how a script can be made to appear in the dashboard. If I trigger many repo scans, it only works once in a while; I just got 3 successes out of 20 scans, run from the command prompt with "artiq_client scan-repository". I've trimmed my repo down to this single file just for testing purposes.
artiq.master.experiments:Skipping file 'arguments_demo.py'
Traceback (most recent call last):
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\site-packages\artiq\master\experiments.py", line 63, in scan
    await self.process_file(entry_dict, root, filename)
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\site-packages\artiq\master\experiments.py", line 26, in process_file
    description = await self.worker.examine(
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\site-packages\artiq\master\worker.py", line 313, in examine
    await self.worker_action({"action": "examine", "file": file},
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\site-packages\artiq\master\worker.py", line 248, in worker_action
    await self.send(obj)
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\site-packages\artiq\master\worker.py", line 173, in _send
    f.result() # raise any exceptions
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\site-packages\sipyco\pipe_ipc.py", line 174, in drain
    await self.writer.drain()
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\asyncio\streams.py", line 359, in drain
    raise exc
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\asyncio\proactor_events.py", line 397, in loop_writing
    self.write_fut = self.loop.proactor.send(self._sock, data)
  File "C:\ProgramData\Miniconda3\envs\artiq-avon-new\lib\asyncio\windows_events.py", line 539, in send
    ov.WriteFile(conn.fileno(), buf)
BrokenPipeError: [WinError 232] The pipe is being closed
What python version did conda install?
Please wipe the entire conda installation and install a fresh one from a recent anaconda installer.
Or better yet, install Linux.
conda installed python 3.10.4.
I'm also facing similar behavior in the older, previously stable conda environment (ARTIQ 5.7122.929b04da, which uses Python 3.5.6). Here I don't get broken pipes when scanning the repo, only when attempting to submit experiments. I find that for many experiments the first submit fails, but if I double-click submit, the second submit is successful.
When I try to track down exactly where the error crops up in the ARTIQ codebase, I find it in tools.py, in the asyncio_wait_or_cancel function, which seems to be waiting for results to come back through the pipe after they are written earlier by the _send method of the Worker class in artiq/master/worker.py.
Does this suggest any particular action? Perhaps I could run some simpler examples using sipyco's pipe_ipc classes to send data between local processes, to confirm that this IPC functionality is behaving properly on my system? Something like the sketch below, perhaps.
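A minimal sketch of what I have in mind, assuming the AsyncioParentComm/ChildComm API that artiq.master.worker itself uses (the file name and the trivial child code are my own invention): the parent spawns a child process over the pipe, writes a line, drains, and reads an echo back. On Windows this needs the proactor event loop, which asyncio uses by default since Python 3.8.

# test_pipe_ipc.py -- hedged sketch of a minimal sipyco.pipe_ipc round trip,
# mirroring how the ARTIQ master talks to its worker (AsyncioParentComm on
# the parent side, ChildComm in the spawned child).
import asyncio
import sys

from sipyco import pipe_ipc

# Child process: connect back on the pipe address passed as argv[1],
# echo one line, then exit.
CHILD_CODE = """
import sys
from sipyco import pipe_ipc
ipc = pipe_ipc.ChildComm(sys.argv[1])
line = ipc.readline()
ipc.write(b"echo: " + line)
ipc.close()
"""

async def main():
    ipc = pipe_ipc.AsyncioParentComm()
    # The master spawns its worker the same way: the pipe address is passed
    # to the child as a command-line argument.
    await ipc.create_subprocess(sys.executable, "-c", CHILD_CODE,
                                ipc.get_address())
    ipc.write(b"ping over the pipe\n")
    await ipc.drain()    # the BrokenPipeError in the traceback above is raised from a drain like this
    print(await ipc.readline())
    ipc.close()

if __name__ == "__main__":
    asyncio.run(main())

If this round trip already fails intermittently outside of ARTIQ, that would point at the environment (event loop, antivirus, pipe handling) rather than at the master/worker code itself.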
mpr2 basically just created a fresh conda environment to try out ARTIQ 7, so I doubt a wipe and reinstall will help, but I'll try to get to it soon. The behavior is also similar and persistent across two very different ARTIQ distributions and on two separate ARTIQ crates in my lab.
I'm hopeful about debugging this, though. It seems like it has to relate to the Python-based communication going on between the dashboard, the master, and the master's workers, possibly in the pipe_ipc module. Could this relate to the comments I read there about race conditions in the _child_connected method of pipe_ipc.py?
Do you have experiments that take a long time to scan? Could be a problem related to the timeout in Worker.examine.
Another possibility is the worker dying during the scan for some reason (can you get some logging on that perhaps?) and unexpectedly closing the pipe.
We found the broken pipe error to be related to firewall issues, but a new issue with ARTIQ-7 has emerged. When I start ARTIQ, the dashboard loads without errors; it puts out the messages:
root:ARTIQ dashboard version: 7.8123.3038639
root:ARTIQ dashboard connected to moninj_proxy (::1)
However, when I try to run any scripts from the dashboard, they begin running in the schedule and then just keep running indefinitely, without producing any errors. The TTL GUIs are now completely unresponsive to clicking.
I've confirmed that the LED next to the Ethernet port on our Kasli is lit, and I have successfully pinged the crate's IP address, in order to check whether there is some kind of silent connectivity error. I've also power cycled the crate and restarted the ARTIQ software. Any suggestions would be appreciated.
In fact, it was not quite firewall-related, but rather related to our organization's security software, SentinelOne (S1), https://www.sentinelone.com/.
All we can say with certainty is that when this security software was disabled, or when it was enabled but with the user directory whitelisted, the broken pipes went away.
We tried whitelisting the ports that ARTIQ seems to be using, but this didn't have an effect. The directories were the relevant thing. I hypothesize that S1 was briefly write-locking the copies of scripts that ARTIQ places in the tmp directory (which is a subdirectory of the whitelisted user directory) and then feeds to the master when dashboard scripts are submitted. With these files locked, the asynchronous pipes that ARTIQ sets up for transferring the experiment code to the master might have broken. I'm guessing ARTIQ may require write/exec privileges to pass this code along, even though it seems like read would technically be sufficient.
Following the resolution of the broken pipe errors, I've found that the ARTIQ-7 we've downloaded seems to have issues communicating with our crate. As I mentioned earlier, all our GUIs are unresponsive, and scripts submitted in the dashboard run indefinitely without doing anything. This happens even for simple test scripts, like LED.py. Our old version of ARTIQ can interact with the crate, and we can ping the crate's IP address.
I've noticed that the release notes mention a breaking change that may affect communication with the core device (in our case a Kasli). They say that the "target" parameter needs to be changed for our core device in the device_db file, but our current device_db file does not have any such parameter:
"core": {
"type": "local",
"module": "artiq.coredevice.core",
"class": "Core",
"arguments": {"host": core_addr, "ref_period": 1e-09}
I've tried adding a target parameter, to no effect; a sketch of my understanding of the required form follows.
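This is roughly what I take the release notes and the artiq_ddb_template output to mean, though I may be misreading them; the host value below is a placeholder and the target string depends on the Kasli hardware revision:

# Hedged sketch of an ARTIQ-7 device_db core entry with the "target" argument.
# My understanding is "rv32ima" for Kasli v1.0/v1.1 and "rv32g" for later
# hardware revisions; worth double-checking against the artiq_ddb_template
# output for your variant.
core_addr = "192.168.1.75"   # placeholder, use the crate's actual IP address

device_db = {
    "core": {
        "type": "local",
        "module": "artiq.coredevice.core",
        "class": "Core",
        "arguments": {
            "host": core_addr,
            "ref_period": 1e-09,
            "target": "rv32g"
        }
    },
    # ... remaining entries unchanged ...
}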
Would a gateware mismatch issue result in a "mismatch between gateware" error when running? We are not getting any errors; the commands simply run indefinitely without appearing to have any effect on the crate.
Try the usual: artiq_run, check the core device logs, fix the mismatch warning if you have one, get a Wireshark trace of the network activity.
I tried checking the core device logs by running artiq_coremgmt log, but that ended up running indefinitely and I had to quit out of it. I have captured a Wireshark trace (filtered to only look at messages coming to and from our core device) of one of our scripts running, during which there does seem to be communication between the core device and the computer.
I tried to upload the Wireshark file as a .csv file, but it doesn't seem like the M-Labs forum supports files of that format. I can email it to you if necessary.
The format we need is pcap; please send it to helpdesk@
Done, thank you
Were you able to identify the problem from the Wireshark trace? The communication does seem different when the script runs indefinitely in ARTIQ-7 compared to when it runs properly in ARTIQ-5, but I can't untangle the issue.