I had issues with a Kasli 2.0 (configured as a satellite) crashing randomly after a few days of operation. The master (a Kasli 1) suddenly lost the link to the satellite, which caused "aux_packet" errors in the dashboard, and re-established it after about 10 minutes. After some investigation, I found that the satellite Kasli runs much hotter (>120 °C) than the master Kasli (~75 °C). I read out the temperature via artiq_flash, but I modified the scripts it calls so that the respective Kasli is not restarted (verified via the serial monitor).

This is the temperature graph of the satellite over a couple of hours. The temperature is high in general, and there are both spikes and drops. Red circles indicate an initialization of all devices on the satellite, and blue circles indicate a crash of the satellite.

There is one full rack unit of free space above and below the ARTIQ crate, and the back is completely open. By adding a relatively large fan on top, I was able to reduce the temperature enough that no more crashes happen, but it is still above 100 °C.

How much free space is expected around a crate with a Kasli 2 so that the temperature stays around 70 °C? Why is the Kasli 2 so much hotter?

2 months later

Hi steine, I'm also interested in monitoring the temperature of my Kasli 2.0 (one of ours suffered a catastrophic failure). May I ask how you read out the temperature using artiq_flash without restarting the Kasli?

5 months later

@mdklee Did you ever figure this out? I'm also hoping to monitor the Kasli temperature without restarts.

You can do this by just running the xadc_report OpenOCD command, as artiq_flash also does internally. We have a script that logs these temperatures to InfluxDB here:

https://github.com/OxfordIonTrapGroup/oxart-devices/blob/master/oxart/frontend/log_kasli_health.py

On Linux, you can combine this with systemd timers to easily get periodic logging: create a service file that runs the above Python script with the proper arguments for your case, then create a .timer unit with the same name, containing for example:

$ cat log-kasli-health.timer
[Unit]
Description=Runs log_kasli_health every five minutes

[Timer]
OnBootSec=1m
OnUnitActiveSec=5m

[Install]
WantedBy=timers.target
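
For completeness, the matching service file mentioned above could look roughly like the following sketch. The interpreter path, the script location, and the argument placeholder are assumptions for illustration and are not taken from the linked repository; check the script's own help output for the arguments it actually expects.

```ini
# log-kasli-health.service (hypothetical example; adjust paths and arguments)
[Unit]
Description=Log Kasli health to InfluxDB

[Service]
# The script runs once and exits, so oneshot fits; the timer re-triggers it.
Type=oneshot
# Placeholder path and arguments: replace with your checkout location and
# the options required for your Kasli and InfluxDB setup.
ExecStart=/usr/bin/python3 /path/to/oxart-devices/oxart/frontend/log_kasli_health.py <your arguments>
```

Because the timer and the service share the same base name, the timer activates this service by default. Only the timer needs to be enabled, e.g. with `systemctl enable --now log-kasli-health.timer`.
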

7 months later

Hi there, is there a way to do this on a Kasli-SoC? Since most of the core management (coremgmt) works through different functions in artiq-zynq, I couldn't find a way from the documentation alone.