We have a Master-Satellite setup which was working well for the last couple of months. Since yesterday the DRTIO connection is lost after several minutes.
Master log:

[  2763.642555s]  INFO(runtime::kernel::core1): kernel finished
[  2766.334496s]  INFO(runtime::rtio_mgt::drtio): [LINK#2] link is down
[  2766.340750s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is down

The Master just loses the connection and doesn't try to reconnect.
Satellite log:

...
[     6.435115s]  INFO(satman): TSC loaded from uplink
[     6.452586s]  INFO(satman): rank: 1
[     6.454679s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 3 0; }

The Satellite comes up normally and does not print anything over UART when the disconnect happens. The only thing that helps is rebooting the Satellite Kasli; the Master then starts to re-establish the connection.
The two are connected via the recommended fiber link.
Things we already tried:

  • Reflashed the gateware of both Kaslis
  • Tried a different power supply for both
  • Tried a copper DRTIO cable
  • Tried different SFP ports

Any ideas on what the problem can be? The fact that it appeared suddenly suggests that it could be some hardware failure. Any things to check for?
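To see how long the link stayed up and what was logged just before the drop, the log lines quoted above can be scanned mechanically. The following is only a minimal sketch: the line format is inferred from the excerpts in this post, and the names `parse_log` and `link_down_events` are made up for illustration, not part of ARTIQ.

```python
import re
from typing import NamedTuple

# Matches lines such as:
# [  2766.334496s]  INFO(runtime::rtio_mgt::drtio): [LINK#2] link is down
# The pattern is inferred from the log excerpts in this thread.
LOG_LINE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)s\]\s+(?P<level>[A-Z]+)"
    r"\((?P<module>[^)]+)\):\s*(?P<message>.*)"
)


class LogEntry(NamedTuple):
    time: float      # seconds, as printed in the log
    level: str       # e.g. INFO, WARN, ERROR
    module: str      # e.g. runtime::rtio_mgt::drtio
    message: str


def parse_log(text):
    """Return a LogEntry for every line that matches the log format."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            entries.append(LogEntry(
                float(match["time"]),
                match["level"],
                match["module"],
                match["message"].strip(),
            ))
    return entries


def link_down_events(entries):
    """Select entries reporting a link, uplink or destination going down."""
    return [e for e in entries if "is down" in e.message]


# Master log excerpt copied from this post.
master_log = """\
[  2763.642555s]  INFO(runtime::kernel::core1): kernel finished
[  2766.334496s]  INFO(runtime::rtio_mgt::drtio): [LINK#2] link is down
[  2766.340750s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is down
"""

for event in link_down_events(parse_log(master_log)):
    print(f"{event.time:.3f}s {event.module}: {event.message}")
```

Running this over a capture of the whole UART session, rather than the three-line excerpt, would show whether the drop always follows the same log entry or happens at arbitrary times.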

What would be safe limits? Can I log that somehow without restarting?

artiq_flash outputs the temperatures. I don't know off the top of my head whether you can make it do nothing else. It's a wrapper around openocd and prints the openocd script with -n. Use that (xadc_report xc7.tap is the relevant command).

Also as explained in the instructions, always ensure free air circulation through (from below to above) each crate and/or add forced air cooling.

Yep, that was a very good call: the fan is broken and it is indeed overheating.
Is there anything we have to pay attention to when replacing it?

Thanks for the quick help!
We replaced it with a 30 mm fan we had lying around. It works fine again and is running at around 65°C. I would like to monitor the temperature now to make sure it stays happy. When using artiq_flash I get the temperature reading only when I restart the Kasli (with artiq_flash start). Do you have the openocd commands somewhere to only get the status without interfering with the running system?
Is it also possible to read the temperature with the artiq_coremgmt command?

As mentioned above, run artiq_flash with -n and remove the parts that you don't want.
coremgmt doesn't support reading the XADC.
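Once a trimmed openocd script that only runs xadc_report is available, its output can be checked periodically. The sketch below rests on several assumptions that are not confirmed in this thread: that the report contains a line of the form `TEMP <value> C`, that the sample value is representative, and that the caller supplies both the command and the limit. The function names are hypothetical.

```python
import re
import subprocess

# ASSUMPTION: the report contains a line such as "TEMP 65.12 C".
# Adjust the pattern to whatever your openocd version actually prints.
TEMP_LINE = re.compile(r"^\s*TEMP\s+(-?\d+(?:\.\d+)?)\s*C\b", re.MULTILINE)


def parse_temperature(report):
    """Return the temperature in degrees C from report text, or None."""
    match = TEMP_LINE.search(report)
    return float(match.group(1)) if match else None


def is_too_hot(report, limit_c):
    """Compare the reported temperature against a caller-chosen limit."""
    temperature = parse_temperature(report)
    if temperature is None:
        raise ValueError("no TEMP line found in report")
    return temperature > limit_c


def run_report(command, timeout_s=30.0):
    """Run a command (a list of arguments) and return everything it printed.

    The command is whatever trimmed openocd invocation you derived from
    the artiq_flash -n output; it is not defined here.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_s
    )
    # Collect both streams, since the report may arrive on either one.
    return result.stdout + result.stderr


# Hypothetical sample text, used only to exercise the parser.
sample_report = "TEMP 65.12 C\nVCCINT 1.001 V\n"

# 80.0 is a placeholder, not a recommended limit.
print(parse_temperature(sample_report), is_too_hot(sample_report, 80.0))
```

The limit is deliberately left to the caller because no safe limit is stated in this thread; take it from the FPGA datasheet or from your own baseline readings.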

Seems like I skimmed over that part. Thanks, it works now!

I was a bit too quick to call this off. Sadly it started disconnecting again. However, this time the link is the only thing going down. The symptoms are exactly the same, except that on the satellite Kasli's UART we can now see that it notices the lost connection (including the clock) and falls back to its internal clock. We need to restart both Kaslis to regain the connection.
Can there be any damage from the overheating? Also, the new fan is pulling a bit more current (0.07 A instead of 0.05 A). Can this cause an issue that brings the SFP ports down?
I also wanted to read out the temperature of the Kasli-SoC, but the artiq_flash command does not seem to work with it. Is there any way to read out its temperature?

    4 days later

    choelzl Also, the new fan is pulling a bit more current (0.07 A instead of 0.05 A). Can this cause an issue that brings the SFP ports down?

    No.

    Since I reflashed the gateware on both of them we have not had any more crashes. However, I still have not found a way to read out the temperature of the Kasli-SoC.

    7 days later

    The issue came up again. The Master (Kasli-SoC) randomly loses connection after maybe a minute. The fan is working. The Kasli-SoC just becomes unresponsive, loses its network connection (can't be pinged) and the outputs are turned off. The log files are not very informative.
    Master (the one which dies):

    
                         __________   __
                        / ___/__  /  / /
                        \__ \  / /  / /
                       ___/ / / /__/ /___
                      /____/ /____/_____/
    
                     (C) 2020-2022 M-Labs
    
    [     0.019995s]  INFO(szl): Simple Zynq Loader starting...
    [     0.025210s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 2000000000 Hz
    [     0.007042s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz
    [     0.016262s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1066666666 Hz
    [     0.023612s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 533333328/355555552
    [     0.030780s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10062892 Hz (divisors=2*53)
    [     0.042000s] DEBUG(libboard_zynq::sdio): Reset SDIO!
    [     0.046942s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 400000
    [     0.053890s]  INFO(szl): Card inserted. Mounting file system.
    [     0.073372s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 25000000
    [     0.080507s] DEBUG(libboard_zynq::sdio::sd_card): Getting bus width
    [     0.086880s] DEBUG(libboard_zynq::sdio::sd_card): 4 bit support
    [     0.092781s] DEBUG(libboard_zynq::sdio::sd_card): Changing bus width
    [     0.101001s] DEBUG(libboard_zynq::sdio): Set block size to 512
    [     0.107570s] DEBUG(libconfig::sd_reader): Partition ID: C
    [     0.113957s]  INFO(szl): Loading gateware
    [     0.118317s] DEBUG(libconfig::bootgen): Partition header pointer = C80
    [     0.125279s] DEBUG(libconfig::bootgen): Unencrypted length = B9545
    [     0.131441s] DEBUG(libconfig::bootgen): Partition start address: B5D0
    [     0.498674s] DEBUG(libboard_zynq::devc): Invalidate DCache for bitstream buffer
    [     0.512039s] DEBUG(libboard_zynq::devc): Init preload FPGA
    [     0.517424s] DEBUG(libboard_zynq::devc): Toggling PROG_B
    [     0.546639s] DEBUG(libboard_zynq::devc): Waiting for done
    [     0.552015s] DEBUG(libboard_zynq::devc): Init postload FPGA
    [     0.557569s]  INFO(szl): Loading runtime
    [     0.561924s] DEBUG(libconfig::bootgen): Partition header pointer = C80
    [     0.568800s] DEBUG(libconfig::bootgen): Unencrypted length = B00C
    [     0.574878s] DEBUG(libconfig::bootgen): Unencrypted length = 34B24
    [     0.580961s] DEBUG(libconfig::bootgen): Partition start address: C4B20
    [     0.693097s]  INFO(szl): Preparing for runtime execution
    [     0.698735s]  INFO(szl): executing payload
    [     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
    [     0.005237s]  INFO(runtime): gateware ident: stuttgart3
    [     0.015633s]  INFO(libboard_zynq::i2c): PCA9548 detected
    [     0.235501s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
    [     0.626292s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
    [     3.272125s]  INFO(libboard_artiq::si5324):   ...locked
    [     3.297994s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
    [     3.309962s]  INFO(libboard_zynq::i2c): PCA9548 detected
    [     3.345438s]  INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-65-a7 IPv4=192.168.50.34 IPv6-LL=fe80::eaeb:1bff:fe13:65a7 IPv6: no configured address
    [     3.364081s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; }
    [     3.374270s]  WARN(runtime::rtio_mgt): error reading device map (Configuration key `device_map` not found), device names will not be available in RTIO error messages
    [     3.391702s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
    [     3.598995s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
    [     7.391088s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
    [     9.806982s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 31 packets
    [     9.894902s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
    [     9.902920s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
    [     9.909602s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128

    Satellite (seems to run fine, gets connected again if the master is restarted):

    
     __  __ _ ____         ____ 
    |  \/  (_) ___|  ___  / ___|
    | |\/| | \___ \ / _ \| |    
    | |  | | |___) | (_) | |___ 
    |_|  |_|_|____/ \___/ \____|
    
    MiSoC Bootloader
    Copyright (c) 2017-2023 M-Labs Limited
    
    Bootloader CRC passed
    Gateware ident 8.0.beta;stuttgart
    Initializing SDRAM...
    Read leveling scan:
    Module 1:
    00000011111111110000000000000000
    Module 0:
    00000011111111110000000000000000
    Read leveling: 10+-5 10+-5 done
    SDRAM initialized
    Memory test passed
    
    Booting from flash...
    Starting firmware.
    [     0.000009s]  INFO(satman): ARTIQ satellite manager starting...
    [     0.005876s]  INFO(satman): software ident 8.0.beta;stuttgart
    [     0.011613s]  INFO(satman): gateware ident 8.0.beta;stuttgart
    [     0.148598s]  INFO(satman): Clocking has already been set up.
    [    23.852252s]  INFO(satman): uplink is up, switching to recovered clock
    [    23.885177s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
    [    25.621044s]  INFO(board_artiq::si5324):   ...locked
    [    29.392139s]  INFO(board_artiq::si5324::siphaser): calibration successful, lead: 280, width: 432 (347deg)
    [    29.902089s]  INFO(satman): TSC loaded from uplink
    [    29.973207s]  INFO(satman): rank: 1
    [    29.975302s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; }
    [    35.829363s]  INFO(satman): uplink is down, switching to local oscillator clock
    [    35.863071s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
    [    37.679375s]  INFO(board_artiq::si5324):   ...locked
    [    37.682947s] ERROR(satman): received packet of an unknown type

    Again we tried the usual (reflashing the gateware, replacing the DRTIO cable, ...) but to no avail.
    What I found is that if it is not connected via DRTIO, this hang does not seem to happen.
    Any ideas?

    Update: I searched for a way to get more info out of the Kasli and used the local_run.sh script in https://git.m-labs.hk/M-Labs/artiq-zynq/src/branch/master.
    I get the following output:

    ./local_run.sh 
    Open On-Chip Debugger 0.11.0
    Licensed under GNU GPL v2
    For bug reports, read
            http://openocd.org/doc/doxygen/bugs.html
    Zynq CPU1.
    Info : clock speed 1000 kHz
    Info : JTAG tap: zynq.tap tap/device found: 0x1372c093 (mfg: 0x049 (Xilinx), part: 0x372c, ver: 0x1)
    Info : JTAG tap: zynq.dap tap/device found: 0x4ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x4)
    Info : zynq.cpu.0: hardware has 6 breakpoints, 4 watchpoints
    Info : zynq.cpu.1: hardware has 6 breakpoints, 4 watchpoints
    Info : starting gdb server for zynq.cpu.0 on 3333
    Info : Listening on port 3333 for gdb connections
    Info : JTAG tap: zynq.tap tap/device found: 0x1372c093 (mfg: 0x049 (Xilinx), part: 0x372c, ver: 0x1)
    Info : JTAG tap: zynq.dap tap/device found: 0x4ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x4)
    Warn : zynq.cpu.0: ran after reset and before halt ...
    Warn : zynq.cpu.1: ran after reset and before halt ...
    Error: timed out while waiting for target halted
    TARGET: zynq.cpu.0 - Not halted

    Afterwards the Kasli is unresponsive again. I do not get any additional log output, even if the log level is set to TRACE.

      5 months later

      choelzl Did you resolve this? Might have a similar problem :/

      25 days later

      choelzl Afterwards the kasli is unresponsive again.

      This looks like Kasli-SoC and not Kasli.

      You need to pulse POR with the Python script in zynq-rs before using JTAG, otherwise this and other problems occur. This also requires the corresponding jumpers to be installed on Kasli-SoC.