The issue came up again. The master (Kasli-SoC) randomly loses its connection after about a minute. The fan is working. The Kasli-SoC simply becomes unresponsive, loses its network connection (it can't be pinged), and its outputs are turned off. The log files are not very informative.
Master (the one that dies):
__________ __
/ ___/__ / / /
\__ \ / / / /
___/ / / /__/ /___
/____/ /____/_____/
(C) 2020-2022 M-Labs
[ 0.019995s] INFO(szl): Simple Zynq Loader starting...
[ 0.025210s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 2000000000 Hz
[ 0.007042s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz
[ 0.016262s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1066666666 Hz
[ 0.023612s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 533333328/355555552
[ 0.030780s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10062892 Hz (divisors=2*53)
[ 0.042000s] DEBUG(libboard_zynq::sdio): Reset SDIO!
[ 0.046942s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 400000
[ 0.053890s] INFO(szl): Card inserted. Mounting file system.
[ 0.073372s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 25000000
[ 0.080507s] DEBUG(libboard_zynq::sdio::sd_card): Getting bus width
[ 0.086880s] DEBUG(libboard_zynq::sdio::sd_card): 4 bit support
[ 0.092781s] DEBUG(libboard_zynq::sdio::sd_card): Changing bus width
[ 0.101001s] DEBUG(libboard_zynq::sdio): Set block size to 512
[ 0.107570s] DEBUG(libconfig::sd_reader): Partition ID: C
[ 0.113957s] INFO(szl): Loading gateware
[ 0.118317s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[ 0.125279s] DEBUG(libconfig::bootgen): Unencrypted length = B9545
[ 0.131441s] DEBUG(libconfig::bootgen): Partition start address: B5D0
[ 0.498674s] DEBUG(libboard_zynq::devc): Invalidate DCache for bitstream buffer
[ 0.512039s] DEBUG(libboard_zynq::devc): Init preload FPGA
[ 0.517424s] DEBUG(libboard_zynq::devc): Toggling PROG_B
[ 0.546639s] DEBUG(libboard_zynq::devc): Waiting for done
[ 0.552015s] DEBUG(libboard_zynq::devc): Init postload FPGA
[ 0.557569s] INFO(szl): Loading runtime
[ 0.561924s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[ 0.568800s] DEBUG(libconfig::bootgen): Unencrypted length = B00C
[ 0.574878s] DEBUG(libconfig::bootgen): Unencrypted length = 34B24
[ 0.580961s] DEBUG(libconfig::bootgen): Partition start address: C4B20
[ 0.693097s] INFO(szl): Preparing for runtime execution
[ 0.698735s] INFO(szl): executing payload
[ 0.000067s] INFO(runtime): NAR3/Zynq7000 starting...
[ 0.005237s] INFO(runtime): gateware ident: stuttgart3
[ 0.015633s] INFO(libboard_zynq::i2c): PCA9548 detected
[ 0.235501s] INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[ 0.626292s] INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[ 3.272125s] INFO(libboard_artiq::si5324): ...locked
[ 3.297994s] INFO(runtime::rtio_clocking): SYS CLK switched successfully
[ 3.309962s] INFO(libboard_zynq::i2c): PCA9548 detected
[ 3.345438s] INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-65-a7 IPv4=192.168.50.34 IPv6-LL=fe80::eaeb:1bff:fe13:65a7 IPv6: no configured address
[ 3.364081s] INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; }
[ 3.374270s] WARN(runtime::rtio_mgt): error reading device map (Configuration key `device_map` not found), device names will not be available in RTIO error messages
[ 3.391702s] INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[ 3.598995s] INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[ 7.391088s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[ 9.806982s] INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 31 packets
[ 9.894902s] INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[ 9.902920s] INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[ 9.909602s] INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
Satellite (seems to run fine, and reconnects if the master is restarted):
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017-2023 M-Labs Limited
Bootloader CRC passed
Gateware ident 8.0.beta;stuttgart
Initializing SDRAM...
Read leveling scan:
Module 1:
00000011111111110000000000000000
Module 0:
00000011111111110000000000000000
Read leveling: 10+-5 10+-5 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000009s] INFO(satman): ARTIQ satellite manager starting...
[ 0.005876s] INFO(satman): software ident 8.0.beta;stuttgart
[ 0.011613s] INFO(satman): gateware ident 8.0.beta;stuttgart
[ 0.148598s] INFO(satman): Clocking has already been set up.
[ 23.852252s] INFO(satman): uplink is up, switching to recovered clock
[ 23.885177s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 25.621044s] INFO(board_artiq::si5324): ...locked
[ 29.392139s] INFO(board_artiq::si5324::siphaser): calibration successful, lead: 280, width: 432 (347deg)
[ 29.902089s] INFO(satman): TSC loaded from uplink
[ 29.973207s] INFO(satman): rank: 1
[ 29.975302s] INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; }
[ 35.829363s] INFO(satman): uplink is down, switching to local oscillator clock
[ 35.863071s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 37.679375s] INFO(board_artiq::si5324): ...locked
[ 37.682947s] ERROR(satman): received packet of an unknown type
Again, we tried the usual steps (reflashing the gateware, replacing the DRTIO cable, ...), but to no avail.
What I did find is that the hang does not seem to happen when the master is not connected via DRTIO.
Any ideas?