Also, as explained in the instructions, always ensure free air circulation through each crate (from bottom to top) and/or add forced air cooling.
DRTIO disconnects
Yep, that was a very good call: the fan is broken and it is indeed overheating...
Anything we have to pay attention to when exchanging it?
This one tends to work fine and is usually very quiet: https://www.digikey.de/en/products/detail/sunon-fans/MF25101V1-1000U-G99/11312807 But as a quick hack it needs epoxy gluing.
Cf. also https://github.com/sinara-hw/Kasli/issues/102
Thanks for the quick help!
We replaced it with a 30 mm fan we had lying around. It works fine again and is running at around 65°C. I would like to monitor the temperature now to make sure it stays happy. When using artiq_flash I only get the temperature reading when I restart the Kasli (with artiq_flash start). Do you have the openocd commands somewhere to only get the status without interfering with the running system?
Is it also possible to read the temperature with the artiq_coremgmt command?
As mentioned above, run artiq_flash with -n and remove the parts that you don't want.
coremgmt doesn't support reading the xadc.
Seems like I skimmed over that part. Thanks, it works now!
I was a bit too quick to call this off. Sadly, it started disconnecting again. However, this time the link is the only thing going down. The error symptoms are exactly the same, except that on the satellite Kasli UART we can now see that it notices the lost connection (including the clock) and falls back to its internal clock. We need to restart both Kaslis to regain the connection.
Can there be any damage from the overheating? Also, the new fan is pulling a bit more current (0.07 A instead of 0.05 A). Can this cause an issue that brings the SFP ports down?
I also wanted to read out the temperature of the Kasli-SoC, but the artiq_flash command does not seem to work with it. Is there any way to read out its temperature?
Since I reflashed the gateware on both of them we have not had any more crashes. However, I still have not found a way to read out the temperature of the Kasli-SoC.
The issue came up again. The master (Kasli-SoC) randomly loses connection after maybe a minute. The fan is working. The Kasli just becomes unresponsive, loses its network connection (it can't be pinged) and the outputs are turned off. The log files are not very informative.
Master (the one which dies):
__________ __
/ ___/__ / / /
\__ \ / / / /
___/ / / /__/ /___
/____/ /____/_____/
(C) 2020-2022 M-Labs
[ 0.019995s] INFO(szl): Simple Zynq Loader starting...
[ 0.025210s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 2000000000 Hz
[ 0.007042s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz
[ 0.016262s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1066666666 Hz
[ 0.023612s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 533333328/355555552
[ 0.030780s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10062892 Hz (divisors=2*53)
[ 0.042000s] DEBUG(libboard_zynq::sdio): Reset SDIO!
[ 0.046942s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 400000
[ 0.053890s] INFO(szl): Card inserted. Mounting file system.
[ 0.073372s] DEBUG(libboard_zynq::sdio): Changing clock frequency to 25000000
[ 0.080507s] DEBUG(libboard_zynq::sdio::sd_card): Getting bus width
[ 0.086880s] DEBUG(libboard_zynq::sdio::sd_card): 4 bit support
[ 0.092781s] DEBUG(libboard_zynq::sdio::sd_card): Changing bus width
[ 0.101001s] DEBUG(libboard_zynq::sdio): Set block size to 512
[ 0.107570s] DEBUG(libconfig::sd_reader): Partition ID: C
[ 0.113957s] INFO(szl): Loading gateware
[ 0.118317s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[ 0.125279s] DEBUG(libconfig::bootgen): Unencrypted length = B9545
[ 0.131441s] DEBUG(libconfig::bootgen): Partition start address: B5D0
[ 0.498674s] DEBUG(libboard_zynq::devc): Invalidate DCache for bitstream buffer
[ 0.512039s] DEBUG(libboard_zynq::devc): Init preload FPGA
[ 0.517424s] DEBUG(libboard_zynq::devc): Toggling PROG_B
[ 0.546639s] DEBUG(libboard_zynq::devc): Waiting for done
[ 0.552015s] DEBUG(libboard_zynq::devc): Init postload FPGA
[ 0.557569s] INFO(szl): Loading runtime
[ 0.561924s] DEBUG(libconfig::bootgen): Partition header pointer = C80
[ 0.568800s] DEBUG(libconfig::bootgen): Unencrypted length = B00C
[ 0.574878s] DEBUG(libconfig::bootgen): Unencrypted length = 34B24
[ 0.580961s] DEBUG(libconfig::bootgen): Partition start address: C4B20
[ 0.693097s] INFO(szl): Preparing for runtime execution
[ 0.698735s] INFO(szl): executing payload
[ 0.000067s] INFO(runtime): NAR3/Zynq7000 starting...
[ 0.005237s] INFO(runtime): gateware ident: stuttgart3
[ 0.015633s] INFO(libboard_zynq::i2c): PCA9548 detected
[ 0.235501s] INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[ 0.626292s] INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[ 3.272125s] INFO(libboard_artiq::si5324): ...locked
[ 3.297994s] INFO(runtime::rtio_clocking): SYS CLK switched successfully
[ 3.309962s] INFO(libboard_zynq::i2c): PCA9548 detected
[ 3.345438s] INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-65-a7 IPv4=192.168.50.34 IPv6-LL=fe80::eaeb:1bff:fe13:65a7 IPv6: no configured address
[ 3.364081s] INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; }
[ 3.374270s] WARN(runtime::rtio_mgt): error reading device map (Configuration key `device_map` not found), device names will not be available in RTIO error messages
[ 3.391702s] INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[ 3.598995s] INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[ 7.391088s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[ 9.806982s] INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 31 packets
[ 9.894902s] INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[ 9.902920s] INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[ 9.909602s] INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
Satellite (seems to run fine, gets connected again if the master is restarted):
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017-2023 M-Labs Limited
Bootloader CRC passed
Gateware ident 8.0.beta;stuttgart
Initializing SDRAM...
Read leveling scan:
Module 1:
00000011111111110000000000000000
Module 0:
00000011111111110000000000000000
Read leveling: 10+-5 10+-5 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000009s] INFO(satman): ARTIQ satellite manager starting...
[ 0.005876s] INFO(satman): software ident 8.0.beta;stuttgart
[ 0.011613s] INFO(satman): gateware ident 8.0.beta;stuttgart
[ 0.148598s] INFO(satman): Clocking has already been set up.
[ 23.852252s] INFO(satman): uplink is up, switching to recovered clock
[ 23.885177s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 25.621044s] INFO(board_artiq::si5324): ...locked
[ 29.392139s] INFO(board_artiq::si5324::siphaser): calibration successful, lead: 280, width: 432 (347deg)
[ 29.902089s] INFO(satman): TSC loaded from uplink
[ 29.973207s] INFO(satman): rank: 1
[ 29.975302s] INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; }
[ 35.829363s] INFO(satman): uplink is down, switching to local oscillator clock
[ 35.863071s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 37.679375s] INFO(board_artiq::si5324): ...locked
[ 37.682947s] ERROR(satman): received packet of an unknown type
Again we tried the usual (reflashing the gateware, replacing the DRTIO cable, ...) but to no avail.
What I found is that the hang does not seem to happen when the master is not connected via DRTIO.
Any ideas?
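To put numbers on how long the link survives each time, a small script over a saved UART capture can help. This is only a minimal sketch: it assumes the log lines keep the `[ <seconds>s] LEVEL(module): message` shape shown above and that the satellite reports "uplink is up" / "uplink is down" from the `satman` module. The names `LOG_LINE`, `parse`, `uplink_intervals` and `SAMPLE` are made up for illustration and are not part of ARTIQ.

```python
import re

# Matches firmware log lines such as
# "[ 23.852252s] INFO(satman): uplink is up, switching to recovered clock"
LOG_LINE = re.compile(r"^\[\s*(\d+\.\d+)s\]\s+(\w+)\(([^)]*)\):\s+(.*)$")

# Three lines copied from the satellite log above.
SAMPLE = [
    "[ 23.852252s] INFO(satman): uplink is up, switching to recovered clock",
    "[ 35.829363s] INFO(satman): uplink is down, switching to local oscillator clock",
    "[ 37.682947s] ERROR(satman): received packet of an unknown type",
]


def parse(lines):
    """Yield (timestamp, level, module, message) for every line that matches."""
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            yield float(m.group(1)), m.group(2), m.group(3), m.group(4)


def uplink_intervals(lines):
    """Return (up_time, down_time, duration) for each up/down pair.

    Assumes timestamps come from a single boot; they restart from zero
    after a reset, so split the capture per boot before calling this.
    """
    intervals = []
    up_since = None
    for t, _level, module, msg in parse(lines):
        if module != "satman":
            continue
        if msg.startswith("uplink is up"):
            up_since = t
        elif msg.startswith("uplink is down") and up_since is not None:
            intervals.append((up_since, t, t - up_since))
            up_since = None
    return intervals


if __name__ == "__main__":
    for up, down, duration in uplink_intervals(SAMPLE):
        print(f"uplink up at {up:.3f}s, down at {down:.3f}s, lasted {duration:.3f}s")
```

For the satellite log above this reports a single interval of roughly 12 s between "uplink is up" and "uplink is down". Running it over a longer capture would show whether the time to failure is constant or drifts, for example with temperature.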
Update: I searched for a way to get more info out of the Kasli and used the local_run.sh script in https://git.m-labs.hk/M-Labs/artiq-zynq/src/branch/master.
I get the following output:
./local_run.sh
Open On-Chip Debugger 0.11.0
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Zynq CPU1.
Info : clock speed 1000 kHz
Info : JTAG tap: zynq.tap tap/device found: 0x1372c093 (mfg: 0x049 (Xilinx), part: 0x372c, ver: 0x1)
Info : JTAG tap: zynq.dap tap/device found: 0x4ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x4)
Info : zynq.cpu.0: hardware has 6 breakpoints, 4 watchpoints
Info : zynq.cpu.1: hardware has 6 breakpoints, 4 watchpoints
Info : starting gdb server for zynq.cpu.0 on 3333
Info : Listening on port 3333 for gdb connections
Info : JTAG tap: zynq.tap tap/device found: 0x1372c093 (mfg: 0x049 (Xilinx), part: 0x372c, ver: 0x1)
Info : JTAG tap: zynq.dap tap/device found: 0x4ba00477 (mfg: 0x23b (ARM Ltd), part: 0xba00, ver: 0x4)
Warn : zynq.cpu.0: ran after reset and before halt ...
Warn : zynq.cpu.1: ran after reset and before halt ...
Error: timed out while waiting for target halted
TARGET: zynq.cpu.0 - Not halted
Afterwards the Kasli is unresponsive again. I do not get any additional log output, even if the log level is set to TRACE.