I’ve recently installed Gentoo on my new Framework Desktop, and for a while the network would hang, causing any program that touched the hardware to also crap out. Update same day: I’ve just had the problem recur! Agh!
Here are some symptoms:
- pkill -9: just dead in the water. (I hadn’t solved switching consoles with Alt-Function keys yet, so fortunately I had tmux running.)
- emerge runs hang when they get to network steps.
- Various ip & ifconfig style commands hang immediately.
- nmcli and ping don’t hang, but don’t work.
- /etc/resolv.conf isn’t being touched.
- The NetworkManager daemon is running, but:
  - top shows “D” (d-sleep) and ps shows “Dsl”
  - rc-status reports NetworkManager as active, but nmcli g says it’s down
  - rc-service NetworkManager restart fails with
    * Caching service dependencies ... [ ok ]
    * Unmounting network filesystems ... [ ok ]
    * Stopping NetworkManager ...
    * start-stop-daemon: 1 processes refused to stop [ !! ]
    * ERROR: NetworkManager failed to stop
    * Mounting network filesystems ... [ ok ]
  - s/restart/stop, and pkill -9 NetworkManager has no effect!
  - zap worked at the time, but it turns out not to help
- wpa_supplicant is also in “D” status! And owned by init. Hm.
- grep -R wpa /var/log reminded me I needed to create /etc/wpa_supplicant/wpa_supplicant.conf. (I’ve done so since.)
I checked my hardware with lspci, which I’d fortunately installed during the main system installation process, and with lspci -k I found out my network card is
c0:00.0 Mediatek MT7925 (RZ717) Wi-Fi 7 160MHz [14c3:0717]
with kernel module (in use) mt7925e.
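The “D” state that top and ps report above is uninterruptible sleep: the process is blocked inside the kernel, which is why even -9 has no effect until that call returns. Here is a minimal sketch for listing every process in that state, assuming a procps-style ps; the helper name d_state is made up for illustration.

```shell
# Keep only rows whose STAT column starts with "D" (uninterruptible sleep).
# Reads "PID STAT COMMAND" rows on stdin; d_state is a hypothetical helper name.
d_state() {
    awk '$2 ~ /^D/ { print $1, $2, $3 }'
}

# Live usage: the trailing "=" on each column suppresses the ps header line.
ps -eo pid=,stat=,comm= | d_state
```

If NetworkManager and wpa_supplicant both show up here at the same time, that points at something they share (the driver) rather than at either daemon.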
First, I tried turning off WiFi power saving as recommended by Garuda folks:
printf '%s\n' '[connection]' 'wifi.powersave = 0' | doas tee /etc/NetworkManager/conf.d/default-wifi-powersave-on.conf
That didn’t help: a little while into my next reboot, the hang happened again. Then I tried disabling the “Active State Power Management” for my network card, which is apparently cursed with this problem:
echo 'options mt7925e disable_aspm=1' | doas tee /etc/modprobe.d/7925e_wifi.conf
That seems to have done the trick, so I’ll delete the NetworkManager settings.
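Options in /etc/modprobe.d only apply when the module is loaded, so after a reboot it can be worth reading the parameter back from sysfs, where boolean module parameters show up as Y or N. A minimal sketch; the helper name module_param and its optional third argument (an alternative sysfs root, useful only for testing) are assumptions of this example.

```shell
# Print the current value of a kernel module parameter from sysfs.
# $1 = module, $2 = parameter, $3 = sysfs root (hypothetical knob, defaults to /sys).
module_param() {
    file="${3:-/sys}/module/$1/parameters/$2"
    if [ -r "$file" ]; then
        cat "$file"
    else
        # Module not loaded, or it exposes no such parameter.
        echo "unavailable"
    fi
}

module_param mt7925e disable_aspm
```

A result of Y would mean the option from the modprobe.d file was picked up; "unavailable" would suggest the module isn't loaded under that name.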
Since I’ve now seen this recur, I’ve checked a few more things (added wpa_supplicant details above). I’ve also added syslog to the USE flags for NetworkManager to hopefully capture more information.
I don’t have kernel logs for the problematic scenario, but now that my system is running normally I can eliminate a few things from the problem space.
In both working and non-working configurations, I saw repeated logs in dmesg for the network:
disconnect from AP <MAC> for new auth to <other MAC>
authenticate with <other MAC> (local address=<my MAC>)
send auth to <other MAC> (try 1/3)
authenticate with <other MAC> (local address=<my MAC>)
send auth to <2nd other MAC> (try 1/3)
authenticated
associate with <2nd other MAC> (try 1/3)
RX AssocResp from <2nd other MAC> (capab=0x1511 status=0 aid=3)
associated
Limiting TX power to 30 (30 - 0) dBm as advertised by <2nd other MAC>
disassociated from <2nd other MAC> (Reason: 1=UNSPECIFIED)
Often these occur every 5 minutes and loop for a while, then die away.
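To put a number on “every 5 minutes”, the gaps between consecutive disconnect lines can be computed straight from the timestamps. This is a rough sketch that assumes the default dmesg timestamp format (seconds since boot in square brackets, so not the -H output); roam_gaps is a hypothetical helper name.

```shell
# Print the gap, in whole seconds, between consecutive roaming events.
# Assumes default dmesg timestamps like "[ 1234.567890] ..." on stdin.
roam_gaps() {
    grep 'disconnect from AP' |
    sed 's/^\[ *\([0-9]*\)\.[0-9]*].*/\1/' |
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

dmesg | roam_gaps
```

A steady run of values near 300 would confirm the five-minute rhythm; irregular values would suggest something other than a timer is triggering the roams.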
I saved some logs with
{ echo count from to;
grep 'disconnect from AP' /var/log/messages |
grep -o '\([[:xdigit:]][[:xdigit:]]:\?\)\{6\}' | paste - - |
sort | uniq -c; } | column -t
which gave for example
count from to
3 <prefix>:2f:a9:51 <prefix>:2f:a9:59
22 <prefix>:2f:a9:51 <prefix>:d2:14:29
1 <prefix>:d2:14:21 <prefix>:2f:a9:59
2 <prefix>:d2:14:21 <prefix>:d2:14:29
As far as I can tell, the (masked) <prefix> there matches the output from another connected device when checking the router with arp -a:
gr6exx0c-a940.lan (192.168.2.1) at <prefix>:2f:a9:40 on en0 ifscope [ethernet]
And my management app says
<prefix>:2f:a9:41
<prefix>:d2:14:10
So clearly something is going on here, and the NetworkManager is having trouble deciding which connection to use?
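One way to read the from/to table is to collapse each address to its first five octets, on the assumption (common, but not guaranteed) that addresses differing only in the last octet are different radios or bands on the same access point. A small sketch under that assumption; roams_by_node is a hypothetical helper name.

```shell
# Collapse each MAC to its first five octets and count roams per pair.
# Reads "from to" MAC pairs on stdin (spaces or tabs between them).
roams_by_node() {
    awk '{ print substr($1, 1, 14), substr($2, 1, 14) }' |
    sort | uniq -c | awk '{ print $1, $2, $3 }'
}

# Live usage, reusing the extraction pipeline from earlier in the post.
[ -r /var/log/messages ] &&
    grep 'disconnect from AP' /var/log/messages |
    grep -o '\([[:xdigit:]][[:xdigit:]]:\?\)\{6\}' | paste - - | roams_by_node
```

If most of the count lands on pairs whose two sides differ, the card is bouncing between two access points rather than between bands on one of them.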
- dmesg -Hw is much nicer than regular dmesg
- shutdown -hP didn’t power off, but reboot -p did reboot
- reboot=acpi is a valid kernel option in newer kernels, but not acpi=on (force is still valid, though unlikely to be needed)
- lspci -k >/dev/null complains:
  pcilib: Error reading /sys/bus/pci/devices/0000:00:08.3/label: Operation not permitted
  but running with usual output doesn’t.
- Once I saw dmesg logs about my SSD?
  block nvme0n1: the capability attribute has been deprecated
  No UUID available providing old NGUID