Broken 5G signal connections (mt76 issue, closed)

openwrt avatar openwrt commented on August 20, 2024
Broken 5G signal connections

from mt76.

Comments (34)

pepe2k avatar pepe2k commented on August 20, 2024

I can confirm that, and not only for 5 GHz. It looks really strange: nothing in the log, the connection is up (the client is associated with the AP), but transmission between the router and the client just dies, and pings fail in both directions. After some time, or after a reconnect, everything works again.

Confirmed on SAP-G3200U3.

Is there a way to debug this driver?

Cheers,
Piotr

sodz avatar sodz commented on August 20, 2024

I found that the transmission between my router (with MT7612) and iPhones breaks intermittently. However, the connection between my router and my laptop (Atheros WNIC) is quite stable. What is the model of the wireless NIC of your clients?

When the transmission between my router and my iPhone died, I captured the 802.11 frames transmitted over the air using a third device. It showed that both ends could send and receive 802.11 frames properly during that period, and I could see that frames sent by each side were ACK'ed by the other side. It looks as if the iPhone is dropping all the packets at layer 3 during the outage. I am wondering whether this is a bug in the driver or in the client devices.

jiyifeng avatar jiyifeng commented on August 20, 2024

My NIC is an Intel 6300AGN.
Lenovo newifi (MT7620 inside).
Same as pepe2k.
Absolutely, it is a bug in the mt76 driver.

pepe2k avatar pepe2k commented on August 20, 2024

@sodz, @jiyifeng

Tested/confirmed on:

  • Galaxy S5
  • Broadcom BCM94360CD
  • Intel Dual Band Wireless-AC 7260
  • TP-Link TL-WDN4800

sodz avatar sodz commented on August 20, 2024

@pepe2k @jiyifeng
I have just run some tests on a Lenovo newifi (mt7612e, latest mt76 driver) with two clients: a MacBook with a bcm43xx, and a tablet with a Marvell Avastar WNIC. On both clients, iperf and ping tests were carried out simultaneously for about 1 hour. The packet loss rate turned out to be negligible, and there were no noticeable outages.

pepe2k avatar pepe2k commented on August 20, 2024

@sodz

I will do deeper tests later; at the moment I'm using the drivers from MTK, without any problems.

qiuzi avatar qiuzi commented on August 20, 2024

2.4 GHz: no problem, same as @sodz. 5 GHz: same as @pepe2k.

sodz avatar sodz commented on August 20, 2024

@pepe2k
OK, now I can confirm that with real-world traffic, the transmission dies randomly. No idea why it did not happen when I tested using iperf.

sodz avatar sodz commented on August 20, 2024

I think I have traced this issue to A-MPDU. After disabling it (by commenting out line 795 of init.c: "/* ieee80211_hw_set(hw, AMPDU_AGGREGATION); */"), communication is stable, although at a performance loss. @pepe2k, could you please try this workaround and see whether you can confirm that this is related to A-MPDU?

sodz avatar sodz commented on August 20, 2024

Here are some packets I captured, just when the transmission died:
https://www.cloudshark.org/captures/93482862e765
Apparently there are anomalies, but I am not familiar with 802.11 MAC and I don't know what actually went wrong.

nbd168 avatar nbd168 commented on August 20, 2024

please try the latest version

sodz avatar sodz commented on August 20, 2024

@nbd168
Just tried, but it didn't fix this issue.

qiuzi avatar qiuzi commented on August 20, 2024

@sodz Problem solved?

airend avatar airend commented on August 20, 2024

Thanks for your suggestion, @sodz; I think it's indeed related to frame aggregation. At least, after commenting out ieee80211_hw_set(hw, AMPDU_AGGREGATION); (mt76/init.c, line 813 in 17c5b83), clients don't have intermittent timeouts when connecting to WAN hosts. The odd part is that LAN connections seemed fine, and all WAN connections were OK on the router, so for a long time I assumed it was some sort of bridging problem. Then again, the 2.4 GHz link, using a different driver/radio, was always fine…

Another observation is that these timeouts get much worse when increasing channel width (VHT20->40->80), and are more correlated to throughput and amount of data transferred. Maybe these increase the probability of some sort of buggy frame aggregation event. Either way, things are OK after disabling this feature, and performance doesn't seem to have taken a significant hit.

Update: performance does take a major hit with 11ac/mobile clients (maybe because all 11ac data frames are supposed to be sent as A-MPDUs?). I used to be able to saturate a 30 Mbps connection, versus 11-12 Mbps now… @LorenzoBianconi, does this happen to you as well?

By the way, @nbd168, I noticed that MAX-A-MPDU-LEN-EXP is always forced to zero. I'm probably reading the iw phy output wrong, but it seems like MAX-A-MPDU-LEN-EXP3 should be supported. I'm mentioning this in case there's a conflict between mac80211.sh and the way mt76 is reporting capabilities.
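
For reference, the exponent in these capability flags maps to a maximum A-MPDU length of 2^(13 + exp) - 1 octets in the 802.11 HT and VHT capability fields. Below is a minimal sketch of that arithmetic; the function name max_ampdu_len is made up for illustration and is not part of iw, hostapd, or mt76.

```python
def max_ampdu_len(exp: int, vht: bool = True) -> int:
    """Maximum A-MPDU length in octets for a given length exponent.

    HT capabilities allow exponents 0..3, VHT capabilities allow 0..7.
    (Hypothetical helper, for illustration only.)
    """
    limit = 7 if vht else 3
    if not 0 <= exp <= limit:
        raise ValueError(f"exponent {exp} out of range 0..{limit}")
    return (1 << (13 + exp)) - 1


if __name__ == "__main__":
    for exp in (0, 3, 7):
        print(f"MAX-A-MPDU-LEN-EXP{exp}: {max_ampdu_len(exp)} octets")
```

So an exponent forced to 0 would cap aggregates at 8191 octets, whereas MAX-A-MPDU-LEN-EXP3 would allow up to 65535.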

nbd168 avatar nbd168 commented on August 20, 2024

Please test if the current version still has this issue

airend avatar airend commented on August 20, 2024

Seems better now (0a47c46); no timeouts for roughly fifteen minutes, and bandwidth was back to normal on my Nexus 5, but then it happened again… I uploaded the package here, in case anyone else wants to test. Thanks, @LorenzoBianconi and @nbd168, for working on this!

Update: the timeouts are a lot more random now, and happen more rarely, but still very frustrating. No obvious errors on either router, or clients… I haven't done a proper git bisection, but I went as far back as July 6th (d1a6945), and timeouts still happen, plus much reduced bandwidth.

Update2: Same observations with 659530a (updated package here). To reiterate, local connectivity (existing links, ping, etc) is maintained, but WAN stops working for a few minutes, with no obvious pattern. All goes back to normal when ieee80211_hw_set(hw, AMPDU_AGGREGATION) is disabled. I wonder whether compat-wireless-2015-07-21 has anything to do with it… Also, pings are very consistent on the router, but very erratic over Wi-Fi. Testing compat-wireless-2015-08-03 now.

qiuzi avatar qiuzi commented on August 20, 2024

Problem still not solved

airend avatar airend commented on August 20, 2024

Hey @nbd168, just a crazy idea… Since NAT/TCP seems to be involved somehow, do you think GRO or the generic segmentation offloading might cause this issue with large MPDUs? I'll try to play with ethtool, since nothing else worked so far :-(

nbd168 avatar nbd168 commented on August 20, 2024

I don't think this has anything to do with GRO or similar things, because this is all abstracted away by the network stack.
Please try the latest version (committed in OpenWrt trunk r47063), I found some more aggregation related bugs

airend avatar airend commented on August 20, 2024

Thanks again for your tireless efforts, @nbd168. Unfortunately, I'm the bearer of bad news yet again. Things have actually gotten worse after 9e972d5; now, even moderate network loads trigger timeouts. They happen more quickly, and recovery takes longer, or doesn't happen at all. I have a few HT clients, and one VHT (Nexus 5). As before, disabling A-MPDU will fix the issue, but then everything slows down a lot (5-6 Mbps). Here are a few things I noticed so far:

  • Only WAN connections time out, which is the weirdest part; the SSH link to the router is always OK, and there are no logged errors whatsoever.
  • Every combination of software behaves more or less the same: compat-wireless 09-16, hostapd 2.5 just released, etc (ditto latest stable/Chaos Calmer).
  • When disabling A-MPDU, things have gotten worse after 08-28 (b6de6a0), although that probably doesn't matter since aggregation is a core feature.
  • Things are much better with a dumb AP setup, behind a Linksys E1500 router. An HT40 client (Ralink) works quite well now, while the Nexus 5 still times out. This happens regardless of htmode (HT40, VHT40, or VHT80).

I even worried about segmentation/offloading, MTUs (MSS clamping), etc, but it can't be those as you pointed out (the mt7602 radio never has this issue, after all). I wish I knew more about mac80211 and the network stack, but I'm glued to any development here ;-)

nbd168 avatar nbd168 commented on August 20, 2024

Can you use another device to capture all packets in monitor mode before and during the hangs?
If so, please make the AP run in HT20 mode to ensure that the monitor mode capture is as reliable as possible.

airend avatar airend commented on August 20, 2024

On VHT20, it takes slightly longer for the connection to break (probably, because it's slower), but here's the raw capture after things go haywire. I also uploaded the file on CloudShark here, if it helps.

The LG STA is a Nexus 5 (supposedly, single stream 11ac). I don't know much about this, but lots of fragmentation errors, malformed packets, etc happening. All in all, not good things…

nbd168 avatar nbd168 commented on August 20, 2024

What kind of device did you use to capture? Also, can you please do another capture in HT20 (not VHT20) mode? That should make capturing data packets (which I need) more reliable.

airend avatar airend commented on August 20, 2024

Data were collected with the builtin card in my Macbook Air (BCM43xx in sniffing/monitor mode). I don't think I have a better setup readily available. At any rate, I switched to HT20 (channel 44), and uploaded more PCAPs in that Box folder.

  • The good file captures the short period when things seem to work (also here on CloudShark).
  • The bad1 file was captured after the link broke following a speed test (also here). Apparently, on HT20, the timeouts seem to recover pretty quickly, and towards the end of the capture, simple browsing started working, albeit not very well.
  • The bad2 file was captured after the link recovered, and I decided to stress it with another speed test, when it breaks again (also here).

I was a bit hasty with my previous comment; those damaged packets were neighborhood noise, and for the purpose of these tests, my only active STA on that channel is the LG Nexus 5 (BCM4339). As you can see, I'm very keen on fixing this ;-) and much appreciate your work.

airend avatar airend commented on August 20, 2024

Progress, I think ;-) I was trying to make sense of the information in those captured packets, and based on my limited understanding, it sounded like power saving might be involved. No luck with UAPSD and WMM as possible culprits, but I noticed the not-so-benign changes in bca9b7c. Since my issues got worse as more BARs were sent, I reverted the relevant commits, and things seem OK now.

nbd168 avatar nbd168 commented on August 20, 2024

Please try the latest git version without the BAR related reverts. It's good to know that the BAR frames are triggering the issue, but I still need to understand why.

airend avatar airend commented on August 20, 2024

It would be really great to understand why, especially since other bugs may lurk, either in this driver or in the SoC stuff. I always revert everything with your newest commits. Testing d4900fc on top of openwrt-mirror/openwrt@73edad2 yielded the same timeouts, but I just noticed a couple of interesting RX buffer fixes (e.g., openwrt-mirror/openwrt@73edad2, openwrt-mirror/openwrt@966bec6). I'm currently testing your latest changes on top of openwrt-mirror/openwrt@a6900bd.

What still baffles me most is why these timeouts are so much worse when the mt7620 does normal routing, versus simple bridging behind another WAN router… Fix openwrt-mirror/openwrt@118b711 for mt7621 is intriguing; do you think we have similar issues with mt7620? Also, would @blogic be able to chime in on this very stubborn issue?

P.S. On latest everything, my one 11ac client seems to behave better, but the other 11n clients still suffer periodic timeouts.

nbd168 avatar nbd168 commented on August 20, 2024

Found another bug that would mess up BAR transmissions. Please try trunk r47142 with the latest mt76.

airend avatar airend commented on August 20, 2024

Seems, dare I say, fixed ;-) I think your latest mac80211 patch (openwrt-mirror/openwrt@bdeb166) was the keystone to all this craziness, maybe? Either way, do you think it'd be a good idea to consolidate some of the aggregation-related flags? For example, a valid mtxq->agg_ssn implies mtxq->aggr true, or maybe I'm misunderstanding some of these.

Otherwise, just curious, are we doing a lot more in software than other mac80211 drivers, so we need to be more careful about tracking and sending BARs when aggregating frames?

nbd168 avatar nbd168 commented on August 20, 2024

I don't see a good way to consolidate the flags. agg_ssn just tracks the last used sequence number during an aggregation session (we could store it outside of aggregation as well, but it's not needed then). We can't tell from the value whether it is valid or not (0 is a valid value as well), so we can't easily get rid of mtxq->aggr.
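
To make the point about 0 being a valid value concrete, here is a small sketch. It is a hypothetical model, not the mt76 data structure: the class TxqState and its methods are made up, and only the field names aggr and agg_ssn come from the discussion above. It assumes the usual 12-bit 802.11 sequence number space.

```python
from dataclasses import dataclass

SEQ_MODULO = 4096  # 802.11 sequence numbers are 12 bits wide (0..4095)


@dataclass
class TxqState:
    """Hypothetical stand-in for the per-queue state discussed above."""

    aggr: bool = False  # is an aggregation session currently active?
    agg_ssn: int = 0    # last used sequence number; only meaningful if aggr

    def start_session(self, ssn: int) -> None:
        self.aggr = True
        self.agg_ssn = ssn % SEQ_MODULO

    def stop_session(self) -> None:
        # Only the flag changes; agg_ssn keeps whatever value it had.
        self.aggr = False


if __name__ == "__main__":
    idle = TxqState()
    active = TxqState()
    active.start_session(0)  # a session may legitimately start at SSN 0

    # The sequence number alone cannot tell the two queues apart...
    print(idle.agg_ssn == active.agg_ssn)
    # ...so a separate flag is needed.
    print(idle.aggr, active.aggr)
```

Every value from 0 to 4095 is a legal sequence number, so there is no spare value left to act as a "no session" sentinel without widening the field.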

In terms of doing things in software vs hardware, there are two main classes of devices: those having aggregation handling in software and those having it in hardware.
With ath9k, the software controls everything related to aggregation: sequence number assignment, forming aggregates, selecting rate retry table for each full aggregate.
With ath10k, iwlwifi, etc. the firmware handles all these things, the software only does the protocol handshake.

mt76 is somewhere in between. Sequence number assignment is handled in software, while aggregating frames into A-MPDUs is handled in hardware.
This hardware design is actually not very pleasant to deal with, because it makes it necessary for software to deal with all kinds of stuff, yet it does not give the software enough control to do it well. The driver does not get a reliable tx status or aggregation feedback, so it cannot know which frames exactly a client received and what its receiver aggregation window looks like. Because of this, the driver needs to do stupid things like send BAR frames on station PS wakeups.
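
As a rough illustration of the receiver aggregation window and of what a BAR does to it, here is a toy model. It is not mt76 or mac80211 code: the class and method names are invented, it handles a single TID, and it ignores the reorder timeout that real receivers typically use. It only shows how frames that arrived fine can sit in the reorder buffer behind a gap until the window start is moved.

```python
SEQ_MODULO = 4096  # 12-bit 802.11 sequence number space


class ReorderBuffer:
    """Toy receiver-side block-ack reorder buffer for one TID."""

    def __init__(self, head=0, window=64):
        self.head = head      # next sequence number to deliver
        self.window = window  # block-ack window size in frames
        self.held = {}        # seq -> payload, waiting for a gap to fill
        self.delivered = []   # payloads passed up the stack, in order

    def _release_in_order(self):
        # Deliver consecutive frames starting at the window start.
        while self.head in self.held:
            self.delivered.append(self.held.pop(self.head))
            self.head = (self.head + 1) % SEQ_MODULO

    def _advance_to(self, new_head):
        # Move the window start, delivering held frames that are skipped over.
        while self.head != new_head:
            if self.head in self.held:
                self.delivered.append(self.held.pop(self.head))
            self.head = (self.head + 1) % SEQ_MODULO
        self._release_in_order()

    def rx_data(self, seq, payload):
        offset = (seq - self.head) % SEQ_MODULO
        if offset >= SEQ_MODULO // 2:
            return  # behind the window: treated as a duplicate and dropped
        if offset >= self.window:
            # Beyond the window: slide it so that seq becomes its last slot.
            self._advance_to((seq - self.window + 1) % SEQ_MODULO)
        self.held[seq] = payload
        self._release_in_order()

    def rx_bar(self, ssn):
        # A BlockAckReq tells the receiver to stop waiting for frames < ssn.
        offset = (ssn - self.head) % SEQ_MODULO
        if 0 < offset < SEQ_MODULO // 2:
            self._advance_to(ssn)


if __name__ == "__main__":
    rb = ReorderBuffer()
    rb.rx_data(0, "f0")
    # Frame 1 is lost and the sender gives up on it.
    rb.rx_data(2, "f2")
    rb.rx_data(3, "f3")
    print(rb.delivered)  # f2 and f3 are received but still held back
    rb.rx_bar(2)         # the BAR moves the window start past the gap
    print(rb.delivered)
```

In this model, a sender that cannot tell where the receiver's window actually is has no safe option other than sending a BAR to resynchronise it, which is the situation described above.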

In terms of driver complexity, mt76 is a lot simpler than pretty much any comparable driver that allows for software aggregation control. This is made possible by two things:

  1. I added a layer of abstraction that allows mac80211 to control per-station per-TID queues that the driver can pull from, reducing driver complexity, and massively reducing bufferbloat in the driver.
  2. I wrote mt76 completely from scratch, free from all the insanities of typical vendor written code :)
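
The pull model from point 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the mac80211 interface: StackQueues, driver_fill_hw, and the round-robin policy are all invented for the example. The idea it shows is that frames stay in stack-owned per-station, per-TID queues until the driver has room for them, instead of piling up inside the driver.

```python
from collections import defaultdict, deque


class StackQueues:
    """Toy stand-in for stack-owned queues keyed by (station, TID)."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def enqueue(self, sta, tid, frame):
        self.queues[(sta, tid)].append(frame)

    def dequeue(self, sta, tid):
        q = self.queues[(sta, tid)]
        return q.popleft() if q else None


def driver_fill_hw(stack, active, hw_slots):
    """Pull frames round-robin until the hardware slots are used up.

    `active` lists the (station, TID) queues that have signalled pending
    traffic; anything not pulled stays queued in the stack.
    """
    scheduled = []
    pending = deque(active)
    while pending and len(scheduled) < hw_slots:
        key = pending.popleft()
        frame = stack.dequeue(*key)
        if frame is not None:
            scheduled.append((key, frame))
            pending.append(key)  # revisit this queue on the next round
    return scheduled


if __name__ == "__main__":
    stack = StackQueues()
    for frame in ("a1", "a2", "a3"):
        stack.enqueue("A", 0, frame)
    stack.enqueue("B", 0, "b1")
    print(driver_fill_hw(stack, [("A", 0), ("B", 0)], hw_slots=3))
```

With three hardware slots, one busy station cannot crowd out the other, and the frame that does not fit remains in the stack's queue rather than in a driver buffer.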

Either way, I'll mark this ticket as fixed now. Thanks a lot for testing! Feel free to reopen if issues re-occur.

airend avatar airend commented on August 20, 2024

Thanks so much, @nbd168, not only for all the good work, but also for taking the time to explain how things work. This is great information!

zb87 avatar zb87 commented on August 20, 2024

Hi @nbd168, thanks for your great work. However, I find mt76 is still not stable for some client devices (e.g., iPhone 6 Plus).

I am using a Lenovo Y1. I have 3 client devices that support 802.11ac. mt76 works very well with my PC (with an Intel 7260 AC) and my tablet (Nexus 9). However, it does not work well with the iPhone 6 Plus (with iOS 9.0.2). Sometimes all transmissions suddenly time out, while the Wi-Fi is still shown as connected on both the router and the iPhone. Manually turning Wi-Fi off and on again on the iPhone makes it work again. The link can also recover automatically after a few minutes. Please tell me if you need any additional information.

FYI, I am using Openwrt Chaos Calmer 15.05 stable, with the latest version of mt76 0169cab. To make mt76 work in Chaos Calmer, I've reverted d1a6945 and removed the IEEE80211_HW_SUPPORT_FAST_XMIT flag in init.c.

nbd168 avatar nbd168 commented on August 20, 2024

@zb87, the latest fix that I made was in mac80211, not mt76 directly. I have already pushed the relevant fix into the Chaos Calmer Branch and updated mt76 to the latest version there.
I have also pushed a hostapd fix that might help with stability on iOS devices.
Please try the latest version of the branch as-is to see if it's more stable for you.

zb87 avatar zb87 commented on August 20, 2024

I've tried the new Chaos Calmer branch, and everything is working well. Thanks for your nice work, @nbd168.
