Tales of open80211s in the wild

One thing I like about working at cozybit is learning about clients’ novel uses of mesh technology. Today I’m going to tell you about one client’s use case and how we found a bug in the 802.11s mesh implementation while diagnosing their network performance.

The client has open80211s deployments of about a dozen nodes, including a mobile video camera streaming 200 kbps of unicast data to a viewing station. Simultaneously, the mesh supports push-to-talk broadcast voice data from any station. The camera is usually moving, so the topology changes often. Since the data is high throughput, quickly converging on good paths after each topology change is critical. They reported issues with data delivery and some strange behavior with path selection.

802.11s HWMP (Hybrid Wireless Mesh Protocol) routing dictates how paths from one node to another are determined and updated. In particular, each path carries a target sequence number that indicates its freshness: a lower number indicates a staler path. During path selection, stale paths lose immediately, and ties in path freshness are broken by the best airtime link metric.
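To make that rule concrete, here is a minimal sketch of the comparison in C. It is not the mac80211 code: the struct and function names are hypothetical, and it ignores sequence number wraparound, which a real implementation has to handle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of one candidate path to a target. */
struct path_candidate {
	uint32_t target_sn; /* target sequence number: higher means fresher */
	uint32_t metric;    /* cumulative airtime link metric: lower is better */
};

/*
 * Return true if the candidate should replace the current path.
 * Assumption: plain integer comparison, no wraparound handling.
 */
static bool candidate_wins(const struct path_candidate *cur,
			   const struct path_candidate *cand)
{
	if (cand->target_sn > cur->target_sn)
		return true;   /* fresher path always wins */
	if (cand->target_sn < cur->target_sn)
		return false;  /* stale path always loses, metric is ignored */
	return cand->metric < cur->metric; /* tie: better metric wins */
}
```

The important property is the middle branch: a reply with a lower target sequence number is rejected before its metric is ever looked at.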

To resolve the observed problem with path selection, we dug into the packet captures they provided to analyze their traffic and gather stats on their mesh’s performance. Before I show you excerpts from the packet captures, here is a diagram illustrating the mesh topology at the time of the capture. Direct links between stations are represented by lines:

[Diagram: mesh topology at the time of the capture]

Here’s the excerpt from the packet capture, reduced to the relevant fields. Station 0, the target of a path request from station 3, sends path replies in turn to station 1, then station 5, and then station 9. The first sequence number, 10359, is the originator sequence number (station 3’s). The second, 155 in the reply to station 1, is the target sequence number (station 0’s).

sta-0: prep->sta-1, orig=sta-3 (sn 10359), tgt=sta-0 (sn 155)
sta-0: prep->sta-5, orig=sta-3 (sn 10359), tgt=sta-0 (sn 154)
sta-0: prep->sta-9, orig=sta-3 (sn 10359), tgt=sta-0 (sn 154)

Even though station 0 sent replies to stations 5 and 9 after station 1, it used a lower target sequence number! Because a lower number marks a staler path, the latter two path replies will never be considered, even if they offer a better data path. Since station 0 used 155 as the target sequence number in its reply to station 1, and the subsequent path replies are at least as fresh, it should use at least 155 in the replies to stations 5 and 9.

Here’s the fix (Thanks, Bob!), which ensures that when the target station is sending replies to the same originator, the sequence number never decrements:

Signed-off-by: Bob Copeland <bob@cozybit.com>
---
 net/mac80211/mesh_hwmp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/mac80211/mesh_hwmp.c b/net/mac80211/mesh_hwmp.c
index 03ff5ea..94758b9 100644
--- a/net/mac80211/mesh_hwmp.c
+++ b/net/mac80211/mesh_hwmp.c
@@ -544,9 +544,10 @@ static void hwmp_preq_frame_process(struct ieee80211_sub_if_data *sdata,
 		if (time_after(jiffies, ifmsh->last_sn_update +
 					net_traversal_jiffies(sdata)) ||
 		    time_before(jiffies, ifmsh->last_sn_update)) {
-			target_sn = ++ifmsh->sn;
+			++ifmsh->sn;
 			ifmsh->last_sn_update = jiffies;
 		}
+		target_sn = ifmsh->sn;
 	} else if (is_broadcast_ether_addr(target_addr) &&
 		   (target_flags & IEEE80211_PREQ_TO_FLAG)) {
 		rcu_read_lock();
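
To see why moving that one assignment matters, here is a small userspace model of the logic before and after the patch. It is a sketch, not kernel code: the struct and function names are hypothetical, the `update_due` flag stands in for the jiffies-based timer check, and it assumes that when the sequence number is not bumped, the old code leaves `target_sn` at the value carried in the received path request.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-interface mesh state. */
struct mesh_state {
	uint32_t sn; /* this station's own sequence number */
};

/*
 * Before the patch: the local number only reaches the reply when it is
 * bumped. Otherwise the reply keeps the value passed in (assumed here
 * to be the target SN from the received path request).
 */
static uint32_t reply_sn_before(struct mesh_state *m, bool update_due,
				uint32_t preq_target_sn)
{
	uint32_t target_sn = preq_target_sn;

	if (update_due)
		target_sn = ++m->sn;
	return target_sn;
}

/* After the patch: the reply always carries the current local number. */
static uint32_t reply_sn_after(struct mesh_state *m, bool update_due,
			       uint32_t preq_target_sn)
{
	(void)preq_target_sn; /* no longer used for the reply */
	if (update_due)
		++m->sn;
	return m->sn;
}
```

Under those assumptions, start with a local number of 154 and handle three requests that each carry 154, with only the first one arriving after the update timer has expired. The old logic answers 155, 154, 154, which matches the capture above. The patched logic answers 155, 155, 155.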

 

Another issue that we had to address was tuning the path refresh time. The shorter the path refresh time, the quicker the paths can adapt, but the more airtime you spend on path request/reply management frames. So, we worked with our client to tune the path refresh time based on real data from their mesh and find an acceptable balance, but we’ll save that discussion for a future post.
