Speed up DNSDist with AF_XDP


Preface

DNSDist is an excellent DNS load balancer, and AF_XDP is an emerging high-performance Linux asynchronous I/O interface that benefits from eBPF.
It is a great honor for Y7n05h to have participated in the AF_XDP adaptation of DNSDist as a contributor.

The work on the UDP path of DNSDist has long since come to an end. A change meant to improve performance should not remain a claim on paper: it needs benchmark data to back it up.
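
Before diving into the numbers, and for readers unfamiliar with AF_XDP, the sketch below illustrates roughly what using it through libxdp looks like: register a UMEM buffer area shared with the kernel, then bind an XSK socket to one queue of a NIC. This is only an illustrative sketch, not the DNSDist patch itself; the interface name "eth0", queue id 0 and the buffer sizes are assumptions.

/* Minimal AF_XDP setup sketch using libxdp (illustrative only). */
#include <stddef.h>
#include <sys/mman.h>
#include <xdp/xsk.h>

#define NUM_FRAMES 4096
#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE

int main(void)
{
	struct xsk_ring_prod fill, tx;
	struct xsk_ring_cons comp, rx;
	struct xsk_umem *umem;
	struct xsk_socket *xsk;
	size_t len = (size_t)NUM_FRAMES * FRAME_SIZE;

	/* Packet buffers shared between the kernel and user space. */
	void *bufs = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bufs == MAP_FAILED)
		return 1;

	/* Register the buffer area as a UMEM (NULL config = library defaults). */
	if (xsk_umem__create(&umem, bufs, len, &fill, &comp, NULL))
		return 1;

	/* Bind an AF_XDP socket to queue 0 of "eth0" (assumed interface). */
	if (xsk_socket__create(&xsk, "eth0", 0, umem, &rx, &tx, NULL))
		return 1;

	/* ... produce descriptors to the fill ring, then receive and
	 * transmit packets via the rx/tx rings ... */

	xsk_socket__delete(xsk);
	xsk_umem__delete(umem);
	munmap(bufs, len);
	return 0;
}

Packets then move through the fill/rx and tx/completion rings mapped into user space and shared with the kernel, which reduces per-packet overhead compared with recvmsg()/sendmsg() on an ordinary UDP socket.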

So, let’s start the fun performance analysis.

Test environment information

Laptop
OS: ArchLinux
Kernel Version: 5.19.1-arch2-1
CPU: AMD Ryzen 7 4800H with Radeon Graphics
MEM: DDR4 64G
NIC: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
DNSPerf Version: 2.9.0
GCC Version: 12.1.1
Libxdp Version: 1.2.5
The Laptop's CPU, memory and NIC stayed under low load throughout the tests.

PC1
OS: ArchLinux
Kernel Version: 5.19.1-arch2-1
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
MEM: DDR4 8G
NIC: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
Apart from essential system processes, PC1 ran only DNSDist during the tests.

PC2
OS: Ubuntu
Kernel Version: 5.15.0-46-generic
CPU: 12th Gen Intel(R) Core(TM) i7-12700KF
MEM: DDR4 64G
NIC: Broadcom Inc. and subsidiaries BCM4360 802.11ac Wireless Network Adapter (rev 03)
PC2's CPU, memory and NIC stayed under low load throughout the tests.

Thanks to @yegetables for lending PC2 to Y7n05h; it gave the SmartDNS service a stable host, which was a great help for this test.

Source code:
DNSDist with AF_XDP (AF_XDP version): https://github.com/Y7n05h/pdns/commit/d42e356a48a433a9f4efae9c3dd648101a37abdf
DNSDist without AF_XDP (Normal version): https://github.com/Y7n05h/pdns/commit/f5e76c2a6932ec4360d38219fb515d26d538b40d

In this test, Y7n05h used the Laptop to generate the test queries, PC1 to run the DNSDist instance under test, and PC2 to run the SmartDNS service.
The Laptop, PC1 and PC2 were all connected to the gateway (192.168.30.1) over Wi-Fi and reached each other over Wi-Fi. Unfortunately, the network used for this test was still being used by numerous other devices during the test, so Y7n05h cannot rule out experimental error caused by fluctuations in the network environment.

The test tool is DNSPerf. Each test resolves the A records of 12018 different domains, querying each domain only once per test.

For the tests, SmartDNS was used as the DNS server. There is no particular reason to use SmartDNS, except that Y7n05h is familiar with it and it is easy to build and deploy SmartDNS on Ubuntu.

During the test, Y7n05h used the Laptop to send DNS queries to PC1 running DNSDist. After resolving and processing a query, DNSDist forwards it (if necessary) to SmartDNS running on PC2. SmartDNS in turn sends the query concurrently to the four upstream DNS servers 1.1.1.1, 1.0.0.1, 8.8.8.8 and 119.29.29.29 (again, only if necessary) and replies to DNSDist with the first response it receives.

Because of the testing requirements, Y7n05h frequently sent a large number of DNS queries to the four public DNS servers above; Y7n05h sincerely thanks Cloudflare, Google and DNSPod for providing these public DNS services. (Although a large number of DNS queries were performed during the test, they are cached by DNSDist and SmartDNS, so not every query results in a request to the servers above. Y7n05h therefore feels this stays within reasonable limits and is fundamentally different from a DDoS.) Also, to eliminate as much interference as possible from the DNS services' caches, before this test began Y7n05h used DNSPerf to repeatedly send resolution requests to SmartDNS for all of the domains used in the test.

In the current network environment, Y7n05h's subjective guess is that DNS queries are still dominated by A and AAAA records, and Y7n05h's environment does not have good IPv6 support, so A-record resolution was used as the performance indicator in this test. Since Y7n05h cannot verify the correctness of the DNS resolution results, correctness is not considered as a metric in this analysis.

It should also be noted that the DNSDist configuration used in this test has been simplified for testing purposes and may differ significantly from the DNSDist configuration in the production environment. Therefore, this test by Y7n05h does not fully reflect the performance of AF_XDP’s optimization of DNSDist in a production environment.

As we all know, the DNS protocol uses recursive resolution, and the response time of requests is greatly affected by whether the query domain hits the DNS server’s cache or not.

In summary, this test by Y7n05h may be biased and may be inaccurate. The following is only the opinion of Y7n05h.

Performance Tests

The file uniq.txt contains exactly 12018 unique domains.

In this article, the horizontal axis of every chart is the run number, which increases by 1 with each DNSPerf run. Each curve plots the results of multiple DNSPerf executions against the same DNSDist process instance, with points ordered chronologically from left to right. No two DNSPerf runs overlapped in time.

To simplify the exposition, Y7n05h adopts the following conventions with the reader:

  • The DNSDist build that uses AF_XDP is abbreviated as the “AF_XDP version”.
  • The DNSDist build without AF_XDP is abbreviated as the “Normal version”.

Test 1

The following command was used in this test to run DNSPerf:

dnsperf -s 192.168.30.170 -p 5300 -d uniq.txt

In test 1, Y7n05h ran the AF_XDP version first, then the Normal version, and finally the AF_XDP version again.

Looking at the average latency first, every curve shows a decreasing trend. The decrease is largely because, after several DNSPerf executions, SmartDNS and the upstream servers have a progressively higher cache hit rate for the domains involved in the test. This is supported by the fact that the average latency of the second AF_XDP run, performed after the Normal version, is significantly lower than the earlier results. Based on the data available at this point, it is not possible to determine the impact of AF_XDP itself on average latency.

For the metric of lost queries, the AF_XDP version loses significantly fewer queries than the Normal version, although the number of lost queries in the AF_XDP version tends to increase slowly with the number of runs.

For the average number of queries per second, i.e. throughput, the AF_XDP version is significantly better than the Normal version. Considering that the data on the blue curve was collected between the green and red runs, caching boosts the green curve and penalizes the red one; so, taking the cache state of the blue run as the baseline, the real throughput of the AF_XDP version lies roughly between the red and green curves. That is still a strong enough contrast with the blue curve of the Normal version to show the throughput advantage of AF_XDP.

The runtime is the time taken by one complete DNSPerf execution. Here, too, the AF_XDP version outperforms the Normal version. The conclusions mirror those of the throughput analysis, so Y7n05h will not repeat them.

Even allowing for the fact that DNS caching in DNSDist, SmartDNS and the upstream servers makes each successive DNSPerf run faster, one conclusion is already clear: AF_XDP significantly improves DNSDist's throughput in this scenario. Its effect on query latency still requires further testing.

Test 2

The following command was used in this test to run DNSPerf:

dnsperf -s 192.168.30.170 -p 5300 -d uniq.txt -c 500 -T 16

In Test 2, the concurrency of the test was increased via the additional command-line arguments (-c 500 makes DNSPerf act as 500 clients and -T 16 runs it with 16 threads). Test 2 ran the Normal version first and the AF_XDP version second.

Note: Test 2 was not run consecutively with Test 1, which may have affected the DNS cache.

The average latency of the AF_XDP version is still decreasing and is not significantly different from the Normal version in the last two runs. Y7n05h personally guesses that with more runs, the average latency of the AF_XDP version might drop below the Normal version as the cache hit rate increases. In the first 3 runs, the average latency of the AF_XDP version was significantly higher than the Normal version, perhaps because stopping the Normal version and starting the AF_XDP version cleared DNSDist's cache. The effect of AF_XDP on average latency still needs further testing.

Comparing lost queries between the Normal and AF_XDP versions, the relationship between them is similar to that in Test 1, and the absolute number of lost queries also shows no significant change from Test 1.

In terms of throughput, with the higher query concurrency the gap between the AF_XDP version and the Normal version widens further, and it tends to keep growing as the number of runs increases.

The AF_XDP version is significantly less time consuming than the Normal version for one DNSPerf execution. This is similar to what was found in Test 1.

Summary

AF_XDP significantly improves DNSDist throughput, but risks increasing the average latency per request (which needs to be further verified).

In terms of throughput alone, it is conservatively estimated that AF_XDP can more than double the throughput of DNSDist.

From these tests, AF_XDP appears to be a technique with the potential to significantly improve the throughput of UDP-based network services.


An Incomplete Analysis of list.h in the Linux Kernel

It has been a while since Y7n05h last wrote a proper blog post; failing to keep it up is rather embarrassing, so consider this article an attempt to make up for it.

License note: this article quotes portions of the Linux kernel source code. The code is taken from Linux kernel v2.6.34 and is licensed under GPLv2.

list.h source code analysis

/*
 * Simple doubly linked list implementation.
 *
 * Some of the internal functions ("__xxx") are useful when
 * manipulating whole lists rather than single entries, as
 * sometimes we already know the next/prev entries and we can
 * generate better code by using them directly rather than
 * using the generic single-entry routines.
 */

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

This is the core structure of the list; it implements the initialization of a circular doubly linked list.
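
As a quick illustration (hypothetical struct and variable names; it assumes the list.h definitions quoted above), a user embeds a struct list_head in their own structure and initializes a list head either statically with LIST_HEAD() or at run time with INIT_LIST_HEAD():

/* Hypothetical user of the list API. */
struct task_item {
	int id;
	struct list_head node;	/* links this item into a list */
};

LIST_HEAD(task_list);		/* static definition: head points to itself */

static void reset_list(struct list_head *head)
{
	INIT_LIST_HEAD(head);	/* run-time (re)initialization */
}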

static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/**
 * list_add - add a new entry
 * @new: new entry to be added
 * @head: list head to add it after
 *
 * Insert a new entry after the specified head.
 * This is good for implementing stacks.
 */
static inline void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}

/**
 * list_add_tail - add a new entry
 * @new: new entry to be added
 * @head: list head to add it before
 *
 * Insert a new entry before the specified head.
 * This is useful for implementing queues.
 */
static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
	__list_add(new, head->prev, head);
}

There is not much about insertion that needs elaborate explanation. The only thing worth mentioning is that the use of inline eliminates function-call overhead, at the cost of a slightly larger kernel image, a price I think is worth paying.
Of course, the way __list_add() is reused here, abstracting the two different insertion modes, is quite elegant.
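
To make the ordering difference concrete, here is a small sketch (hypothetical names; it assumes the list.h definitions quoted above): list_add() inserts right after the head, giving stack (LIFO) order, while list_add_tail() inserts just before the head, giving queue (FIFO) order.

static void demo_insert_order(void)
{
	struct list_head a, b, c, x, y, z;
	LIST_HEAD(stack_like);
	LIST_HEAD(queue_like);

	/* Front insertion: traversal from the head sees c, b, a (LIFO). */
	list_add(&a, &stack_like);
	list_add(&b, &stack_like);
	list_add(&c, &stack_like);

	/* Tail insertion: traversal from the head sees x, y, z (FIFO). */
	list_add_tail(&x, &queue_like);
	list_add_tail(&y, &queue_like);
	list_add_tail(&z, &queue_like);
}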

/*
 * Delete a list entry by making the prev/next entries
 * point to each other.
 *
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
	next->prev = prev;
	prev->next = next;
}

/**
 * list_del - deletes entry from list.
 * @entry: the element to delete from the list.
 * Note: list_empty() on entry does not return true after this, the entry is
 * in an undefined state.
 */
#ifndef CONFIG_DEBUG_LIST
static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}
#else
extern void list_del(struct list_head *entry);
#endif

There is nothing particularly confusing here either. The only thing one might wonder about is why the pointers of a deleted node are set to LIST_POISON1 and LIST_POISON2.
In user-space programming, developers often set invalid pointers to NULL to guard against problems such as use-after-free (UAF): as soon as a pointer that was set to NULL is dereferenced, the resulting segmentation fault reveals that something went wrong. But remember that this check is performed by the kernel on behalf of user processes; kernel code cannot rely on it in the same way. So list.h instead poisons the pointers with two special, normally unmapped addresses: dereferencing them still triggers the paging protection, and the distinctive values tell the developer that a memory error involving a stale list entry has occurred.

/*
 * Architectures might want to move the poison pointer offset
 * into some well-recognized area such as 0xdead000000000000,
 * that is also not mappable by user-space exploits:
 */
#ifdef CONFIG_ILLEGAL_POINTER_VALUE
# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
#else
# define POISON_POINTER_DELTA 0
#endif

/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)
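
A small hypothetical sketch of what this poisoning buys (again assuming the list.h definitions above): after list_del(), code that mistakenly keeps using the stale entry faults at a recognizable address instead of silently walking through a corrupted list.

struct task_item {
	int id;
	struct list_head node;
};

static void demo_poison(struct task_item *item)
{
	list_del(&item->node);
	/* Now item->node.next == LIST_POISON1 and item->node.prev == LIST_POISON2. */

	/* A buggy later use of the stale entry, e.g. dereferencing
	 * item->node.next->next, faults at 0x00100100 + POISON_POINTER_DELTA,
	 * an address that is easy to recognize in an oops message. */
}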

The remaining parts contain quite a lot of material, but it is all fairly simple and relatively easy to understand, so Y7n05h will not go through it here.

The following macro, however, is quite interesting: it shows several GNU extensions to the C language. Understanding how to use the macro directly from its definition is a little difficult; looking at an example of its use helps.

/**
 * list_entry - get the struct for this entry
 * @ptr:	the &struct list_head pointer.
 * @type:	the type of the struct this is embedded in.
 * @member:	the name of the list_struct within the struct.
 */
#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:	the pointer to the member.
 * @type:	the type of the container struct this is embedded in.
 * @member:	the name of the member within the struct.
 *
 */
#define container_of(ptr, type, member) ({			\
	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
	(type *)( (char *)__mptr - offsetof(type,member) );})

Example usage:

static inline struct nfs_page *
nfs_list_entry(struct list_head *head)
{
	return list_entry(head, struct nfs_page, wb_list);
}

struct nfs_page {
	struct list_head	wb_list;	/* Defines state of page: */
	struct page		*wb_page;	/* page to read in/write out */
	struct nfs_open_context	*wb_context;	/* File state context info */
	atomic_t		wb_complete;	/* i/os we're waiting for */
	pgoff_t			wb_index;	/* Offset >> PAGE_CACHE_SHIFT */
	unsigned int		wb_offset,	/* Offset & ~PAGE_CACHE_MASK */
				wb_pgbase,	/* Start of page data */
				wb_bytes;	/* Length of request */
	struct kref		wb_kref;	/* reference count */
	unsigned long		wb_flags;
	struct nfs_writeverf	wb_verf;	/* Commit cookie */
};

It is easy to see that in struct nfs_page, the list node is the member struct list_head wb_list.

The parameter head here is a pointer to that wb_list member. So what exactly does container_of do? It computes the address of a structure from the address of one of its members. Setting the code aside for a moment: for a given structure, on a given architecture, with a given alignment, the offset of a member relative to the start of the structure is a constant known at compile time. So, given the address of the member, subtracting that offset yields the address of the enclosing structure. All of this is feasible in principle; what remains is simply how to express it in code.

Next, let's analyze how container_of is implemented:

const typeof( ((type *)0)->member ) *__mptr = (ptr);

Here typeof is used for type inference, a kind of generic programming: the declaration gives __mptr the same type as member, with a * added to make it a pointer to that member's type. This line therefore yields a pointer to the structure member. Then (type *)( (char *)__mptr - offsetof(type,member) ) uses the offsetof macro to obtain the offset of member within type and subtracts it from the member's address using pointer arithmetic.
Finally, the whole thing is wrapped in GNU's statement-expression extension, which avoids the usual nuisance of having to wrap a macro body in do { ... } while (0).
Each piece is simple enough on its own, but the way the macros are used and composed is exquisite.
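
To see the arithmetic in action, here is a self-contained user-space sketch, buildable with GCC, with a hypothetical task_item structure and container_of written out the same way as above; it checks that the macro really does recover the address of the enclosing structure:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same shape as the kernel's container_of (GNU C: typeof and statement expressions). */
#define container_of(ptr, type, member) ({			\
	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
	(type *)( (char *)__mptr - offsetof(type, member) );})

struct list_head {
	struct list_head *next, *prev;
};

struct task_item {
	int id;
	struct list_head node;
};

int main(void)
{
	struct task_item item = { .id = 42 };
	struct list_head *p = &item.node;	/* all we are given is the member */

	struct task_item *back = container_of(p, struct task_item, node);
	assert(back == &item);			/* the containing struct is recovered */
	printf("id = %d\n", back->id);
	return 0;
}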

Finally, let's take a look at list_for_each:

/**
 * list_for_each	-	iterate over a list
 * @pos:	the &struct list_head to use as a loop cursor.
 * @head:	the head for your list.
 */
#define list_for_each(pos, head) \
	for (pos = (head)->next; prefetch(pos->next), pos != (head); \
		pos = pos->next)

#ifndef ARCH_HAS_PREFETCH
#define prefetch(x) __builtin_prefetch(x)
#endif

There is not much to say about the traversal itself; the only part worth mentioning is prefetch. prefetch is simply __builtin_prefetch, which, as the name suggests, is a GCC built-in function. A quick lookup shows that it is used to prefetch data in order to reduce latency, presumably to avoid a cache miss when the data is used shortly afterwards.
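
Putting the pieces together, a typical kernel-style traversal (hypothetical structure and function names, assuming the list.h definitions above) walks the cursor with list_for_each and uses list_entry on each iteration to get back to the containing structure:

struct task_item {
	int id;
	struct list_head node;
};

static int sum_ids(struct list_head *head)
{
	struct list_head *pos;
	int sum = 0;

	list_for_each(pos, head) {
		struct task_item *item = list_entry(pos, struct task_item, node);
		sum += item->id;	/* work with the containing structure */
	}
	return sum;
}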

That wraps up this article. Y7n05h believes the remaining parts of list.h contain nothing that is difficult to understand.

References

1. Linux kernel source code, v2.6.34.