Intro#
Hi there!
I am working on a database, particularly building the storage engine(https://github.com/lnikon/tinykvpp) right
now, and I assigned myself the task of creating separate file abstractions(e.g.
I really like how it is done in leveldb, thus having RandomAccessFile, SequentialFile, AppendOnlyFile, etc…).
Buffered IO versus System IO#
So, while designing my approach to this, I started evaluating my current way of doing file IO - fstreams
versus plain POSIX write(), and simple Google benchmark:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
| static void
BM_BenchmarkFstreamWrite(benchmark::State & state) {
const std::string filename("test_stream.txt");
std::fstream fs(filename, std::fstream::in |
std::fstream::out |
std::fstream::app |
std::fstream::ate);
if (!fs.is_open()) {
std::cerr << "unable to open" << filename << '\n';
exit(EXIT_FAILURE);
}
std::string payload("aaaaa");
for (auto _: state) {
benchmark::DoNotOptimize(fs.write(payload.c_str(), payload.size()));
}
std::filesystem::remove(filename);
}
BENCHMARK(BM_BenchmarkFstreamWrite);
static void
BM_BenchmarkPosixWrite(benchmark::State & state) {
const
std::string filename("test_stream_2.txt");
int
fd = open(filename.c_str(), O_WRONLY | O_APPEND | O_CREAT, 0644);
if (fd == -1) {
std::cerr << "Unable to open " << filename << '\n';
}
std::string payload("bbbbb");
for (
auto _: state) {
write(fd, payload.c_str(), payload.size());
}
std::filesystem::remove(filename);
}
BENCHMARK(BM_BenchmarkPosixWrite);
static void BM_BenchmarkFstreamWrite(benchmark::State & state) {
const std::string filename("test_stream.txt");
std::fstream fs(filename, std::fstream::in | std::fstream::out | std::fstream::app | std::fstream::ate);
if (!fs.is_open()) {
std::cerr << "unable to open" << filename << '\n';
exit(EXIT_FAILURE);
}
std::string payload("aaaaa");
for (auto _: state) {
benchmark::DoNotOptimize(fs.write(payload.c_str(), payload.size()));
}
std::filesystem::remove(filename);
}
BENCHMARK(BM_BenchmarkFstreamWrite);
static void BM_BenchmarkPosixWrite(benchmark::State & state) {
const std::string filename("test_stream_2.txt");
int fd = open(filename.c_str(), O_WRONLY | O_APPEND | O_CREAT, 0644);
if (fd == -1) {
std::cerr << "Unable to open " << filename << '\n';
}
std::string payload("bbbbb");
for (auto _: state) {
write(fd, payload.c_str(), payload.size());
}
std::filesystem::remove(filename);
}
BENCHMARK(BM_BenchmarkPosixWrite);
|
shows following result:
1
2
3
4
5
| Benchmark Time CPU Iterations
BM_BenchmarkFstreamWrite 24.0 ns 24.0 ns 30455681
BM_BenchmarkPosixWrite 2001 ns 2001 ns 356052
|
Payload size and std::fstream buffering#
So, simple POSIX write() is slower almost 100 times. This result really shocked me.
I decided to strace the program and saw that for sufficiently big
inputs std::fstream::write’s got vectorized! What do I mean by ‘sufficiently big input’?
Let’s examine the following small programs and their straces.
To simulate the load I wrapped the write() into a while loop.
The payload size is the same for all test cases, only the iteration count changes.
Iteration count: 3
1
2
3
4
5
6
7
| size_t count {
3
};
std::string payload("aaa");
while (count-- != 0) {
fs << payload;
}
|
Strace:
1
2
3
4
| openat(AT_FDCWD, "test_stream_3.txt", O_RDWR|O_CREAT|O_APPEND, 0666) = 3
lseek(3, 0, SEEK_END) = 40755
write(3, "aaaaaaaaa", 9) = 9
close(3) = 0
|
What we see is a simple write of 9 characters. But what would I’ve expected is to see three writes of 3 chars. So std::fsteam does internal buffering? Maybe that’s the way it derives from std::basic_filebuf?
Okay, next program:
Iteration count 9 * 1024:
1
2
3
4
5
6
7
| size_t count {
9 * 1024
};
std::string payload("aaa");
while (count-- != 0) {
fs << payload;
}
|
Strace:
1
2
3
4
5
6
7
8
9
10
11
12
13
| openat(AT_FDCWD, "test_stream_3.txt", O_RDWR|O_CREAT|O_APPEND, 0666) = 3
lseek(3, 0, SEEK_END) = 49980
writev(3, [{iov_base="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., iov_len=8190}, {iov_base="aaa", iov_len=3}], 2) = 8193
writev(3, [{iov_base="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., iov_len=8190}, {iov_base="aaa", iov_len=3}], 2) = 8193
writev(3, [{iov_base="aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., iov_len=8190}, {iov_base="aaa", iov_len=3}], 2) = 8193
write(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 3069) = 3069
close(3) = 0
|
So what we see is that std::fstream is smart enough to auto-vectorize IO! This is really surprising to me.
After seeing this trace I decided to dig into the libstdc++ and find out the logic behind vectorization,
and I found… nothing. Grep showed zero calls to writev().
A simple search in fstream and related headers did not reference writev().
So the natural question is: where does vectorization happen?
The next experiment I conducted was to try to see if maybe the kernel somehow decides
that it can vectorize things?
So I drafted the same logic using plain write() calls.
1
2
3
4
5
6
7
| std::string payload("bbbbb");
size_t count {
9 * 1024
};
while (count-- != 0) {
write(fd, payload.c_str(), payload.size());
}
|
strace:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
| ......
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
write(3, "bbbbb", 5) = 5
exit_group(0) = ?
|
Strace just shows a ton of calls to write: exactly what I expected when std::fstream::write()-ing
into the stream.
Manual IO vectorizaton via writev()#
And the last experiment was to benchmark explicitly vectorized IO against fstream.
1
2
3
4
5
6
7
8
9
| std::vector <iovec> iov;
std::string payload("bbbbb");
for (
auto _: state) {
iov.emplace_back(iovec {
.iov_base = payload.data(), .iov_len = payload.size()
});
}
writev(fd, iov.data(), iov.size());
|
And the benchmark showed interesting result:
1
2
3
4
5
6
7
| Benchmark Time CPU Iterations
BM_BenchmarkFstreamWrite 24.0 ns 24.0 ns 28733493
BM_BenchmarkPosixWrite 1828 ns 1828 ns 384717
BM_BenchmarkPosixScatterWrite 37.9 ns 37.9 ns 16676420
|
DIY writev() is almost two times slower than std::fstream::write!
What are your interpretations of these results? How to understand where the vectorization happens? Maybe I should abandon my idea of having custom file abstractions and use std::fstream.
I am using 6.10.3-1-default with gcc 13.3.1 with optimization levels set to -O2.
Thank you for reading!