# Operations
This document describes our ct-sans data collection, including information about
the local system and a timeline leading up to assembling the 2023-04-03 dataset.
## Summary
The initial download time for the current CT logs was 11 days (March 2023). The
time to assemble the final dataset of 0.91B unique SANs (25.2GiB) was 6 hours.
The assembled data set is available here:
- https://dart.cse.kau.se/ct-sans/2023-04-03-ct-sans.zip
## Local system
We're running Ubuntu in a VM:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
Our VM is configured with 62.9GiB RAM, 32 CPU threads (each virtual CPU reports
a single core), and a ~2TiB SSD:
$ grep MemTotal /proc/meminfo
MemTotal:       65948412 kB
$ grep -c processor /proc/cpuinfo
32
$ grep 'cpu cores' /proc/cpuinfo | uniq
cpu cores : 1
$ df -BG /home
Filesystem 1G-blocks Used Available Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 2077G 220G 1772G 12% /
This VM shares a 1x10Gbps link with other VMs on the same network that we have
no control over. We installed `vnstat` to track our own bandwidth usage over time:
# apt install vnstat
# systemctl enable vnstat.service
# systemctl start vnstat.service
We also installed Go version 1.20 (see [install instructions][]):
$ go version
go version go1.20.2 linux/amd64
[install instructions]: https://go.dev/doc/install
The `ct-sans` versions we used (installed as
`git.cs.kau.se/rasmoste/ct-sans@VERSION`) are listed in the timeline below.
## Timeline
| date | time (UTC) | event | notes |
| ---------- | ---------- | --------------------------- | ------------------------------------- |
| 2023/03/18 | 20:05:30 | snapshot and start collect | running v0.0.1, see command notes [1] |
| 2023/03/27 | 14:53:59 | stop collect, bump version | install v0.0.2, see migrate notes [2] |
| 2023/03/27 | 15:03:12 | start collect again | mainly waiting for Argon2023 now [3] |
| 2023/03/29 | 10:22:24 | collect completed | |
| 2023/03/29 | 15:46:44 | snapshot and collect again | download backlog from last 10 days |
| 2023/03/30 | 05:52:38 | collect completed | |
| 2023/03/30 | 08:58:50 | snapshot and collect again | download backlog from last ~16 hours |
| 2023/03/30 | 09:53:34 | collect completed | bandwidth usage statistics [4] |
| 2023/03/30 | 10:05:40 | start assemble | still running v0.0.2 [5] |
| 2023/03/30 | 16:06:39 | assemble done | 0.9B SANs (25GiB, 7GiB zipped in 15m) |
| 2023/04/02 | 23:31:37 | snapshot and collect again | download backlog, again |
| 2023/04/03 | 03:54:18 | collect completed | |
| 2023/04/03 | 08:52:28 | snapshot and collect again | final before assembling for real use |
| 2023/04/03 | 09:22:22 | collect completed | |
| 2023/04/03 | 09:30:00 | start assemble | [5] |
| 2023/04/03 | 16:12:38 | assemble done | 0.91B SANs (25.2GiB) from 3.74B certs [6] |
| 2024/02/10 | 09:10:20 | snapshot and start collect | still running v0.0.2 [7] |
| 2024/02/12 | 03:54:13 | abort collection | not needed for our paper contribs [8] |
| 2025/01/06 | | find rate-limit bug | [9] |
| 2025/03/12 | | update Go module name | to reflect current code repo [10] |
## Notes
### 1
$ ct-sans snapshot >snapshot.stdout
$ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr
### 2
In addition to adding the assemble command, `v0.0.2` automatically stores
notice.txt files in each log's directory. This means the output on stdout can
be discarded rather than stored and managed manually in the long run (e.g.,
grepping for NOTICE prints when assembling data sets).
Commit `ad9fb49670e28414637761bac4b8e8940e2d6770` includes a Go program that
transforms an existing `collect.stderr` file to `notice.txt` files.
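Conceptually, the transformation scans the saved collect output for NOTICE
lines, groups them per log, and writes each group to that log's notice.txt.
The sketch below is not the program from the commit above; the NOTICE matching
and the name-to-directory mapping are assumptions for illustration only.

```go
// migrate.go: a minimal sketch of the transformation described above. This is
// NOT the program from commit ad9fb496...; the NOTICE line matching and the
// name-to-directory mapping are illustrative assumptions only.
package main

import (
	"bufio"
	"log"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	noticeFile := "data/notes/collect.stderr" // assumed input location
	// Hypothetical mapping from log description to its directory name under
	// data/logs/ (one entry per monitored log).
	logDirs := map[string]string{
		"Google 'Argon2023' log": "e83ed0da3ef5063532e75728bc896bc903d3cbd1116beceb69e1777d6d06bd6e",
		// ...
	}

	f, err := os.Open(noticeFile)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Collect NOTICE lines per log description.
	perLog := make(map[string][]string)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "NOTICE") {
			continue
		}
		for name := range logDirs {
			if strings.Contains(line, name) {
				perLog[name] = append(perLog[name], line)
				break
			}
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Write each log's notices to its notice.txt file.
	for name, dir := range logDirs {
		notices := perLog[name]
		if len(notices) == 0 {
			log.Printf("%s: no notices", name)
			continue
		}
		out := filepath.Join("data", "logs", dir, "notice.txt")
		if err := os.WriteFile(out, []byte(strings.Join(notices, "\n")+"\n"), 0o644); err != nil {
			log.Fatal(err)
		}
		log.Printf("%s: %d notices", name, len(notices))
	}
}
```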
Steps to migrate:
- [x] Stop (ctrl+c, wait)
- [x] Move collect.{stdout,stderr} to data/notes/
- [x] `grep NOTICE data/notes/collect.stdout | wc -l` gives 6919 lines
- [x] Run the program in the above commit with the appropriate `directory` and
  `noticeFile` paths. See output below.
- [x] `wc -l $(find . -name notice.txt)` -> total says 6919 lines
- [x] `go install git.cs.kau.se/rasmoste/ct-sans@latest`, which downloaded v0.0.2
- [x] Run the same collect command as in note (1); this will not overwrite the
  previous collect files because they have been moved to data/notes/. In the
  future we will not need to store any of this, but we're doing it now just in
  case something goes wrong.
- [x] The only two logs that had entries left to download resumed.
Output from migrate program and sanity check:
$ go run .
2023/03/27 14:57:41 Google 'Argon2023' log: 608 notices
2023/03/27 14:57:41 Google 'Argon2024' log: 101 notices
2023/03/27 14:57:41 Google 'Xenon2023' log: 2119 notices
2023/03/27 14:57:41 Google 'Xenon2024' log: 170 notices
2023/03/27 14:57:41 Cloudflare 'Nimbus2023' Log: 2194 notices
2023/03/27 14:57:41 Cloudflare 'Nimbus2024' Log: 164 notices
2023/03/27 14:57:41 DigiCert Yeti2024 Log: 17 notices
2023/03/27 14:57:41 DigiCert Yeti2025 Log: no notices
2023/03/27 14:57:41 DigiCert Nessie2023 Log: 155 notices
2023/03/27 14:57:41 DigiCert Nessie2024 Log: 19 notices
2023/03/27 14:57:41 DigiCert Nessie2025 Log: no notices
2023/03/27 14:57:41 Sectigo 'Sabre' CT log: 1140 notices
2023/03/27 14:57:41 Let's Encrypt 'Oak2023' log: 156 notices
2023/03/27 14:57:41 Let's Encrypt 'Oak2024H1' log: 14 notices
2023/03/27 14:57:41 Let's Encrypt 'Oak2024H2' log: no notices
2023/03/27 14:57:41 Trust Asia Log2023: 62 notices
2023/03/27 14:57:41 Trust Asia Log2024-2: no notices
$ wc -l $(find . -name notice.txt)
101 ./data/logs/eecdd064d5db1acec55cb79db4cd13a23287467cbcecdec351485946711fb59b/notice.txt
14 ./data/logs/3b5377753e2db9804e8b305b06fe403b67d84fc3f4c7bd000d2d726fe1fad417/notice.txt
155 ./data/logs/b3737707e18450f86386d605a9dc11094a792db1670c0b87dcf0030e7936a59a/notice.txt
62 ./data/logs/e87ea7660bc26cf6002ef5725d3fe0e331b9393bb92fbf58eb3b9049daf5435a/notice.txt
164 ./data/logs/dab6bf6b3fb5b6229f9bc2bb5c6be87091716cbb51848534bda43d3048d7fbab/notice.txt
608 ./data/logs/e83ed0da3ef5063532e75728bc896bc903d3cbd1116beceb69e1777d6d06bd6e/notice.txt
156 ./data/logs/b73efb24df9c4dba75f239c5ba58f46c5dfc42cf7a9f35c49e1d098125edb499/notice.txt
1140 ./data/logs/5581d4c2169036014aea0b9b573c53f0c0e43878702508172fa3aa1d0713d30c/notice.txt
19 ./data/logs/73d99e891b4c9678a0207d479de6b2c61cd0515e71192a8c6b80107ac17772b5/notice.txt
170 ./data/logs/76ff883f0ab6fb9551c261ccf587ba34b4a4cdbb29dc68420a9fe6674c5a3a74/notice.txt
2119 ./data/logs/adf7befa7cff10c88b9d3d9c1e3e186ab467295dcfb10c24ca858634ebdc828a/notice.txt
2194 ./data/logs/7a328c54d8b72db620ea38e0521ee98416703213854d3bd22bc13a57a352eb52/notice.txt
17 ./data/logs/48b0e36bdaa647340fe56a02fa9d30eb1c5201cb56dd2c81d9bbbfab39d88473/notice.txt
6919 total
### 3
For some reason Nimbus2023 is stuck at
{"tree_size":512926523,"RootHash":[41,19,83,107,69,253,233,106,68,143,173,151,177,196,60,228,22,57,246,105,184,51,24,50,230,153,233,189,214,93,132,186]}
while trying to fetch until
{"sth_version":0,"tree_size":513025681,"timestamp":1679169572616,"sha256_root_hash":"0SzzS0M2RP5BHC6M9bvOPySYJadPi9nnk2Dsav4NKKs=","tree_head_signature":"BAMARjBEAiBXrmT+W2Ct+32DX/XL+YwS9Ut4rnOG6Y+A4Lxbf/6TogIgYEM32vweDC0QStwMq1PzIvm97cQhj6bUSdZWq/wMkNw=","log_id":"ejKMVNi3LbYg6jjgUh7phBZwMhOFTTvSK8E6V6NS61I="}
These tree heads are not inconsistent, and a restart should resolve the problem.
(There is likely a corner case somewhere that made the fetcher exit or halt. We
should debug this further at some point, but it has not happened more than once.)
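For a rough sense of how far behind the fetcher was, the gap between the two
tree sizes above can be computed directly (roughly 99k entries remained); a
throwaway sketch that only looks at the tree_size fields:

```go
// gap.go: compute how many entries remained between the stuck tree head and
// the target STH shown above (numbers copied from the JSON blobs).
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	local := []byte(`{"tree_size":512926523}`)  // stuck tree head
	target := []byte(`{"tree_size":513025681}`) // get-sth response

	var a, b struct {
		TreeSize uint64 `json:"tree_size"`
	}
	if err := json.Unmarshal(local, &a); err != nil {
		panic(err)
	}
	if err := json.Unmarshal(target, &b); err != nil {
		panic(err)
	}
	fmt.Println("entries left to fetch:", b.TreeSize-a.TreeSize) // 99158
}
```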
### 4
Quick overview:
$ vnstat -d
ens160 / daily
day rx | tx | total | avg. rate
------------------------+-------------+-------------+---------------
2023-03-18 1.49 TiB | 17.07 GiB | 1.51 TiB | 153.44 Mbit/s
2023-03-19 3.77 TiB | 41.21 GiB | 3.81 TiB | 387.83 Mbit/s
2023-03-20 3.09 TiB | 36.67 GiB | 3.13 TiB | 318.26 Mbit/s
2023-03-21 3.11 TiB | 32.24 GiB | 3.14 TiB | 319.61 Mbit/s
2023-03-22 2.08 TiB | 25.98 GiB | 2.10 TiB | 213.89 Mbit/s
2023-03-23 1.16 TiB | 15.59 GiB | 1.18 TiB | 119.97 Mbit/s
2023-03-24 1.17 TiB | 15.44 GiB | 1.18 TiB | 120.44 Mbit/s
2023-03-25 1.18 TiB | 15.72 GiB | 1.19 TiB | 121.55 Mbit/s
2023-03-26 707.47 GiB | 9.64 GiB | 717.11 GiB | 71.30 Mbit/s
2023-03-27 448.80 GiB | 6.43 GiB | 455.23 GiB | 45.26 Mbit/s
2023-03-28 451.49 GiB | 6.49 GiB | 457.98 GiB | 45.53 Mbit/s
2023-03-29 1.01 TiB | 12.73 GiB | 1.03 TiB | 104.45 Mbit/s
2023-03-30 256.75 GiB | 3.40 GiB | 260.15 GiB | 59.59 Mbit/s
------------------------+-------------+-------------+---------------
estimated 591.55 GiB | 7.84 GiB | 599.39 GiB |
### 5
Use at most 58GiB of RAM for sorting and 8 parallel sort workers; more workers
than this does not improve performance according to the [GNU sort manual][].
We also set `LC_ALL=C` to ensure a consistent sort order (see the manual).
$ export LC_ALL=C
$ ct-sans assemble -b 58 -p 8 >assemble.stdout
[GNU sort manual]: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
(We don't need to change the default directories, because the collected data is
stored in ./data and /tmp is a fine place to put things on our system.)
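As a rough illustration, the sort configuration described above corresponds to
something like the sketch below. Whether `ct-sans assemble` shells out to
sort(1) exactly like this is an assumption, and the file names are made up;
only the buffer size, parallelism, and `LC_ALL=C` reflect the note above.

```go
// sort.go: a minimal sketch of the external sort described above, assuming
// the SAN list is merged and deduplicated with GNU sort(1).
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("sort",
		"-u",                // unique output lines
		"--buffer-size=58G", // use at most 58GiB RAM for sorting
		"--parallel=8",      // 8 parallel sort workers
		"-o", "sans.lst",    // hypothetical output file
		"sans-unsorted.lst", // hypothetical input file
	)
	// LC_ALL=C gives a consistent byte-wise sort order.
	cmd.Env = append(os.Environ(), "LC_ALL=C")
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```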
### 6
There are 0.91B unique SANs in the 25.2GiB dataset (6.1GiB compressed):
$ du -shb data/archive/2023-04-03-ct-sans
27050799992 data/archive/2023-04-03-ct-sans
$ python3 -c "print(f'{27050799992 / 1024**3:.1f}GiB')"
25.2GiB
$ du -shb data/archive/2023-04-03-ct-sans.zip
6526876407 data/archive/2023-04-03-ct-sans.zip
$ python3 -c "print(f'{6526876407 / 1024**3:.1f}GiB')"
6.1GiB
$ wc -l data/archive/2023-04-03-ct-sans/sans.lst
907332515 data/archive/2023-04-03-ct-sans/sans.lst
$ python3 -c "print(f'{907332515 / 1000**3:.2f}B')"
0.91B
These SANs were found in 3.74B certificates from 17 CT logs:
$ grep "In total," data/archive/2023-04-03-ct-sans/README.md
In total, 3743244652 certificates were downloaded from 17 CT logs;
$ python3 -c "print(f'{3743244652 / 1000**3:.2f}B')"
3.74B
### 7
$ ct-sans snapshot >snapshot.stdout
$ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr
### 8
We decided to abort another round of ct-sans (and follow-up onion-grab)
measurements because it is not strictly needed to achieve our goals. If we want
to make more measurements for the sake of making the ct-sans data set available,
we should automate the process rather than running it manually as in this
timeline.
### 9
The upstream library that we're using doesn't enforce rate-limits on the
get-entries endpoint (despite the library having a codepath for this), see:
- <https://github.com/google/certificate-transparency-go/issues/898>
A fix was applied to ct-sans on 2025-03-12; use tag v0.1.0.
What this rate-limit bug means for us: our workers did not back off on HTTP
status 429. There was no log output for this kind of response, so it is
impossible to tell two years later whether we were seeing any status 429 or
not. What can be said is that we do receive such responses from Google and
Let's Encrypt when running with the same number of workers today. For Let's
Encrypt, throughput is roughly the same as when we measured, once the number of
workers is adjusted so that we see few or no status 429 responses. For Google,
we get roughly half the throughput compared to when we measured. Since we tuned
the number of workers [manually][] for each log two years ago (including
finding when more workers gave worse performance -- presumably because of too
many status 429 responses), we were probably not overshooting by a lot. But
again, it is hard to say.
[manually]: https://git.rgdd.se/ct-sans/tree/utils_ct.go?h=v0.0.2#n42
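For reference, the kind of backoff our workers lacked looks roughly like the
sketch below. This is an illustration only, not the actual fix in v0.1.0; the
log URL and retry policy are made up.

```go
// backoff.go: retry a get-entries request when the log answers HTTP 429.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"
)

func getEntries(c *http.Client, url string) ([]byte, error) {
	backoff := time.Second
	for attempt := 0; attempt < 5; attempt++ {
		rsp, err := c.Get(url)
		if err != nil {
			return nil, err
		}
		if rsp.StatusCode == http.StatusTooManyRequests {
			// Honor Retry-After if present, otherwise back off exponentially.
			if s := rsp.Header.Get("Retry-After"); s != "" {
				if secs, err := strconv.Atoi(s); err == nil {
					backoff = time.Duration(secs) * time.Second
				}
			}
			rsp.Body.Close()
			time.Sleep(backoff)
			backoff *= 2
			continue
		}
		defer rsp.Body.Close()
		if rsp.StatusCode != http.StatusOK {
			return nil, fmt.Errorf("get-entries: status %d", rsp.StatusCode)
		}
		return io.ReadAll(rsp.Body)
	}
	return nil, fmt.Errorf("get-entries: giving up after repeated 429s")
}

func main() {
	// Example usage with a hypothetical log URL and entry range.
	body, err := getEntries(http.DefaultClient,
		"https://ct.example.org/ct/v1/get-entries?start=0&end=255")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("got %d bytes\n", len(body))
}
```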
### 10
Code is hosted at <https://git.rgdd.se/ct-sans>.
The Go module name is `rgdd.se/ct-sans`, effective from v0.1.0 onward. It is
still possible to install older versions with the old `git.cs.kau.se` Go module
name (thanks to Go's module infrastructure), as shown in the notes above.