# Operations

This document describes our ct-sans data collection, including information
about the local system and a timeline leading up to assembling the 2023-04-03
dataset.

## Summary

The initial download time for the current CT logs was 11 days (March 2023).

The time to assemble the final dataset of 0.91B unique SANs (25.2GiB) was 6
hours.

The assembled data set is available here:

- https://dart.cse.kau.se/ct-sans/2023-04-03-ct-sans.zip

## Local system

We're running Ubuntu in a VM:

    $ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 22.04.2 LTS
    Release:        22.04
    Codename:       jammy

Our VM is configured with 62.9GiB RAM, one CPU core with 32 CPU threads, and a
~2TiB SSD:

    $ grep MemTotal /proc/meminfo
    MemTotal:       65948412 kB
    $ grep -c processor /proc/cpuinfo
    32
    $ grep 'cpu cores' /proc/cpuinfo | uniq
    cpu cores       : 1
    $ df -BG /home
    Filesystem                        1G-blocks  Used Available Use% Mounted on
    /dev/mapper/ubuntu--vg-ubuntu--lv     2077G  220G     1772G  12% /

This VM shares a 1x10Gbps link with other network VMs that we have no control
over.  We installed `vnstat` to track our own bandwidth-usage over time:

    # apt install vnstat
    # systemctl enable vnstat.service
    # systemctl start vnstat.service

We also installed Go version 1.20, see [install instructions][]:

    $ go version
    go version go1.20.2 linux/amd64

[install instructions]: https://go.dev/doc/install

The versions of `git.cs.kau.se/rasmoste/ct-sans@VERSION` are listed below.

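
As a cross-check of the RAM figure quoted for the VM, the `MemTotal` value from
`/proc/meminfo` can be converted to GiB. The sketch below is illustrative only
and not part of ct-sans; the helper name `kib_to_gib` is made up, and it assumes
that the `kB` unit printed by `/proc/meminfo` means multiples of 1024 bytes.

```python
def kib_to_gib(kib: int) -> float:
    """Convert KiB (printed as 'kB' by /proc/meminfo) to GiB."""
    return kib / 1024**2


# MemTotal value taken from the grep output in the local system description.
mem_total_kib = 65948412

print(f"{kib_to_gib(mem_total_kib):.1f}GiB")  # prints 62.9GiB
```
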
## Timeline

| date       | time (UTC) | event                       | notes                                 |
| ---------- | ---------- | --------------------------- | ------------------------------------- |
| 2023/03/18 | 20:05:30   | snapshot and start collect  | running v0.0.1, see command notes [1] |
| 2023/03/27 | 14:53:59   | stop collect, bump version  | install v0.0.2, see migrate notes [2] |
| 2023/03/27 | 15:03:12   | start collect again         | mainly waiting for Argon2023 now [3]  |
| 2023/03/29 | 10:22:24   | collect completed           |                                       |
| 2023/03/29 | 15:46:44   | snapshot and collect again  | download backlog from last 10 days    |
| 2023/03/30 | 05:52:38   | collect completed           |                                       |
| 2023/03/30 | 08:58:50   | snapshot and collect again  | download backlog from last ~16 hours  |
| 2023/03/30 | 09:53:34   | collect completed           | bandwidth usage statistics [4]        |
| 2023/03/30 | 10:05:40   | start assemble              | still running v0.0.2 [5]              |
| 2023/03/30 | 16:06:39   | assemble done               | 0.9B sans (25GiB, 7GiB zipped in 15m) |
| 2023/04/02 | 23:31:37   | snapshot and collect again  | download backlog, again               |
| 2023/04/03 | 03:54:18   | collect completed           |                                       |
| 2023/04/03 | 08:52:28   | snapshot and collect again  | final before assembling for real use  |
| 2023/04/03 | 09:22:22   | collect completed           |                                       |
| 2023/04/03 | 09:30:00   | start assemble              | [5]                                   |
| 2023/04/03 | 16:12:38   | assemble done               | 0.91B SANs (25.2GiB) from 3.74B certs |
| 2024/02/10 | 09:10:20   | snapshot and start collect  | still running v0.0.2 [6]              |

## Notes

### 1

    $ ct-sans snapshot >snapshot.stdout
    $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr

### 2

In addition to adding the assemble command, `v0.0.2` stores notice.txt files in
each log's directory automatically.  This ensures that the output in stdout can
be discarded as opposed to being stored and managed manually in the long run
(e.g., grep for NOTICE prints when assembling data sets).

Commit `ad9fb49670e28414637761bac4b8e8940e2d6770` includes a Go program that
transforms an existing `collect.stdout` file to `notice.txt` files.

Steps to migrate:

- [x] Stop (ctrl+c, wait)
- [x] Move collect.{stdout,stderr} to data/notes/
- [x] `grep NOTICE data/notes/collect.stdout | wc -l` gives 6919 lines
- [x] Run the program in the above commit with the appropriate `directory` and
  `noticeFile` paths.  See output below.
- [x] `wc -l $(find . -name notice.txt)` -> total says 6919 lines
- [x] `go install git.cs.kau.se/rasmoste/ct-sans@latest`, downloaded v0.0.2
- [x] Run the same collect command as in note (1); this will not overwrite the
  previous collect files because they have been moved to data/notes/.  In the
  future we will not need to store any of this, but we are doing it now in case
  something goes wrong.
- [x] The only two logs that had entries left to download resumed

Output from the migrate program and sanity check:

    $ go run .
    2023/03/27 14:57:41 Google 'Argon2023' log: 608 notices
    2023/03/27 14:57:41 Google 'Argon2024' log: 101 notices
    2023/03/27 14:57:41 Google 'Xenon2023' log: 2119 notices
    2023/03/27 14:57:41 Google 'Xenon2024' log: 170 notices
    2023/03/27 14:57:41 Cloudflare 'Nimbus2023' Log: 2194 notices
    2023/03/27 14:57:41 Cloudflare 'Nimbus2024' Log: 164 notices
    2023/03/27 14:57:41 DigiCert Yeti2024 Log: 17 notices
    2023/03/27 14:57:41 DigiCert Yeti2025 Log: no notices
    2023/03/27 14:57:41 DigiCert Nessie2023 Log: 155 notices
    2023/03/27 14:57:41 DigiCert Nessie2024 Log: 19 notices
    2023/03/27 14:57:41 DigiCert Nessie2025 Log: no notices
    2023/03/27 14:57:41 Sectigo 'Sabre' CT log: 1140 notices
    2023/03/27 14:57:41 Let's Encrypt 'Oak2023' log: 156 notices
    2023/03/27 14:57:41 Let's Encrypt 'Oak2024H1' log: 14 notices
    2023/03/27 14:57:41 Let's Encrypt 'Oak2024H2' log: no notices
    2023/03/27 14:57:41 Trust Asia Log2023: 62 notices
    2023/03/27 14:57:41 Trust Asia Log2024-2: no notices
    $ wc -l $(find . -name notice.txt)
       101 ./data/logs/eecdd064d5db1acec55cb79db4cd13a23287467cbcecdec351485946711fb59b/notice.txt
        14 ./data/logs/3b5377753e2db9804e8b305b06fe403b67d84fc3f4c7bd000d2d726fe1fad417/notice.txt
       155 ./data/logs/b3737707e18450f86386d605a9dc11094a792db1670c0b87dcf0030e7936a59a/notice.txt
        62 ./data/logs/e87ea7660bc26cf6002ef5725d3fe0e331b9393bb92fbf58eb3b9049daf5435a/notice.txt
       164 ./data/logs/dab6bf6b3fb5b6229f9bc2bb5c6be87091716cbb51848534bda43d3048d7fbab/notice.txt
       608 ./data/logs/e83ed0da3ef5063532e75728bc896bc903d3cbd1116beceb69e1777d6d06bd6e/notice.txt
       156 ./data/logs/b73efb24df9c4dba75f239c5ba58f46c5dfc42cf7a9f35c49e1d098125edb499/notice.txt
      1140 ./data/logs/5581d4c2169036014aea0b9b573c53f0c0e43878702508172fa3aa1d0713d30c/notice.txt
        19 ./data/logs/73d99e891b4c9678a0207d479de6b2c61cd0515e71192a8c6b80107ac17772b5/notice.txt
       170 ./data/logs/76ff883f0ab6fb9551c261ccf587ba34b4a4cdbb29dc68420a9fe6674c5a3a74/notice.txt
      2119 ./data/logs/adf7befa7cff10c88b9d3d9c1e3e186ab467295dcfb10c24ca858634ebdc828a/notice.txt
      2194 ./data/logs/7a328c54d8b72db620ea38e0521ee98416703213854d3bd22bc13a57a352eb52/notice.txt
        17 ./data/logs/48b0e36bdaa647340fe56a02fa9d30eb1c5201cb56dd2c81d9bbbfab39d88473/notice.txt
      6919 total

### 3

For some reason Nimbus2023 is stuck at

    {"tree_size":512926523,"RootHash":[41,19,83,107,69,253,233,106,68,143,173,151,177,196,60,228,22,57,246,105,184,51,24,50,230,153,233,189,214,93,132,186]}

while trying to fetch until

    {"sth_version":0,"tree_size":513025681,"timestamp":1679169572616,"sha256_root_hash":"0SzzS0M2RP5BHC6M9bvOPySYJadPi9nnk2Dsav4NKKs=","tree_head_signature":"BAMARjBEAiBXrmT+W2Ct+32DX/XL+YwS9Ut4rnOG6Y+A4Lxbf/6TogIgYEM32vweDC0QStwMq1PzIvm97cQhj6bUSdZWq/wMkNw=","log_id":"ejKMVNi3LbYg6jjgUh7phBZwMhOFTTvSK8E6V6NS61I="}

These tree heads are not inconsistent, and a restart should resolve the problem.
(There is likely a corner-case somewhere that made the fetcher exit or halt.
We should debug this further at some point, but it has not happened more than
once.)

### 4

Quick overview:

    $ vnstat -d

     ens160  /  daily

              day        rx      |     tx      |    total    |   avg. rate
         ------------------------+-------------+-------------+---------------
         2023-03-18     1.49 TiB |   17.07 GiB |    1.51 TiB |  153.44 Mbit/s
         2023-03-19     3.77 TiB |   41.21 GiB |    3.81 TiB |  387.83 Mbit/s
         2023-03-20     3.09 TiB |   36.67 GiB |    3.13 TiB |  318.26 Mbit/s
         2023-03-21     3.11 TiB |   32.24 GiB |    3.14 TiB |  319.61 Mbit/s
         2023-03-22     2.08 TiB |   25.98 GiB |    2.10 TiB |  213.89 Mbit/s
         2023-03-23     1.16 TiB |   15.59 GiB |    1.18 TiB |  119.97 Mbit/s
         2023-03-24     1.17 TiB |   15.44 GiB |    1.18 TiB |  120.44 Mbit/s
         2023-03-25     1.18 TiB |   15.72 GiB |    1.19 TiB |  121.55 Mbit/s
         2023-03-26   707.47 GiB |    9.64 GiB |  717.11 GiB |   71.30 Mbit/s
         2023-03-27   448.80 GiB |    6.43 GiB |  455.23 GiB |   45.26 Mbit/s
         2023-03-28   451.49 GiB |    6.49 GiB |  457.98 GiB |   45.53 Mbit/s
         2023-03-29     1.01 TiB |   12.73 GiB |    1.03 TiB |  104.45 Mbit/s
         2023-03-30   256.75 GiB |    3.40 GiB |  260.15 GiB |   59.59 Mbit/s
         ------------------------+-------------+-------------+---------------
         estimated    591.55 GiB |    7.84 GiB |  599.39 GiB |

### 5

Use at most 58GiB RAM for sorting and 8 parallel sort workers.  Using more than
8 workers does not improve performance according to the [GNU sort manual][].
We're also setting the `LC_ALL=C` variable to ensure consistent sort order (see
`man sort`).

    $ export LC_ALL=C
    $ ct-sans assemble -b 58 -p 8 >assemble.stdout

[GNU sort manual]: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html

(We don't need to change the default directories, because the collected data is
stored in ./data and /tmp is a fine place to put things on our system.)

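
To illustrate why the locale is pinned before sorting, the sketch below compares
a byte-wise ordering, which is what `LC_ALL=C` gives GNU sort, with a
case-insensitive ordering that stands in for a locale-aware collation. It is a
minimal illustration and not how ct-sans is implemented: the function names and
sample host names are hypothetical, and dropping duplicates here is an
assumption made only to mirror the "unique SANs" wording.

```python
# Hypothetical sample names; real SANs come from the collected certificates.
names = [
    "b.example.org",
    "A.example.org",
    "a.example.org",
    "_dmarc.example.org",
    "a.example.org",  # duplicate on purpose
]


def c_locale_sort_unique(items):
    """Order by raw byte values (as sort does under LC_ALL=C), without duplicates."""
    return sorted(set(items), key=lambda s: s.encode("utf-8"))


def folded_sort_unique(items):
    """Stand-in for a locale-aware collation: compare case-insensitively first."""
    return sorted(set(items), key=lambda s: (s.casefold(), s))


print(c_locale_sort_unique(names))
print(folded_sort_unique(names))
```

The two orderings differ for the same input (for example, `A.example.org` sorts
before `_dmarc.example.org` byte-wise but after it when case is folded), which
is why a fixed locale matters when sorted files are compared or merged across
systems.
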
### 6

There are 0.91B unique SANs in the 25.2GiB dataset (6.1GiB compressed):

    $ du -shb data/archive/2023-04-03-ct-sans
    27050799992     data/archive/2023-04-03-ct-sans
    $ python3 -c "print(f'{27050799992 / 1024**3:.1f}GiB')"
    25.2GiB
    $ du -shb data/archive/2023-04-03-ct-sans.zip
    6526876407      data/archive/2023-04-03-ct-sans.zip
    $ python3 -c "print(f'{6526876407 / 1024**3:.1f}GiB')"
    6.1GiB
    $ wc -l data/archive/2023-04-03-ct-sans/sans.lst
    907332515 data/archive/2023-04-03-ct-sans/sans.lst
    $ python3 -c "print(f'{907332515 / 1000**3:.2f}B')"
    0.91B

These SANs were found in 3.74B certificates from 17 CT logs:

    $ grep "In total," data/archive/2023-04-03-ct-sans/README.md
    In total, 3743244652 certificates were downloaded from 17 CT logs;
    $ python3 -c "print(f'{3743244652 / 1000**3:.2f}B')"
    3.74B

### 6

    $ ct-sans snapshot >snapshot.stdout
    $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr