From e4d01585d9802a256d754072bdce2b855ae7d354 Mon Sep 17 00:00:00 2001 From: Rasmus Dahlberg Date: Fri, 7 Apr 2023 18:00:35 +0200 Subject: Add operations timeline --- README.md | 41 ++++++---- docs/operations.md | 218 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 244 insertions(+), 15 deletions(-) create mode 100644 docs/operations.md diff --git a/README.md b/README.md index 830c024..1017c02 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,9 @@ A tool that downloads certificates from [CT logs][] [recognized by Google Chrome][], storing the encountered [Subject Alternative Names (SANs)][] to disk. -The final data set `sans.lst` is de-duplicated and contains one SAN per line. +The dataset can be assembled so that it is de-duplicated with one SAN per line. + +**Availability:** [2023-04-03-ct-sans dataset](./docs/operations.md) [CT logs]: https://certificate.transparency.dev/ [recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto/ @@ -12,15 +14,23 @@ The final data set `sans.lst` is de-duplicated and contains one SAN per line. 
## Quick start -You will need a Go compiler and GNU sort on the local system: - $ which go || echo "Go compiler is not in $PATH" - $ which sort || echo "GNU sort is not in $PATH" +### Install + +You will need a [Go compiler][] and [GNU sort][] on the local system: + + $ which go >/dev/null || echo "Go compiler not in PATH" + $ which sort >/dev/null || echo "GNU sort not in PATH" + +[Go compiler]: https://go.dev/doc/install +[GNU sort]: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html Install `ct-sans`: $ go install git.cs.kau.se/rasmoste/ct-sans@latest - $ which ct-sans || echo "ct-sans is not in $PATH" + $ which ct-sans >/dev/null || echo "ct-sans not in PATH" + +### Snapshot Download and verify the signature of Google's list of known logs, then download and verify the signatures of the logs' tree heads: @@ -49,6 +59,8 @@ then download and verify the signatures of the logs' tree heads: Subsequent uses of the `snapshot` command will update the signed list of known logs, then update the logs' signed tree heads after verifying consistency. +### Collect + Download and verify the logs' Merkle trees up until the current snapshot: $ ct-sans collect -d $HOME/ct-sans-demo @@ -75,13 +87,10 @@ Download and verify the logs' Merkle trees up until the current snapshot: This will take a while depending on the local system, configuration of the optional `collect` flags, as well as how heavily the logs apply rate-limits. -For good performance while respecting rate-limits, you may want to try -`--workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m`. This allowed -us to download the logs (March 2023) in approximately 10 days. Our machine was -located in EU with 2TiB SSD, 64GiB memory, 16 CPU cores, and 1Gbps line-speed. +For reference, we [downloaded the logs](./docs/operations.md) from scratch in +less than 11 days using a single-IP machine that respects the logs' rate-limits. -Of note is that it is safe to ctrl+C while collecting.
Just wait for the -`collect` command to exit on its own so that things are persisted to disk. +### Assemble Once the collect phase is done, assemble the data set: @@ -117,11 +126,13 @@ Once the collect phase is done, assemble the data set: The SANs data set is sorted and de-duplicated, one SAN per line. -**Note:** the different `ct-sans` commands must not run at the same time. - -## Updating the data set +### Good to know -Simply run the same `snapshot`, `collect`, and `assemble` commands again. + - It is safe to ctrl+C while collecting. Just wait for the `collect` command + to exit on its own so that things are persisted to disk. + - The different `ct-sans` commands must not run at the same time. + - The dataset can be updated by running the same `snapshot`, `collect` and + `assemble` commands again. ## Contact diff --git a/docs/operations.md b/docs/operations.md new file mode 100644 index 0000000..458ec13 --- /dev/null +++ b/docs/operations.md @@ -0,0 +1,218 @@ +# Operations + +This document describes our ct-sans data collection, including information about +the local system and a timeline leading up to assembling the 2023-04-03 dataset. + +## Summary + +The initial download time for the current CT logs was 11 days (March 2023). The +time to assemble the final dataset of 0.91B unique SANs (25.2GiB) was 6 hours. + +The assembled data set can be downloaded [here](TODO). + +## Local system + +We're running Ubuntu in a VM: + + $ lsb_release -a + No LSB modules are available. 
+ Distributor ID: Ubuntu + Description: Ubuntu 22.04.2 LTS + Release: 22.04 + Codename: jammy + +Our VM is configured with 62.9GiB RAM, one CPU core with 32 CPU threads, and a +~2TiB SSD: + + $ grep MemTotal /proc/meminfo + MemTotal: 65948412 kB + $ grep -c processor /proc/cpuinfo + 32 + $ grep 'cpu cores' /proc/cpuinfo | uniq + cpu cores : 1 + $ df -BG /home + Filesystem 1G-blocks Used Available Use% Mounted on + /dev/mapper/ubuntu--vg-ubuntu--lv 2077G 220G 1772G 12% / + +This VM shares a 1x10Gbps link with other network VMs that we have no control +over. We installed `vnstat` to track our own bandwidth-usage over time: + + # apt install vnstat + # systemctl enable vnstat.service + # systemctl start vnstat.service + +We also installed Go version 1.20, see [install instructions][]: + + $ go version + go version go1.20.2 linux/amd64 + +[install instructions]: https://go.dev/doc/install + +The versions of `git.cs.kau.se/rasmoste/ct-sans@VERSION` are listed below. + +## Timeline + +| date | time (UTC) | event | notes | +| ---------- | ---------- | --------------------------- | ------------------------------------- | +| 2023/03/18 | 20:05:30 | snapshot and start collect | running v0.0.1, see command notes [1] | +| 2023/03/27 | 14:53:59 | stop collect, bump version | install v0.0.2, see migrate notes [2] | +| 2023/03/27 | 15:03:12 | start collect again | mainly waiting for Argon2023 now [3] | +| 2023/03/29 | 10:22:24 | collect completed | | +| 2023/03/29 | 15:46:44 | snapshot and collect again | download backlog from last 10 days | +| 2023/03/30 | 05:52:38 | collect completed | | +| 2023/03/30 | 08:58:50 | snapshot and collect again | download backlog from last ~16 hours | +| 2023/03/30 | 09:53:34 | collect completed | bandwidth usage statistics [4] | +| 2023/03/30 | 10:05:40 | start assemble | still running v0.0.2 [5] | +| 2023/03/30 | 16:06:39 | assemble done | 0.9B sans (25GiB, 7GiB zipped in 15m) | +| 2023/04/02 | 23:31:37 | snapshot and
collect again | download backlog, again | +| 2023/04/03 | 03:54:18 | collect completed | | +| 2023/04/03 | 08:52:28 | snapshot and collect again | final before assembling for real use | +| 2023/04/03 | 09:22:22 | collect completed | | +| 2023/04/03 | 09:30:00 | start assemble | [5] | +| 2023/04/03 | 16:12:38 | assemble done | 0.91B SANs (25.2GiB) from 3.74B certs | + +## Notes + +### 1 + + $ ct-sans snapshot >snapshot.stdout + $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr + +### 2 + +In addition to adding the assemble command, `v0.0.2` stores notice.txt files in +each log's directory automatically. This ensures that the output in stdout can +be discarded as opposed to being stored and managed manually in the long run +(e.g., grep for NOTICE prints when assembling data sets). + +Commit `ad9fb49670e28414637761bac4b8e8940e2d6770` includes a Go program that +transforms an existing `collect.stderr` file to `notice.txt` files. + +Steps to migrate: + + - [x] Stop (ctrl+c, wait) + - [x] Move collect.{stdout,stderr} to data/notes/ + - [x] `grep NOTICE data/notes/collect.stdout | wc -l` gives 6919 lines + - [x] run the program in the above commit with the appropriate `directory` and + `noticeFile` paths. See output below. + - [x] `wc -l $(find . -name notice.txt)` -> total says 6919 lines + - [x] go install git.cs.kau.se/rasmoste/ct-sans@latest, downloaded v0.0.2 + - [x] run the same collect command as in note (1); this will not overwrite the + previous collect files because they have been moved to data/notes/. In the + future we will not need to store any of this, but doing it now just in case + something goes wrong. + - [x] The only two logs that had entries left to download resumed + +Output from migrate program and sanity check: + + $ go run .
+ 2023/03/27 14:57:41 Google 'Argon2023' log: 608 notices + 2023/03/27 14:57:41 Google 'Argon2024' log: 101 notices + 2023/03/27 14:57:41 Google 'Xenon2023' log: 2119 notices + 2023/03/27 14:57:41 Google 'Xenon2024' log: 170 notices + 2023/03/27 14:57:41 Cloudflare 'Nimbus2023' Log: 2194 notices + 2023/03/27 14:57:41 Cloudflare 'Nimbus2024' Log: 164 notices + 2023/03/27 14:57:41 DigiCert Yeti2024 Log: 17 notices + 2023/03/27 14:57:41 DigiCert Yeti2025 Log: no notices + 2023/03/27 14:57:41 DigiCert Nessie2023 Log: 155 notices + 2023/03/27 14:57:41 DigiCert Nessie2024 Log: 19 notices + 2023/03/27 14:57:41 DigiCert Nessie2025 Log: no notices + 2023/03/27 14:57:41 Sectigo 'Sabre' CT log: 1140 notices + 2023/03/27 14:57:41 Let's Encrypt 'Oak2023' log: 156 notices + 2023/03/27 14:57:41 Let's Encrypt 'Oak2024H1' log: 14 notices + 2023/03/27 14:57:41 Let's Encrypt 'Oak2024H2' log: no notices + 2023/03/27 14:57:41 Trust Asia Log2023: 62 notices + 2023/03/27 14:57:41 Trust Asia Log2024-2: no notices + $ wc -l $(find . 
-name notice.txt) + 101 ./data/logs/eecdd064d5db1acec55cb79db4cd13a23287467cbcecdec351485946711fb59b/notice.txt + 14 ./data/logs/3b5377753e2db9804e8b305b06fe403b67d84fc3f4c7bd000d2d726fe1fad417/notice.txt + 155 ./data/logs/b3737707e18450f86386d605a9dc11094a792db1670c0b87dcf0030e7936a59a/notice.txt + 62 ./data/logs/e87ea7660bc26cf6002ef5725d3fe0e331b9393bb92fbf58eb3b9049daf5435a/notice.txt + 164 ./data/logs/dab6bf6b3fb5b6229f9bc2bb5c6be87091716cbb51848534bda43d3048d7fbab/notice.txt + 608 ./data/logs/e83ed0da3ef5063532e75728bc896bc903d3cbd1116beceb69e1777d6d06bd6e/notice.txt + 156 ./data/logs/b73efb24df9c4dba75f239c5ba58f46c5dfc42cf7a9f35c49e1d098125edb499/notice.txt + 1140 ./data/logs/5581d4c2169036014aea0b9b573c53f0c0e43878702508172fa3aa1d0713d30c/notice.txt + 19 ./data/logs/73d99e891b4c9678a0207d479de6b2c61cd0515e71192a8c6b80107ac17772b5/notice.txt + 170 ./data/logs/76ff883f0ab6fb9551c261ccf587ba34b4a4cdbb29dc68420a9fe6674c5a3a74/notice.txt + 2119 ./data/logs/adf7befa7cff10c88b9d3d9c1e3e186ab467295dcfb10c24ca858634ebdc828a/notice.txt + 2194 ./data/logs/7a328c54d8b72db620ea38e0521ee98416703213854d3bd22bc13a57a352eb52/notice.txt + 17 ./data/logs/48b0e36bdaa647340fe56a02fa9d30eb1c5201cb56dd2c81d9bbbfab39d88473/notice.txt + 6919 total + +### 3 + +For some reason Nimbus2023 is stuck at + + {"tree_size":512926523,"RootHash":[41,19,83,107,69,253,233,106,68,143,173,151,177,196,60,228,22,57,246,105,184,51,24,50,230,153,233,189,214,93,132,186]} + +while trying to fetch until + + {"sth_version":0,"tree_size":513025681,"timestamp":1679169572616,"sha256_root_hash":"0SzzS0M2RP5BHC6M9bvOPySYJadPi9nnk2Dsav4NKKs=","tree_head_signature":"BAMARjBEAiBXrmT+W2Ct+32DX/XL+YwS9Ut4rnOG6Y+A4Lxbf/6TogIgYEM32vweDC0QStwMq1PzIvm97cQhj6bUSdZWq/wMkNw=","log_id":"ejKMVNi3LbYg6jjgUh7phBZwMhOFTTvSK8E6V6NS61I="} + +These tree heads are not inconsistent, and a restart should resolve the problem. + +(There is likely a corner-case somewhere that made the fetcher exit or halt. 
We +should debug this further at some point, but it has not happened more than once.) + +### 4 + +Quick overview: + + $ vnstat -d + ens160 / daily + + day rx | tx | total | avg. rate + ------------------------+-------------+-------------+--------------- + 2023-03-18 1.49 TiB | 17.07 GiB | 1.51 TiB | 153.44 Mbit/s + 2023-03-19 3.77 TiB | 41.21 GiB | 3.81 TiB | 387.83 Mbit/s + 2023-03-20 3.09 TiB | 36.67 GiB | 3.13 TiB | 318.26 Mbit/s + 2023-03-21 3.11 TiB | 32.24 GiB | 3.14 TiB | 319.61 Mbit/s + 2023-03-22 2.08 TiB | 25.98 GiB | 2.10 TiB | 213.89 Mbit/s + 2023-03-23 1.16 TiB | 15.59 GiB | 1.18 TiB | 119.97 Mbit/s + 2023-03-24 1.17 TiB | 15.44 GiB | 1.18 TiB | 120.44 Mbit/s + 2023-03-25 1.18 TiB | 15.72 GiB | 1.19 TiB | 121.55 Mbit/s + 2023-03-26 707.47 GiB | 9.64 GiB | 717.11 GiB | 71.30 Mbit/s + 2023-03-27 448.80 GiB | 6.43 GiB | 455.23 GiB | 45.26 Mbit/s + 2023-03-28 451.49 GiB | 6.49 GiB | 457.98 GiB | 45.53 Mbit/s + 2023-03-29 1.01 TiB | 12.73 GiB | 1.03 TiB | 104.45 Mbit/s + 2023-03-30 256.75 GiB | 3.40 GiB | 260.15 GiB | 59.59 Mbit/s + ------------------------+-------------+-------------+--------------- + estimated 591.55 GiB | 7.84 GiB | 599.39 GiB | + +### 5 + +Use at most 58GiB RAM for sorting, 8 parallel sort workers. More than this does +not improve performance according to the [GNU sort manual][]. We're also +setting the `LC_ALL=C` variable to ensure consistent sort order (see man). + + $ export LC_ALL=C + $ ct-sans assemble -b 58 -p 8 >assemble.stdout + +[GNU sort manual]: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html + +(We don't need to change the default directories, because the collected data is +stored in ./data and /tmp is a fine place to put things on our system.)
+ +### 6 + +There are 0.91B unique SANs in the 25.2GiB dataset (6.1GiB compressed): + + $ du -shb data/archive/2023-04-03-ct-sans + 27050799992 data/archive/2023-04-03-ct-sans + $ python3 -c "print(f'{27050799992 / 1024**3:.1f}GiB')" + 25.2GiB + $ du -shb data/archive/2023-04-03-ct-sans.zip + 6526876407 data/archive/2023-04-03-ct-sans.zip + $ python3 -c "print(f'{6526876407 / 1024**3:.1f}GiB')" + 6.1GiB + $ wc -l data/archive/2023-04-03-ct-sans/sans.lst + 907332515 data/archive/2023-04-03-ct-sans/sans.lst + $ python3 -c "print(f'{907332515 / 1000**3:.2f}B')" + 0.91B + +These SANs were found in 3.74B certificates from 17 CT logs: + + $ grep "In total," data/archive/2023-04-03-ct-sans/README.md + In total, 3743244652 certificates were downloaded from 17 CT logs; + $ python3 -c "print(f'{3743244652 / 1000**3:.2f}B')" + 3.74B -- cgit v1.2.3
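
Review note (not part of the patch): the durations quoted in the README and in the summary of docs/operations.md can be rechecked from the timeline table. The following is a minimal sketch in the same spirit as the `python3 -c` one-liners in note 6. It assumes that the timestamps in the table are exact and in UTC; the helper name `elapsed` is hypothetical and exists only for this illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the timeline table, e.g. "2023/03/18 20:05:30".
FMT = "%Y/%m/%d %H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the wall-clock time between two timeline entries."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# First snapshot/collect until the first "collect completed" row.
# This includes the short stop for the v0.0.2 version bump.
initial_collect = elapsed("2023/03/18 20:05:30", "2023/03/29 10:22:24")

# The two assemble runs listed in the timeline.
first_assemble = elapsed("2023/03/30 10:05:40", "2023/03/30 16:06:39")
final_assemble = elapsed("2023/04/03 09:30:00", "2023/04/03 16:12:38")

print("initial collect:", initial_collect)
print("first assemble: ", first_assemble)
print("final assemble: ", final_assemble)
```

Under these assumptions, the initial collect took just under 11 days, which agrees with "less than 11 days" in the README. The first assemble took roughly 6 hours, and the final one was closer to 6 hours 45 minutes.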