author Rasmus Dahlberg <rasmus@rgdd.se> 2023-04-07 18:00:35 +0200
committer Rasmus Dahlberg <rasmus@rgdd.se> 2023-04-07 18:00:35 +0200
commit e4d01585d9802a256d754072bdce2b855ae7d354 (patch)
tree 19fd64c9f5e23b75f38bf6f26434be05530ffd64
parent 85182f9a1007c46979a8f3be4acf165b27444d04 (diff)
Add operations timeline
-rw-r--r-- README.md 41
-rw-r--r-- docs/operations.md 218
2 files changed, 244 insertions, 15 deletions
diff --git a/README.md b/README.md
index 830c024..1017c02 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,9 @@
A tool that downloads certificates from [CT logs][] [recognized by Google
Chrome][], storing the encountered [Subject Alternative Names (SANs)][] to disk.
-The final data set `sans.lst` is de-duplicated and contains one SAN per line.
+The dataset can be assembled so that it is de-duplicated with one SAN per line.
+
+**Availability:** [2023-04-03-ct-sans dataset](./docs/operations.md)
[CT logs]: https://certificate.transparency.dev/
[recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto/
@@ -12,15 +14,23 @@ The final data set `sans.lst` is de-duplicated and contains one SAN per line.
## Quick start
-You will need a Go compiler and GNU sort on the local system:
- $ which go || echo "Go compiler is not in $PATH"
- $ which sort || echo "GNU sort is not in $PATH"
+### Install
+
+You will need a [Go compiler][] and [GNU sort][] on the local system:
+
+ $ which go >/dev/null || echo "Go compiler not in PATH"
+ $ which sort >/dev/null || echo "GNU sort not in PATH"
+
+[Go compiler]: https://go.dev/doc/install
+[GNU sort]: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
Install `ct-sans`:
$ go install git.cs.kau.se/rasmoste/ct-sans@latest
- $ which ct-sans || echo "ct-sans is not in $PATH"
+ $ which ct-sans >/dev/null || echo "ct-sans not in PATH"
+
+### Snapshot
Download and verify the signature of Google's list of known logs,
then download and verify the signatures of the logs' tree heads:
@@ -49,6 +59,8 @@ then download and verify the signatures of the logs' tree heads:
Subsequent uses of the `snapshot` command will update the signed list of known
logs, then update the logs' signed tree heads after verifying consistency.
+### Collect
+
Download and verify the logs' Merkle trees up until the current snapshot:
$ ct-sans collect -d $HOME/ct-sans-demo
@@ -75,13 +87,10 @@ Download and verify the logs' Merkle trees up until the current snapshot:
This will take a while depending on the local system, configuration of the
optional `collect` flags, as well as how heavily the logs apply rate-limits.
-For good performance while respecting rate-limits, you may want to try
-`--workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m`. This allowed
-us to download the logs (March 2023) in approximately 10 days. Our machine was
-located in EU with 2TiB SSD, 64GiB memory, 16 CPU cores, and 1Gbps line-speed.
+For reference, we [downloaded the logs](./docs/operations.md) from scratch in
+less than 11 days using a single-IP machine that respects the logs' rate-limits.
-Of note is that it is safe to ctrl+C while collecting. Just wait for the
-`collect` command to exit on its own so that things are persisted to disk.
+### Assemble
Once the collect phase is done, assemble the data set:
@@ -117,11 +126,13 @@ Once the collect phase is done, assemble the data set:
The SANs data set is sorted and de-duplicated, one SAN per line.
-**Note:** the different `ct-sans` commands must not run at the same time.
-
-## Updating the data set
+### Good to know
-Simply run the same `snapshot`, `collect`, and `assemble` commands again.
+ - It is safe to ctrl+C while collecting. Just wait for the `collect` command
+ to exit on its own so that things are persisted to disk.
+ - The different `ct-sans` commands must not run at the same time.
+ - The dataset can be updated by running the same `snapshot`, `collect` and
+ `assemble` commands again.
## Contact
diff --git a/docs/operations.md b/docs/operations.md
new file mode 100644
index 0000000..458ec13
--- /dev/null
+++ b/docs/operations.md
@@ -0,0 +1,218 @@
+# Operations
+
+This document describes our ct-sans data collection, including information about
+the local system and a timeline leading up to assembling the 2023-04-03 dataset.
+
+## Summary
+
+The initial download time for the current CT logs was 11 days (March 2023). The
+time to assemble the final dataset of 0.91B unique SANs (25.2GiB) was 6 hours.
+
+The assembled data set can be downloaded [here](TODO).
+
+## Local system
+
+We're running Ubuntu in a VM:
+
+ $ lsb_release -a
+ No LSB modules are available.
+ Distributor ID: Ubuntu
+ Description: Ubuntu 22.04.2 LTS
+ Release: 22.04
+ Codename: jammy
+
+Our VM is configured with 62.9GiB RAM, one CPU core with 32 CPU threads, and a
+~2TiB SSD:
+
+ $ grep MemTotal /proc/meminfo
+ MemTotal: 65948412 kB
+ $ grep -c processor /proc/cpuinfo
+ 32
+ $ grep 'cpu cores' /proc/cpuinfo | uniq
+ cpu cores : 1
+ $ df -BG /home
+ Filesystem 1G-blocks Used Available Use% Mounted on
+ /dev/mapper/ubuntu--vg-ubuntu--lv 2077G 220G 1772G 12% /
+
+This VM shares a 1x10Gbps link with other VMs on the network that we have no
+control over. We installed `vnstat` to track our own bandwidth usage over time:
+
+ # apt install vnstat
+ # systemctl enable vnstat.service
+ # systemctl start vnstat.service
+
+We also installed Go version 1.20, see [install instructions][]:
+
+ $ go version
+ go version go1.20.2 linux/amd64
+
+[install instructions]: https://go.dev/doc/install
+
+The versions of `git.cs.kau.se/rasmoste/ct-sans@VERSION` are listed below.
+
+## Timeline
+
+| date | time (UTC) | event | notes |
+| ---------- | ---------- | --------------------------- | ------------------------------------- |
+| 2023/03/18 | 20:05:30 | snapshot and start collect | running v0.0.1, see command notes [1] |
+| 2023/03/27 | 14:53:59 | stop collect, bump version | install v0.0.2, see migrate notes [2] |
+| 2023/03/27 | 15:03:12 | start collect again | mainly waiting for Argon2023 now [3] |
+| 2023/03/29 | 10:22:24 | collect completed | |
+| 2023/03/29 | 15:46:44 | snapshot and collect again | download backlog from last 10 days |
+| 2023/03/30 | 05:52:38 | collect completed | |
+| 2023/03/30 | 08:58:50 | snapshot and collect again | download backlog from last ~16 hours |
+| 2023/03/30 | 09:53:34 | collect completed | bandwidth usage statistics [4] |
+| 2023/03/30 | 10:05:40 | start assemble | still running v0.0.2 [5] |
+| 2023/03/30 | 16:06:39 | assemble done | 0.9B sans (25GiB, 7GiB zipped in 15m) |
+| 2023/04/02 | 23:31:37 | snapshot and collect again | download backlog, again |
+| 2023/04/03 | 03:54:18 | collect completed | |
+| 2023/04/03 | 08:52:28 | snapshot and collect again | final before assembling for real use |
+| 2023/04/03 | 09:22:22 | collect completed | |
+| 2023/04/03 | 09:30:00 | start assemble | [5] |
+| 2023/04/03 | 16:12:38 | assemble done | 0.91B SANs (25.2GiB) from 3.74B certs |
+
+## Notes
+
+### 1
+
+ $ ct-sans snapshot >snapshot.stdout
+ $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr
+
+### 2
+
+In addition to adding the assemble command, `v0.0.2` stores notice.txt files in
+each log's directory automatically. This ensures that the output in stdout can
+be discarded as opposed to being stored and managed manually in the long run
+(e.g., grep for NOTICE prints when assembling data sets).
+
+Commit `ad9fb49670e28414637761bac4b8e8940e2d6770` includes a Go program that
+transforms an existing `collect.stderr` file to `notice.txt` files.
+
+Steps to migrate:
+
+ - [x] Stop (ctrl+c, wait)
+ - [x] Move collect.{stdout,stderr} to data/notes/
+ - [x] `grep NOTICE data/notes/collect.stdout | wc -l` gives 6919 lines
+ - [x] run the program in the above commit with the appropriate `directory` and
+ `noticeFile` paths. See output below.
+ - [x] `wc -l $(find . -name notice.txt)` -> total says 6919 lines
+ - [x] go install git.cs.kau.se/rasmoste/ct-sans@latest, downloaded v0.0.2
+ - [x] run the same collect command as in note (1); this will not overwrite the
+ previous collect files because they have been moved to data/notes/. In the
+ future we will not need to store any of this, but doing it now just in case
+ something goes wrong.
+ - [x] The only two logs that had entries left to download resumed
+
+Output from migrate program and sanity check:
+
+ $ go run .
+ 2023/03/27 14:57:41 Google 'Argon2023' log: 608 notices
+ 2023/03/27 14:57:41 Google 'Argon2024' log: 101 notices
+ 2023/03/27 14:57:41 Google 'Xenon2023' log: 2119 notices
+ 2023/03/27 14:57:41 Google 'Xenon2024' log: 170 notices
+ 2023/03/27 14:57:41 Cloudflare 'Nimbus2023' Log: 2194 notices
+ 2023/03/27 14:57:41 Cloudflare 'Nimbus2024' Log: 164 notices
+ 2023/03/27 14:57:41 DigiCert Yeti2024 Log: 17 notices
+ 2023/03/27 14:57:41 DigiCert Yeti2025 Log: no notices
+ 2023/03/27 14:57:41 DigiCert Nessie2023 Log: 155 notices
+ 2023/03/27 14:57:41 DigiCert Nessie2024 Log: 19 notices
+ 2023/03/27 14:57:41 DigiCert Nessie2025 Log: no notices
+ 2023/03/27 14:57:41 Sectigo 'Sabre' CT log: 1140 notices
+ 2023/03/27 14:57:41 Let's Encrypt 'Oak2023' log: 156 notices
+ 2023/03/27 14:57:41 Let's Encrypt 'Oak2024H1' log: 14 notices
+ 2023/03/27 14:57:41 Let's Encrypt 'Oak2024H2' log: no notices
+ 2023/03/27 14:57:41 Trust Asia Log2023: 62 notices
+ 2023/03/27 14:57:41 Trust Asia Log2024-2: no notices
+ $ wc -l $(find . -name notice.txt)
+ 101 ./data/logs/eecdd064d5db1acec55cb79db4cd13a23287467cbcecdec351485946711fb59b/notice.txt
+ 14 ./data/logs/3b5377753e2db9804e8b305b06fe403b67d84fc3f4c7bd000d2d726fe1fad417/notice.txt
+ 155 ./data/logs/b3737707e18450f86386d605a9dc11094a792db1670c0b87dcf0030e7936a59a/notice.txt
+ 62 ./data/logs/e87ea7660bc26cf6002ef5725d3fe0e331b9393bb92fbf58eb3b9049daf5435a/notice.txt
+ 164 ./data/logs/dab6bf6b3fb5b6229f9bc2bb5c6be87091716cbb51848534bda43d3048d7fbab/notice.txt
+ 608 ./data/logs/e83ed0da3ef5063532e75728bc896bc903d3cbd1116beceb69e1777d6d06bd6e/notice.txt
+ 156 ./data/logs/b73efb24df9c4dba75f239c5ba58f46c5dfc42cf7a9f35c49e1d098125edb499/notice.txt
+ 1140 ./data/logs/5581d4c2169036014aea0b9b573c53f0c0e43878702508172fa3aa1d0713d30c/notice.txt
+ 19 ./data/logs/73d99e891b4c9678a0207d479de6b2c61cd0515e71192a8c6b80107ac17772b5/notice.txt
+ 170 ./data/logs/76ff883f0ab6fb9551c261ccf587ba34b4a4cdbb29dc68420a9fe6674c5a3a74/notice.txt
+ 2119 ./data/logs/adf7befa7cff10c88b9d3d9c1e3e186ab467295dcfb10c24ca858634ebdc828a/notice.txt
+ 2194 ./data/logs/7a328c54d8b72db620ea38e0521ee98416703213854d3bd22bc13a57a352eb52/notice.txt
+ 17 ./data/logs/48b0e36bdaa647340fe56a02fa9d30eb1c5201cb56dd2c81d9bbbfab39d88473/notice.txt
+ 6919 total
+
+### 3
+
+For some reason Nimbus2023 is stuck at
+
+ {"tree_size":512926523,"RootHash":[41,19,83,107,69,253,233,106,68,143,173,151,177,196,60,228,22,57,246,105,184,51,24,50,230,153,233,189,214,93,132,186]}
+
+while trying to fetch until
+
+ {"sth_version":0,"tree_size":513025681,"timestamp":1679169572616,"sha256_root_hash":"0SzzS0M2RP5BHC6M9bvOPySYJadPi9nnk2Dsav4NKKs=","tree_head_signature":"BAMARjBEAiBXrmT+W2Ct+32DX/XL+YwS9Ut4rnOG6Y+A4Lxbf/6TogIgYEM32vweDC0QStwMq1PzIvm97cQhj6bUSdZWq/wMkNw=","log_id":"ejKMVNi3LbYg6jjgUh7phBZwMhOFTTvSK8E6V6NS61I="}
+
+These tree heads are not inconsistent, and a restart should resolve the problem.
+
+(There is likely a corner-case somewhere that made the fetcher exit or halt. We
+should debug this further at some point, but it has not happened more than once.)
+
+### 4
+
+Quick overview:
+
+ $ vnstat -d
+ ens160 / daily
+
+ day rx | tx | total | avg. rate
+ ------------------------+-------------+-------------+---------------
+ 2023-03-18 1.49 TiB | 17.07 GiB | 1.51 TiB | 153.44 Mbit/s
+ 2023-03-19 3.77 TiB | 41.21 GiB | 3.81 TiB | 387.83 Mbit/s
+ 2023-03-20 3.09 TiB | 36.67 GiB | 3.13 TiB | 318.26 Mbit/s
+ 2023-03-21 3.11 TiB | 32.24 GiB | 3.14 TiB | 319.61 Mbit/s
+ 2023-03-22 2.08 TiB | 25.98 GiB | 2.10 TiB | 213.89 Mbit/s
+ 2023-03-23 1.16 TiB | 15.59 GiB | 1.18 TiB | 119.97 Mbit/s
+ 2023-03-24 1.17 TiB | 15.44 GiB | 1.18 TiB | 120.44 Mbit/s
+ 2023-03-25 1.18 TiB | 15.72 GiB | 1.19 TiB | 121.55 Mbit/s
+ 2023-03-26 707.47 GiB | 9.64 GiB | 717.11 GiB | 71.30 Mbit/s
+ 2023-03-27 448.80 GiB | 6.43 GiB | 455.23 GiB | 45.26 Mbit/s
+ 2023-03-28 451.49 GiB | 6.49 GiB | 457.98 GiB | 45.53 Mbit/s
+ 2023-03-29 1.01 TiB | 12.73 GiB | 1.03 TiB | 104.45 Mbit/s
+ 2023-03-30 256.75 GiB | 3.40 GiB | 260.15 GiB | 59.59 Mbit/s
+ ------------------------+-------------+-------------+---------------
+ estimated 591.55 GiB | 7.84 GiB | 599.39 GiB |
+
+### 5
+
+Use at most 58GiB RAM for sorting and 8 parallel sort workers. Using more than 8
+workers does not improve performance according to the [GNU sort manual][]. We're
+also setting the `LC_ALL=C` variable to ensure consistent sort order (see `man sort`).
+
+ $ export LC_ALL=C
+ $ ct-sans assemble -b 58 -p 8 >assemble.stdout
+
+[GNU sort manual]: https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
+
+(We don't need to change the default directories, because the collected data is
+stored in ./data and /tmp is a fine place to put things on our system.)
+
+### 6
+
+There are 0.91B unique SANs in the 25.2GiB dataset (6.1GiB compressed):
+
+ $ du -shb data/archive/2023-04-03-ct-sans
+ 27050799992 data/archive/2023-04-03-ct-sans
+ $ python3 -c "print(f'{27050799992 / 1024**3:.1f}GiB')"
+ 25.2GiB
+ $ du -shb data/archive/2023-04-03-ct-sans.zip
+ 6526876407 data/archive/2023-04-03-ct-sans.zip
+ $ python3 -c "print(f'{6526876407 / 1024**3:.1f}GiB')"
+ 6.1GiB
+ $ wc -l data/archive/2023-04-03-ct-sans/sans.lst
+ 907332515 data/archive/2023-04-03-ct-sans/sans.lst
+ $ python3 -c "print(f'{907332515 / 1000**3:.2f}B')"
+ 0.91B
+
+These SANs were found in 3.74B certificates from 17 CT logs:
+
+ $ grep "In total," data/archive/2023-04-03-ct-sans/README.md
+ In total, 3743244652 certificates were downloaded from 17 CT logs;
+ $ python3 -c "print(f'{3743244652 / 1000**3:.2f}B')"
+ 3.74B
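
The durations and rounded figures quoted in docs/operations.md can be re-derived from the timeline and from the byte and line counts in note 6. The following is a minimal sketch, assuming the timestamps in the timeline table are UTC wall-clock times. The helper names `elapsed`, `gib`, and `billions` are illustrative only and are not part of ct-sans.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the timeline table (assumed to be UTC).
FMT = "%Y/%m/%d %H:%M:%S"


def elapsed(start: str, end: str) -> timedelta:
    """Return the wall-clock time between two timeline entries."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


def gib(n_bytes: int) -> str:
    """Format a byte count in GiB, as the python3 one-liners in note 6 do."""
    return f"{n_bytes / 1024**3:.1f}GiB"


def billions(count: int) -> str:
    """Format a count in billions (10^9), as in note 6."""
    return f"{count / 1000**3:.2f}B"


# First full download: first "start collect" until the first "collect completed".
# This includes the short stop for the v0.0.2 migration described in note 2.
first_collect = elapsed("2023/03/18 20:05:30", "2023/03/29 10:22:24")

# Final assemble run on 2023/04/03.
final_assemble = elapsed("2023/04/03 09:30:00", "2023/04/03 16:12:38")

print("first collect: ", first_collect)
print("final assemble:", final_assemble)
print("dataset size:  ", gib(27050799992), "zipped:", gib(6526876407))
print("unique SANs:   ", billions(907332515))
print("certificates:  ", billions(3743244652))
```

The first collect comes out at just under 11 days, which is consistent with the "less than 11 days" statement in the README.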