From 9023b4e3fe70ada7d466589e753e69d13573c157 Mon Sep 17 00:00:00 2001
From: Rasmus Dahlberg
Date: Thu, 23 Mar 2023 13:15:06 +0100
Subject: Add proper README for the ct-sans tool only

---
 README.md | 277 ++++++++++++++++++++++++++++----------------------------------
 1 file changed, 123 insertions(+), 154 deletions(-)

diff --git a/README.md b/README.md
index 7941e20..f89fd7e 100644
--- a/README.md
+++ b/README.md
@@ -1,168 +1,137 @@
-# The Onion-Location world
+# ct-sans
 
-This document describes how to estimate the size of the [Onion-Location][]
-world. The intuition for obtaining an answer is as follows:
+A tool that downloads certificates from [CT logs][] [recognized by Google
+Chrome][], storing the encountered [Subject Alternative Names (SANs)][] to disk.
+The final data set `sans.lst` is de-duplicated and contains one SAN per line.
 
-  1. Onion-Location requires HTTPS. Therefore, a pretty complete list of
-     domains that _may_ offer Onion-Location can be determined by downloading
-     all [CT-logged certificates][] and checking which [SANs][] are in them.
-  2. Visit the encountered SANs over HTTPS without Tor, looking for if the web
-     server set either the Onion-Location HTTP header or HTML meta-tag.
+[CT logs]: https://certificate.transparency.dev/
+[recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto/
+[Subject Alternative Names (SANs)]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6/
 
-Please note that this is a _lower-bound estimate_, e.g., because not all web
-browsers enforce CT logging and SANs like `*.example.com` may have any number of
-subdomains with their own Onion-Location configured sites that we won't find.
+**Warning:** research prototype. The source code may also be moved.
 
-We start by describing the experimental setup and tools used, followed by
-results as well as availability of the collected and derived datasets.
+## Quick start
 
-[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
-[CT-logged certificates]: https://certificate.transparency.dev/
-[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6
+You will need a Go compiler and GNU sort on the local system:
 
-## Experimental setup
+    $ which go || echo "Go compiler is not in $PATH"
+    $ which sort || echo "GNU sort is not in $PATH"
 
-XXX: system(s) used to run the below.
+Install `ct-sans`:
 
-### ct-sans dataset
+    $ go install git.cs.kau.se/rasmoste/ct-sans@latest
+    $ which ct-sans || echo "ct-sans is not in $PATH"
 
-We put together a tool named [ct-sans][] that facilitates simple creation of a
-data set composed of unique SANs in [CT logs recognized by Google Chrome][].
+Download and verify the signature of Google's list of known logs,
+then download and verify the signatures of the logs' tree heads:
 
-[ct-sans]: XXX
-[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto
+    $ ct-sans snapshot -d $HOME/ct-sans-demo
+    2023/03/23 12:43:49 cmd_snapshot.go:30: INFO: updating metadata file
+    2023/03/23 12:43:49 cmd_snapshot.go:47: INFO: updating signed tree heads
+    2023/03/23 12:43:49 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2023' log at tree size 862104911
+    2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2024' log at tree size 55767940
+    2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2023' log at tree size 990277299
+    2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2024' log at tree size 66655425
+    2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2023' Log at tree size 527018586
+    2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2024' Log at tree size 34050592
+    2023/03/23 12:43:51 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2024 Log at tree size 38426463
+    2023/03/23 12:43:53 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2025 Log at tree size 697
+    2023/03/23 12:43:54 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2023 Log at tree size 200387219
+    2023/03/23 12:43:55 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2024 Log at tree size 40017666
+    2023/03/23 12:43:55 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2025 Log at tree size 704
+    2023/03/23 12:43:56 cmd_snapshot.go:82: INFO: bootstrapped Sectigo 'Sabre' CT log at tree size 229064032
+    2023/03/23 12:43:57 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2023' log at tree size 467618545
+    2023/03/23 12:43:57 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H1' log at tree size 34451205
+    2023/03/23 12:43:57 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H2' log at tree size 14680
+    2023/03/23 12:43:59 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2023 at tree size 388349
+    2023/03/23 12:44:01 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2024-2 at tree size 112771
 
-Install:
+Subsequent uses of the snapshot command will update the signed list of known
+logs, then update the logs' signed tree heads after verifying consistency.
 
-    $ go install ...ct-sans@latest
-    ...
-
-Make a snapshot of which CT logs and entries to download:
-
-    $ ct-sans snapshot -d $HOME/.config/ct-sans
-    ...
-
-Collect, or continue collecting if you decided to shutdown prematurely:
-
-    $ ct-sans collect -d $HOME/.config/ct-sans
-    ...
-
-The collected data is per-log, with each line containing a single SAN:
-
-    $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
-    example.org
-    www.example.org
-    *.example.net
-
-[Motivation: easy to maintain the data set, e.g., without needing any special
-indexing and can yearly delete log shards that only contain expired
-certificates with rm -rf.]
-
-The final data set of combined and non-duplicate SANs can be created with the
-UNIX tool `sort`. For the exact commands and an associated dataset manifest:
-
-    $ ct-sans package -d $HOME/.config/ct-sans
-    sort -u ...
-    ...
-
-    sort -Vuo sans.lst --buffer-size=1024K --temporary-directory=/tmp/t --parallel=2 a.lst b.lst
-
-Note that you may need to tweak the number of CPUs, available memory, and
-temporary disk space to be suitable for your own system.
-
-### zgrab2 website visits
-
-[ZGrab2][] is an application-level scanner that (among other things) can visit
-HTTPS sites to record encountered HTTP headers and web pages. This is exactly
-what we need to visit each site in our ct-sans dataset, however only saving the
-output if it indicates that the site is configured with Onion-Location.
-
-Install:
-
-    $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
-    ...
-
-Run:
-
-    $ zgrab ...
+Download and verify the logs' Merkle trees up until the current snapshot:
 
-XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
-
-[ZGrab2]: https://github.com/zmap/zgrab2
-
-### Results and data sets
-
-XXX
-
-### Remarks
-
-  - The ct-sans dataset can be updated by running the snapshot, collect, and
-    assemble commands again. (The snapshot command downloads the latest list of
-    CT logs and their signed tree heads to use as reference while collecting.)
-  - The entire zgrab2 scan needs to be conduced from scratch for each dataset
-    because sites may add or remove Onion-Location configurations at any time.
-  - The above does not make any attempts to visit the announced onion sites.
-
-### Santity check
-
-We will download ~3.4 * 10^9 certificates in total.
-
-We only store SANs, not complete certificates. Assume that each certificate has
-on average 256 bytes of SANs (1/6 of avg certificate size). Then:
-
-    256 * 3.4 * 10^9 = 0.8 TiB of SANs.
-
-We will also need temp disk space for sorting and removing duplicates; so if we
-could get a machine with 2TiB disk that should probably be more than enough.
-
-The data needed to be stored after website visits will be ~negligible.
-
-The more RAM and CPU workers we can get, the better. Same with bandwidth. For
-running this more continuously in the future, a less powerful machine should do.
-
-XXX: Tobias will request a machine from our department tomorrow, minimum 8 CPUs
-and 32GiB RAM and ~2TiB disk. Pitch easier if we do website visits with
-mullvad enabled, so we will do that.
-
-### Notes from starting ct-sans on our machine
-
-    $ go install git.cs.kau.se/rasmoste/ct-sans@v0.0.1
-    $ ct-sans snapshot >snapshot.stdout
-    $ cat snapshot.stdout
-    2023/03/18 20:05:30 cmd_snapshot.go:30: INFO: updating metadata file
-    2023/03/18 20:05:30 cmd_snapshot.go:47: INFO: updating signed tree heads
-    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2023' log at tree size 841710936
-    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2024' log at tree size 52689060
-    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2023' log at tree size 966608751
-    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2024' log at tree size 63025768
-    2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2023' Log at tree size 513025681
-    2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2024' Log at tree size 31749516
-    2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2024 Log at tree size 36293063
-    2023/03/18 20:05:32 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2025 Log at tree size 687
-    2023/03/18 20:05:33 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2023 Log at tree size 198950777
-    2023/03/18 20:05:33 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2024 Log at tree size 37773373
-    2023/03/18 20:05:34 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2025 Log at tree size 694
-    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Sectigo 'Sabre' CT log at tree size 228782818
-    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2023' log at tree size 457590228
-    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H1' log at tree size 32863293
-    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H2' log at tree size 13281
-    2023/03/18 20:05:36 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2023 at tree size 379756
-    2023/03/18 20:05:38 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2024-2 at tree size 110270
-    $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr
-
-Tail in a tmux pane:
-
-    $ tail -f collect.stdout
+    $ ct-sans collect -d $HOME/ct-sans-demo
     ...
-    $ head -n1 collect.stdout
-    2023/03/18 20:11:52 cmd_collect.go:150: INFO: collect is up-and-running, ctrl+C to exit
-
-Note that it is safe to ctrl+C, things are written to disk before exit. Just
-run `ct-sans collect` again to start collecting from where we left off.
-
-Tail in a tmux pane:
-
-    $ tail -f collect.stderr
-    ...
-
-With the above settings it takes 60m before the first status report is printed.
+    INFO: status update before shutdown
+
+    Google 'Argon2023' log | 162.5 entries/s | Estimated done in 1474.01 hours | Working on [11776, 862104911)
+    Google 'Argon2024' log | 157.5 entries/s | Estimated done in 98.31 hours | Working on [11584, 55767940)
+    Google 'Xenon2023' log | 472.6 entries/s | Estimated done in 582.01 hours | Working on [33888, 990277299)
+    Google 'Xenon2024' log | 458.5 entries/s | Estimated done in 40.37 hours | Working on [32896, 66655425)
+    Cloudflare 'Nimbus2023' Log | 276.1 entries/s | Estimated done in 530.24 hours | Working on [19328, 527018586)
+    Cloudflare 'Nimbus2024' Log | 301.2 entries/s | Estimated done in 31.39 hours | Working on [20736, 34050592)
+    DigiCert Yeti2024 Log | 379.1 entries/s | Estimated done in 28.14 hours | Working on [27520, 38426463)
+    DigiCert Yeti2025 Log | 0.0 entries/s | Estimated done in 0.00 hours | Working on [697, 697)
+    DigiCert Nessie2023 Log | 331.3 entries/s | Estimated done in 168.00 hours | Working on [23040, 200387219)
+    DigiCert Nessie2024 Log | 329.8 entries/s | Estimated done in 33.68 hours | Working on [21120, 40017666)
+    DigiCert Nessie2025 Log | 0.0 entries/s | Estimated done in 0.00 hours | Working on [704, 704)
+    Sectigo 'Sabre' CT log | 275.7 entries/s | Estimated done in 230.78 hours | Working on [19456, 229064032)
+    Let's Encrypt 'Oak2023' log | 462.8 entries/s | Estimated done in 280.67 hours | Working on [33664, 467618545)
+    Let's Encrypt 'Oak2024H1' log | 121.4 entries/s | Estimated done in 78.79 hours | Working on [5248, 34451205)
+    Let's Encrypt 'Oak2024H2' log | 0.0 entries/s | Estimated done in 0.00 hours | Working on [14680, 14680)
+    Trust Asia Log2023 | 215.8 entries/s | Estimated done in 0.48 hours | Working on [15872, 388349)
+    Trust Asia Log2024-2 | 246.2 entries/s | Estimated done in 0.11 hours | Working on [17664, 112771)
+
+This will take a while depending on the local system, configuration of the
+optional `ct-sans collect` flags, as well as how heavily the logs apply
+rate-limits. For good performance while respecting rate-limits, you may want
+to try `--workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m`. This
+allowed us to download the logs (March 2023) in approximately 10 days. Our
+single-IP EU machine had a 2 TiB SSD, 64 GiB memory, 16 CPU cores, and 1 Gbps.
+
+Of note is that it is safe to ctrl+C while collecting. Just wait for the
+collect command to exit on its own so that things are persisted to disk.
+
+Once the collect phase is done, assemble the data set:
+
+    $ echo "for demo-purposes, only Nessie2025 and Oak2024H2 are shown below"^C
+    $ ct-sans assemble -d $HOME/ct-sans-demo
+    2023/03/23 13:05:12 cmd_assemble.go:54: INFO: merging and de-duplicating 2 input files with GNU sort
+    2023/03/23 13:05:12 cmd_assemble.go:67: INFO: created /home/rgdd/ct-sans-demo/archive/2023-03-23-ct-sans/sans.lst (0.3 MiB)
+    2023/03/23 13:05:12 cmd_assemble.go:69: INFO: adding notice file
+    2023/03/23 13:05:12 cmd_assemble.go:87: INFO: adding README
+    2023/03/23 13:05:12 cmd_assemble.go:96: INFO: adding signed metadata file
+    2023/03/23 13:05:12 cmd_assemble.go:108: INFO: adding signed tree heads
+    2023/03/23 13:05:12 cmd_assemble.go:117: INFO: uncompressed dataset available in /home/rgdd/ct-sans-demo/archive/2023-03-23-ct-sans
+    $ cat $HOME/ct-sans-demo/archive/2023-03-23-ct-sans/README.md
+    # ct-sans dataset
+
+    Dataset assembled at Thu Mar 23 13:05:12 CET 2023. Contents:
+
+      - README.md
+      - metadata.json
+      - metadata.sig
+      - sths.json
+      - notice.txt
+      - sans.lst
+
+    The signed [metadata file][] and tree heads were downloaded at
+    Thu Mar 23 12:43:49 CET 2023.
+
+    [metadata file]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto
+
+    In total, 15377 certificates were downloaded from 2 CT logs;
+    0 certificates contained SANs that could not be parsed.
+    For more information about these errors, see notice.txt.
+
+    The SANs data set is sorted and de-duplicated, one SAN per line.
+
+**Note:** the different ct-sans commands must not run at the same time.
+
+## Updating the data set
+
+Simply run the same snapshot, collect, and assemble commands again.
+
+## Contact
+
+  - IRC: room #certificate-transparency at [OFTC.net][]
+  - Matrix: room [#certificate-transparency][] (bridged with IRC)
+  - Email: rasmus (at) rgdd (dot) se
+
+[OFTC.net]: https://www.oftc.net/
+[#certificate-transparency]: https://app.element.io/#/room/#sauteed-onions:matrix.org/
+
+## Licence
+
+BSD 2-Clause License
-- 
cgit v1.2.3
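
Reviewer note: the "Estimated done in" column in the new README's collect status output can be roughly cross-checked from the other two columns. The sketch below is an assumption about how such an estimate could be computed (remaining entries divided by the current rate), not the tool's actual code; `estimateHours` is a hypothetical helper name introduced only for illustration.

```go
package main

import "fmt"

// estimateHours is a hypothetical helper (not taken from ct-sans).
// It returns the hours needed to download entries in [next, size)
// at a constant rate given in entries per second.
// Simplification: a finished range or a non-positive rate yields 0,
// which matches the "0.00 hours" rows shown for fully collected logs.
func estimateHours(next, size uint64, rate float64) float64 {
	if next >= size || rate <= 0 {
		return 0
	}
	return float64(size-next) / rate / 3600
}

func main() {
	// Values copied from the Argon2023 row of the status output:
	// 162.5 entries/s, working on [11776, 862104911).
	fmt.Printf("%.2f hours\n", estimateHours(11776, 862104911, 162.5))
}
```

For the Argon2023 row this gives roughly 1473.7 hours, close to the 1474.01 hours in the README; the small gap is plausibly because the displayed rate is rounded to one decimal.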