-rw-r--r-- README.md 277
1 files changed, 123 insertions, 154 deletions
diff --git a/README.md b/README.md
index 7941e20..f89fd7e 100644
--- a/README.md
+++ b/README.md
@@ -1,168 +1,137 @@
-# The Onion-Location world
+# ct-sans
-This document describes how to estimate the size of the [Onion-Location][]
-world. The intuition for obtaining an answer is as follows:
+A tool that downloads certificates from [CT logs][] [recognized by Google
+Chrome][], storing the encountered [Subject Alternative Names (SANs)][] to disk.
+The final data set `sans.lst` is de-duplicated and contains one SAN per line.
- 1. Onion-Location requires HTTPS. Therefore, a pretty complete list of
- domains that _may_ offer Onion-Location can be determined by downloading
- all [CT-logged certificates][] and checking which [SANs][] are in them.
- 2. Visit the encountered SANs over HTTPS without Tor, checking whether the web
- server sets either the Onion-Location HTTP header or HTML meta-tag.
+[CT logs]: https://certificate.transparency.dev/
+[recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto/
+[Subject Alternative Names (SANs)]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6/
-Please note that this is a _lower-bound estimate_, e.g., because not all web
-browsers enforce CT logging and SANs like `*.example.com` may have any number of
-subdomains with their own Onion-Location configured sites that we won't find.
+**Warning:** research prototype. The source code may also be moved.
-We start by describing the experimental setup and tools used, followed by
-results as well as availability of the collected and derived datasets.
+## Quick start
-[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
-[CT-logged certificates]: https://certificate.transparency.dev/
-[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6
+You will need a Go compiler and GNU sort on the local system:
-## Experimental setup
+ $ which go || echo "Go compiler is not in $PATH"
+ $ which sort || echo "GNU sort is not in $PATH"
-XXX: system(s) used to run the below.
+Install `ct-sans`:
-### ct-sans dataset
+ $ go install git.cs.kau.se/rasmoste/ct-sans@latest
+ $ which ct-sans || echo "ct-sans is not in $PATH"
-We put together a tool named [ct-sans][] that facilitates simple creation of a
-data set composed of unique SANs in [CT logs recognized by Google Chrome][].
+Download and verify the signature of Google's list of known logs,
+then download and verify the signatures of the logs' tree heads:
-[ct-sans]: XXX
-[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto
+ $ ct-sans snapshot -d $HOME/ct-sans-demo
+ 2023/03/23 12:43:49 cmd_snapshot.go:30: INFO: updating metadata file
+ 2023/03/23 12:43:49 cmd_snapshot.go:47: INFO: updating signed tree heads
+ 2023/03/23 12:43:49 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2023' log at tree size 862104911
+ 2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2024' log at tree size 55767940
+ 2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2023' log at tree size 990277299
+ 2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2024' log at tree size 66655425
+ 2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2023' Log at tree size 527018586
+ 2023/03/23 12:43:50 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2024' Log at tree size 34050592
+ 2023/03/23 12:43:51 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2024 Log at tree size 38426463
+ 2023/03/23 12:43:53 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2025 Log at tree size 697
+ 2023/03/23 12:43:54 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2023 Log at tree size 200387219
+ 2023/03/23 12:43:55 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2024 Log at tree size 40017666
+ 2023/03/23 12:43:55 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2025 Log at tree size 704
+ 2023/03/23 12:43:56 cmd_snapshot.go:82: INFO: bootstrapped Sectigo 'Sabre' CT log at tree size 229064032
+ 2023/03/23 12:43:57 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2023' log at tree size 467618545
+ 2023/03/23 12:43:57 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H1' log at tree size 34451205
+ 2023/03/23 12:43:57 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H2' log at tree size 14680
+ 2023/03/23 12:43:59 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2023 at tree size 388349
+ 2023/03/23 12:44:01 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2024-2 at tree size 112771
-Install:
+Subsequent uses of the snapshot command will update the signed list of known
+logs, then update the logs' signed tree heads after verifying consistency.
- $ go install ...ct-sans@latest
- ...
-
-Make a snapshot of which CT logs and entries to download:
-
- $ ct-sans snapshot -d $HOME/.config/ct-sans
- ...
-
-Collect, or continue collecting if you decided to shutdown prematurely:
-
- $ ct-sans collect -d $HOME/.config/ct-sans
- ...
-
-The collected data is per-log, with each line containing a single SAN:
-
- $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
- example.org
- www.example.org
- *.example.net
-
-[Motivation: easy to maintain the data set, e.g., without needing any special
-indexing and can yearly delete log shards that only contain expired
-certificates with rm -rf.]
-
-The final data set of combined and non-duplicate SANs can be created with the
-UNIX tool `sort`. For the exact commands and an associated dataset manifest:
-
- $ ct-sans package -d $HOME/.config/ct-sans
- sort -u ...
- ...
-
- sort -Vuo sans.lst --buffer-size=1024K --temporary-directory=/tmp/t --parallel=2 a.lst b.lst
-
-Note that you may need to tweak the number of CPUs, available memory, and
-temporary disk space to be suitable for your own system.
-
-### zgrab2 website visits
-
-[ZGrab2][] is an application-level scanner that (among other things) can visit
-HTTPS sites to record encountered HTTP headers and web pages. This is exactly
-what we need to visit each site in our ct-sans dataset, however only saving the
-output if it indicates that the site is configured with Onion-Location.
-
-Install:
-
- $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
- ...
-
-Run:
-
- $ zgrab ...
+Download and verify the logs' Merkle trees up until the current snapshot:
-XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
-
-[ZGrab2]: https://github.com/zmap/zgrab2
-
-### Results and data sets
-
-XXX
-
-### Remarks
-
- - The ct-sans dataset can be updated by running the snapshot, collect, and
- assemble commands again. (The snapshot command downloads the latest list of
- CT logs and their signed tree heads to use as reference while collecting.)
- - The entire zgrab2 scan needs to be conducted from scratch for each dataset
- because sites may add or remove Onion-Location configurations at any time.
- - The above does not make any attempts to visit the announced onion sites.
-
-### Sanity check
-
-We will download ~3.4 * 10^9 certificates in total.
-
-We only store SANs, not complete certificates. Assume that each certificate has
-on average 256 bytes of SANs (1/6 of avg certificate size). Then:
-
- 256 * 3.4 * 10^9 = 0.8 TiB of SANs.
-
-We will also need temp disk space for sorting and removing duplicates; so if we
-could get a machine with 2TiB disk that should probably be more than enough.
-
-The data needed to be stored after website visits will be ~negligible.
-
-The more RAM and CPU workers we can get, the better. Same with bandwidth. For
-running this more continuously in the future, a less powerful machine should do.
-
-XXX: Tobias will request a machine from our department tomorrow, minimum 8 CPUs
-and 32GiB RAM and ~2TiB disk. Pitch easier if we do website visits with
-mullvad enabled, so we will do that.
-
-### Notes from starting ct-sans on our machine
-
- $ go install git.cs.kau.se/rasmoste/ct-sans@v0.0.1
- $ ct-sans snapshot >snapshot.stdout
- $ cat snapshot.stdout
- 2023/03/18 20:05:30 cmd_snapshot.go:30: INFO: updating metadata file
- 2023/03/18 20:05:30 cmd_snapshot.go:47: INFO: updating signed tree heads
- 2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2023' log at tree size 841710936
- 2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2024' log at tree size 52689060
- 2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2023' log at tree size 966608751
- 2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2024' log at tree size 63025768
- 2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2023' Log at tree size 513025681
- 2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2024' Log at tree size 31749516
- 2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2024 Log at tree size 36293063
- 2023/03/18 20:05:32 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2025 Log at tree size 687
- 2023/03/18 20:05:33 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2023 Log at tree size 198950777
- 2023/03/18 20:05:33 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2024 Log at tree size 37773373
- 2023/03/18 20:05:34 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2025 Log at tree size 694
- 2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Sectigo 'Sabre' CT log at tree size 228782818
- 2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2023' log at tree size 457590228
- 2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H1' log at tree size 32863293
- 2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H2' log at tree size 13281
- 2023/03/18 20:05:36 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2023 at tree size 379756
- 2023/03/18 20:05:38 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2024-2 at tree size 110270
- $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr
-
-Tail in a tmux pane:
-
- $ tail -f collect.stdout
+ $ ct-sans collect -d $HOME/ct-sans-demo
...
- $ head -n1 collect.stdout
- 2023/03/18 20:11:52 cmd_collect.go:150: INFO: collect is up-and-running, ctrl+C to exit
-
-Note that it is safe to ctrl+C, things are written to disk before exit. Just
-run `ct-sans collect` again to start collecting from where we left off.
-
-Tail in a tmux pane:
-
- $ tail -f collect.stderr
- ...
-
-With the above settings it takes 60m before the first status report is printed.
+ INFO: status update before shutdown
+
+ Google 'Argon2023' log | 162.5 entries/s | Estimated done in 1474.01 hours | Working on [11776, 862104911)
+ Google 'Argon2024' log | 157.5 entries/s | Estimated done in 98.31 hours | Working on [11584, 55767940)
+ Google 'Xenon2023' log | 472.6 entries/s | Estimated done in 582.01 hours | Working on [33888, 990277299)
+ Google 'Xenon2024' log | 458.5 entries/s | Estimated done in 40.37 hours | Working on [32896, 66655425)
+ Cloudflare 'Nimbus2023' Log | 276.1 entries/s | Estimated done in 530.24 hours | Working on [19328, 527018586)
+ Cloudflare 'Nimbus2024' Log | 301.2 entries/s | Estimated done in 31.39 hours | Working on [20736, 34050592)
+ DigiCert Yeti2024 Log | 379.1 entries/s | Estimated done in 28.14 hours | Working on [27520, 38426463)
+ DigiCert Yeti2025 Log | 0.0 entries/s | Estimated done in 0.00 hours | Working on [697, 697)
+ DigiCert Nessie2023 Log | 331.3 entries/s | Estimated done in 168.00 hours | Working on [23040, 200387219)
+ DigiCert Nessie2024 Log | 329.8 entries/s | Estimated done in 33.68 hours | Working on [21120, 40017666)
+ DigiCert Nessie2025 Log | 0.0 entries/s | Estimated done in 0.00 hours | Working on [704, 704)
+ Sectigo 'Sabre' CT log | 275.7 entries/s | Estimated done in 230.78 hours | Working on [19456, 229064032)
+ Let's Encrypt 'Oak2023' log | 462.8 entries/s | Estimated done in 280.67 hours | Working on [33664, 467618545)
+ Let's Encrypt 'Oak2024H1' log | 121.4 entries/s | Estimated done in 78.79 hours | Working on [5248, 34451205)
+ Let's Encrypt 'Oak2024H2' log | 0.0 entries/s | Estimated done in 0.00 hours | Working on [14680, 14680)
+ Trust Asia Log2023 | 215.8 entries/s | Estimated done in 0.48 hours | Working on [15872, 388349)
+ Trust Asia Log2024-2 | 246.2 entries/s | Estimated done in 0.11 hours | Working on [17664, 112771)
+
+How long this takes depends on the local system, the optional
+`ct-sans collect` flags, and how heavily the logs apply rate-limits. For good
+performance while respecting rate-limits, you may want to try
+`--workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m`. This allowed
+us to download the logs (March 2023) in approximately 10 days. Our single-IP EU
+machine had a 2 TiB SSD, 64 GiB memory, 16 CPU cores, and a 1 Gbps link.
+
+It is safe to press ctrl+C while collecting. Just wait for the collect command
+to exit on its own so that progress is persisted to disk.
+
+Once the collect phase is done, assemble the data set:
+
+ $ echo "for demo-purposes, only Nessie2025 and Oak2024H2 are shown below"^C
+ $ ct-sans assemble -d $HOME/ct-sans-demo
+ 2023/03/23 13:05:12 cmd_assemble.go:54: INFO: merging and de-duplicating 2 input files with GNU sort
+ 2023/03/23 13:05:12 cmd_assemble.go:67: INFO: created /home/rgdd/ct-sans-demo/archive/2023-03-23-ct-sans/sans.lst (0.3 MiB)
+ 2023/03/23 13:05:12 cmd_assemble.go:69: INFO: adding notice file
+ 2023/03/23 13:05:12 cmd_assemble.go:87: INFO: adding README
+ 2023/03/23 13:05:12 cmd_assemble.go:96: INFO: adding signed metadata file
+ 2023/03/23 13:05:12 cmd_assemble.go:108: INFO: adding signed tree heads
+ 2023/03/23 13:05:12 cmd_assemble.go:117: INFO: uncompressed dataset available in /home/rgdd/ct-sans-demo/archive/2023-03-23-ct-sans
+ $ cat $HOME/ct-sans-demo/archive/2023-03-23-ct-sans/README.md
+ # ct-sans dataset
+
+ Dataset assembled at Thu Mar 23 13:05:12 CET 2023. Contents:
+
+ - README.md
+ - metadata.json
+ - metadata.sig
+ - sths.json
+ - notice.txt
+ - sans.lst
+
+ The signed [metadata file][] and tree heads were downloaded at
+ Thu Mar 23 12:43:49 CET 2023.
+
+ [metadata file]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto
+
+ In total, 15377 certificates were downloaded from 2 CT logs;
+ 0 certificates contained SANs that could not be parsed.
+ For more information about these errors, see notice.txt.
+
+ The SANs data set is sorted and de-duplicated, one SAN per line.
+
+**Note:** the different ct-sans commands must not run at the same time.
+
+## Updating the data set
+
+Simply run the same snapshot, collect, and assemble commands again.
+
+## Contact
+
+ - IRC: room #certificate-transparency at [OFTC.net][]
+ - Matrix: room [#certificate-transparency][] (bridged with IRC)
+ - Email: rasmus (at) rgdd (dot) se
+
+[OFTC.net]: https://www.oftc.net/
+[#certificate-transparency]: https://app.element.io/#/room/#sauteed-onions:matrix.org/
+
+## Licence
+
+BSD 2-Clause License
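
Reviewer note on the assemble step: the log output in this diff says the per-log files are merged and de-duplicated with GNU sort. The following is a minimal sketch of that idea, not the tool's actual invocation. The file names `a.lst` and `b.lst` and the temporary directory are hypothetical, and only the `-u` and `-o` options are used here, whereas the removed README text mentions further tuning flags for buffer size, temporary directory, and parallelism.

```shell
# Sketch only: plain `sort -u` on two hypothetical per-log lists. The exact
# flags that ct-sans passes to GNU sort are an assumption left out here.
export LC_ALL=C                      # byte-wise ordering, independent of locale
tmp=$(mktemp -d)                     # scratch directory for the demo files
printf 'example.org\nwww.example.org\n' > "$tmp/a.lst"
printf '*.example.net\nexample.org\n'   > "$tmp/b.lst"

# Merge both inputs, sort them, and drop duplicate lines in one pass.
sort -u -o "$tmp/sans.lst" "$tmp/a.lst" "$tmp/b.lst"
cat "$tmp/sans.lst"
```

The duplicate `example.org` entry appears once in the result, so the merged file holds three unique SANs, one per line. Fixing the locale keeps the ordering reproducible across machines, which matters if two assembled data sets are later compared with tools that expect identically sorted input.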