From a38f53d7808cc37fa90fb041c698a2ccbc77396e Mon Sep 17 00:00:00 2001
From: Rasmus Dahlberg
Date: Fri, 17 Mar 2023 09:31:02 +0100
Subject: Add drafty sketch of ct-sans data collection

---
 README.md | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..605d731
--- /dev/null
+++ b/README.md
@@ -0,0 +1,123 @@
# The Onion-Location world

This document describes how to estimate the size of the [Onion-Location][]
world. The intuition for obtaining an answer is as follows:

  1. Onion-Location requires HTTPS. Therefore, a fairly complete list of
     domains that _may_ offer Onion-Location can be determined by downloading
     all [CT-logged certificates][] and checking which [SANs][] are in them.
  2. Visit the encountered SANs over HTTPS without Tor, checking whether the
     web server sets either the Onion-Location HTTP header or the HTML meta
     tag.

Please note that this is a _lower-bound estimate_: not all web browsers
enforce CT logging, and SANs like `*.example.com` may have any number of
subdomains with their own Onion-Location configured sites that we won't find.

We start by describing the experimental setup and tools used, followed by the
results as well as the availability of the collected and derived datasets.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
[CT-logged certificates]: https://certificate.transparency.dev/
[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6

## Experimental setup

XXX: system(s) used to run the below.

### ct-sans dataset

We put together a tool named [ct-sans][] that facilitates simple creation of a
data set composed of the unique SANs in [CT logs recognized by Google
Chrome][].

[ct-sans]: XXX
[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto

Install:

    $ go install ...ct-sans@latest
    ...

Make a snapshot of which CT logs and entries to download:

    $ ct-sans snapshot -d $HOME/.config/ct-sans
    ...

Collect, or continue collecting if you decided to shut down prematurely:

    $ ct-sans collect -d $HOME/.config/ct-sans
    ...

The collected data is stored per log, with each line containing a single SAN:

    $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
    example.org
    www.example.org
    *.example.net

[Motivation: this keeps the data set easy to maintain, e.g., no special
indexing is needed, and log shards that only contain expired certificates can
be deleted each year with `rm -rf`.]

The final data set of combined and de-duplicated SANs can be created with the
UNIX tool `sort`. For the exact commands and an associated dataset manifest:

    $ ct-sans package -d $HOME/.config/ct-sans
    sort -u ...
    ...

Note that you may need to tweak the number of CPUs, available memory, and
temporary disk space to be suitable for your own system.

### zgrab2 website visits

[ZGrab2][] is an application-level scanner that (among other things) can visit
HTTPS sites and record the encountered HTTP headers and web pages. This is
exactly what we need: visit each site in the ct-sans dataset, saving the
output only if it indicates that the site is configured with Onion-Location.

Install:

    $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
    ...

Run:

    $ zgrab2 ...

XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
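For illustration, a minimal sketch of one possible filter. It assumes that the
combined SAN list is in a file named `sans.lst` (a placeholder), that zgrab2
reads targets from stdin and writes one JSON record per target to stdout, and
that a case-insensitive match on the string `onion-location` catches both the
HTTP header and the HTML meta tag; the exact module flags may differ between
zgrab2 versions:

    $ zgrab2 http --port 443 --use-https <sans.lst \
        | grep -i 'onion-location' >onion-location.jsonl

Each line that survives the filter is a complete zgrab2 record, so the
advertised onion address can be extracted in a later post-processing step.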
[ZGrab2]: https://github.com/zmap/zgrab2

### Results and data sets

XXX

### Remarks

  - The ct-sans dataset can be updated by running the snapshot, collect, and
    package commands again. (The snapshot command downloads the latest list of
    CT logs and their signed tree heads to use as reference while collecting.)
  - The entire zgrab2 scan needs to be conducted from scratch for each dataset
    because sites may add or remove Onion-Location configurations at any time.
  - The above does not make any attempt to visit the announced onion sites.

### Sanity check

We will download ~3.4 * 10^9 certificates in total.

We only store SANs, not complete certificates. Assume that each certificate
has on average 256 bytes of SANs (1/6 of the average certificate size). Then:

    256 * 3.4 * 10^9 bytes ≈ 0.8 TiB of SANs.

(A quick check of this figure is sketched at the end of this document.)

We will also need temporary disk space for sorting and removing duplicates, so
a machine with a 2 TiB disk should probably be more than enough.

The data that needs to be stored after the website visits will be negligible.

The more RAM and CPU workers we can get, the better. Same with bandwidth. For
running this more continuously in the future, a less powerful machine should
do.

XXX: Tobias will request a machine from our department tomorrow, minimum 8
CPUs, 32 GiB RAM, and ~2 TiB disk. The pitch is easier if we do the website
visits with Mullvad enabled, so we will do that.
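As a quick check of the ~0.8 TiB figure in the sanity check above, the
arithmetic can be reproduced with the standard UNIX calculator `bc`:

    $ echo '256 * 3.4 * 10^9 / 1024^4' | bc -l
    .7916...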