author    Rasmus Dahlberg <rasmus@rgdd.se>  2023-03-17 09:31:02 +0100
committer Rasmus Dahlberg <rasmus@rgdd.se>  2023-03-17 09:31:02 +0100
commit    a38f53d7808cc37fa90fb041c698a2ccbc77396e (patch)
tree      5eca05b7c76640c972ef8899d9171d98d3b0bfe5
Add drafty sketch of ct-sans data collection
-rw-r--r--  README.md  123
1 file changed, 123 insertions(+), 0 deletions(-)
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..605d731
--- /dev/null
+++ b/README.md
@@ -0,0 +1,123 @@
+# The Onion-Location world
+
+This document describes how to estimate the size of the [Onion-Location][]
+world. The intuition for obtaining an answer is as follows:
+
+ 1. Onion-Location requires HTTPS. Therefore, a reasonably complete list of
+    domains that _may_ offer Onion-Location can be determined by downloading
+    all [CT-logged certificates][] and checking which [SANs][] are in them.
+ 2. Visit the encountered SANs over HTTPS without Tor, checking whether the
+    web server sets either the Onion-Location HTTP header or the equivalent
+    HTML meta tag (examples follow after this list).
+
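+For reference, the two ways a site can announce Onion-Location look roughly
+as follows; the onion address is a made-up placeholder:
+
+    Onion-Location: http://example.onion/
+
+    <meta http-equiv="onion-location" content="http://example.onion/" />
+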
+Please note that the result is a _lower-bound estimate_: not all web browsers
+enforce CT logging, and a wildcard SAN like `*.example.com` may cover any
+number of subdomains with their own Onion-Location sites that we won't find.
+
+We start by describing the experimental setup and tools used, followed by the
+results and the availability of the collected and derived datasets.
+
+[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
+[CT-logged certificates]: https://certificate.transparency.dev/
+[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6
+
+## Experimental setup
+
+XXX: system(s) used to run the below.
+
+### ct-sans dataset
+
+We put together a tool named [ct-sans][] that makes it easy to create a
+dataset of the unique SANs in [CT logs recognized by Google Chrome][].
+
+[ct-sans]: XXX
+[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto
+
+Install:
+
+ $ go install ...ct-sans@latest
+ ...
+
+Make a snapshot of which CT logs and entries to download:
+
+ $ ct-sans snapshot -d $HOME/.config/ct-sans
+ ...
+
+Collect, or continue collecting if a previous run was shut down prematurely:
+
+ $ ct-sans collect -d $HOME/.config/ct-sans
+ ...
+
+The collected data is per-log, with each line containing a single SAN:
+
+ $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
+ example.org
+ www.example.org
+ *.example.net
+
+[Motivation: this keeps the dataset easy to maintain, e.g., no special
+indexing is needed, and log shards that only contain expired certificates can
+be deleted yearly with `rm -rf` (see the sketch below).]
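+
+For example, assuming a hypothetical shard directory named `xxx2022` whose
+certificates all expired in 2022, the yearly cleanup amounts to:
+
+    $ rm -rf $HOME/.config/ct-sans/logs/xxx2022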
+
+The final dataset of combined and deduplicated SANs can be created with the
+UNIX tool `sort`. For the exact commands and an associated dataset manifest:
+
+ $ ct-sans package -d $HOME/.config/ct-sans
+ sort -u ...
+ ...
+
+Note that you may need to tweak the number of CPUs, the amount of memory, and
+the temporary disk space used by `sort` to suit your own system, as sketched
+below.
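+
+A minimal sketch of those knobs, assuming GNU coreutils `sort` and the per-log
+layout shown above (the exact commands are printed by `ct-sans package`):
+
+    $ sort -u --parallel=8 -S 16G -T /scratch/tmp \
+          $HOME/.config/ct-sans/logs/*/sans.lst >sans.lst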
+
+### zgrab2 website visits
+
+[ZGrab2][] is an application-level scanner that (among other things) can visit
+HTTPS sites and record encountered HTTP headers and web pages. This is exactly
+what we need: visit each site in our ct-sans dataset, saving the output only
+if it indicates that the site is configured with Onion-Location.
+
+Install:
+
+ $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
+ ...
+
+Run:
+
+    $ zgrab2 ...
+
+XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
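+
+A minimal sketch of such a wrapper, assuming zgrab2's HTTP module on port 443
+with HTTPS enabled, and assuming that a case-insensitive match on the string
+`onion-location` in zgrab2's JSON output lines catches both the HTTP header
+and the HTML meta tag:
+
+    #!/bin/bash
+    #
+    # Visit each SAN over HTTPS, keeping only zgrab2 results that
+    # mention Onion-Location (HTTP header or HTML meta tag).
+    #
+    set -eu
+
+    zgrab2 http \
+        --port 443 \
+        --use-https \
+        --input-file sans.lst |
+        grep -i onion-location >onion-location.jsonl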
+
+[ZGrab2]: https://github.com/zmap/zgrab2
+
+## Results and datasets
+
+XXX
+
+## Remarks
+
+ - The ct-sans dataset can be updated by running the snapshot, collect, and
+   package commands again. (The snapshot command downloads the latest list of
+   CT logs and their signed tree heads to use as reference while collecting.)
+ - The entire zgrab2 scan needs to be conducted from scratch for each dataset
+   because sites may add or remove Onion-Location configurations at any time.
+ - The above does not make any attempt to visit the announced onion sites.
+
+## Sanity check
+
+We will download ~3.4 * 10^9 certificates in total.
+
+We only store SANs, not complete certificates. Assume that each certificate
+has on average 256 bytes of SANs (1/6 of the average certificate size). Then:
+
+    256 * 3.4 * 10^9 bytes ≈ 8.7 * 10^11 bytes ≈ 0.8 TiB of SANs.
+
+We will also need temporary disk space for sorting and removing duplicates, so
+a machine with a 2 TiB disk should be more than enough.
+
+The amount of data stored after the website visits will be negligible.
+
+The more RAM, CPU workers, and bandwidth we can get, the better. For running
+this more continuously in the future, a less powerful machine should do.
+
+XXX: Tobias will request a machine from our department tomorrow: minimum 8
+CPUs, 32 GiB RAM, and ~2 TiB disk. The pitch is easier if we do the website
+visits with Mullvad enabled, so we will do that.