From a38f53d7808cc37fa90fb041c698a2ccbc77396e Mon Sep 17 00:00:00 2001
From: Rasmus Dahlberg
Date: Fri, 17 Mar 2023 09:31:02 +0100
Subject: Add drafty sketch of ct-sans data collection
---

# The Onion-Location world

This document describes how to estimate the size of the [Onion-Location][]
world. The intuition for obtaining an answer is as follows:

 1. Onion-Location requires HTTPS. Therefore, a fairly complete list of
    domains that _may_ offer Onion-Location can be determined by downloading
    all [CT-logged certificates][] and checking which [SANs][] are in them.
 2. Visit the encountered SANs over HTTPS without Tor, checking whether the
    web server sets either the Onion-Location HTTP header or the corresponding
    HTML meta tag.

Please note that this is a _lower-bound estimate_, e.g., because not all web
browsers enforce CT logging, and because SANs like `*.example.com` may have any
number of subdomains with their own Onion-Location configured sites that we
won't find.

We start by describing the experimental setup and tools used, followed by
results as well as availability of the collected and derived datasets.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
[CT-logged certificates]: https://certificate.transparency.dev/
[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6

## Experimental setup

XXX: system(s) used to run the below.

### ct-sans dataset

We put together a tool named [ct-sans][] that makes it easy to create a data
set of the unique SANs that appear in [CT logs recognized by Google Chrome][].

[ct-sans]: XXX
[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto

Install:

    $ go install ...ct-sans@latest
    ...

Make a snapshot of which CT logs and entries to download:

    $ ct-sans snapshot -d $HOME/.config/ct-sans
    ...

Collect, or continue collecting if you decided to shut down prematurely:

    $ ct-sans collect -d $HOME/.config/ct-sans
    ...

The collected data is stored per log, with each line containing a single SAN:

    $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
    example.org
    www.example.org
    *.example.net

[Motivation: this makes the data set easy to maintain, e.g., no special
indexing is needed, and log shards that only contain expired certificates can
be deleted once a year with `rm -rf`.]

The final data set of combined and deduplicated SANs can be created with the
UNIX tool `sort`. For the exact commands and an associated dataset manifest:

    $ ct-sans package -d $HOME/.config/ct-sans
    sort -u ...
    ...

Note that you may need to tweak the number of CPUs, available memory, and
temporary disk space to be suitable for your own system.

### zgrab2 website visits

[ZGrab2][] is an application-level scanner that (among other things) can visit
HTTPS sites and record encountered HTTP headers and web pages. This is exactly
what we need: visit each site in our ct-sans data set, saving the output only
if it indicates that the site is configured with Onion-Location.

Install:

    $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
    ...

Run:

    $ zgrab2 ...

XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
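Below is a minimal sketch of what such a wrapper might look like. It is
untested: the data set path, the zgrab2 flags, and the filter pattern are our
assumptions and need verifying (e.g., against `zgrab2 http --help`).

    #!/bin/bash
    #
    # visit.sh: rough sketch of the zgrab2 wrapper hinted at above.
    set -eu

    datadir=$HOME/.config/ct-sans   # assumed data directory, see above
    sans=$datadir/sans.lst          # assumed output of `ct-sans package`

    # Wildcard SANs like *.example.net cannot be visited directly, so drop
    # them before handing the list to zgrab2.  zgrab2 reads one domain per
    # line on stdin and emits one JSON result per line on stdout; keep only
    # the lines that mention Onion-Location (HTTP header or HTML meta tag).
    grep -v '^\*' "$sans" |
        zgrab2 http --port 443 --use-https |
        grep -i 'onion-location' >onion-location.jsonl

A case-insensitive match on `onion-location` should catch both the HTTP header
and the HTML meta tag, assuming zgrab2 records response bodies in its JSON
output.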
[ZGrab2]: https://github.com/zmap/zgrab2

### Results and data sets

XXX

### Remarks

 - The ct-sans data set can be updated by running the snapshot, collect, and
   package commands again. (The snapshot command downloads the latest list of
   CT logs and their signed tree heads to use as reference while collecting.)
 - The entire zgrab2 scan needs to be conducted from scratch for each data set
   because sites may add or remove Onion-Location configurations at any time.
 - The above does not make any attempt to visit the announced onion sites.

### Sanity check

We will download ~3.4 * 10^9 certificates in total.

We only store SANs, not complete certificates. Assume that each certificate
contains on average 256 bytes of SANs (about 1/6 of the average certificate
size). Then:

    256 * 3.4 * 10^9 bytes ≈ 0.8 TiB of SANs.

We will also need temporary disk space for sorting and removing duplicates, so
a machine with a 2 TiB disk should be more than enough.

The amount of data stored after the website visits will be negligible.

The more RAM and CPU workers we can get, the better. The same goes for
bandwidth. For running this more continuously in the future, a less powerful
machine should do.

XXX: Tobias will request a machine from our department tomorrow, minimum 8
CPUs, 32 GiB RAM, and ~2 TiB disk. The pitch is easier if we do the website
visits with Mullvad enabled, so we will do that.
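As a quick cross-check of the sanity-check arithmetic above, where the
3.4 * 10^9 certificate count and the 256-byte SAN estimate are the stated
assumptions:

    $ awk 'BEGIN { print 256 * 3.4e9 / 2^40, "TiB" }'
    0.791624 TiB

That is the ~0.8 TiB figure used above, which leaves plenty of headroom on a
2 TiB disk even with temporary files during sorting.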