# The Onion-Location world

This document describes how to estimate the size of the [Onion-Location][]
world.  The intuition for obtaining an answer is as follows:

  1.  Onion-Location requires HTTPS.  Therefore, a fairly complete list of
      domains that _may_ offer Onion-Location can be determined by downloading
      all [CT-logged certificates][] and checking which [SANs][] are in them.
  2.  Visit the encountered SANs over HTTPS without Tor, checking whether the
      web server sets either the Onion-Location HTTP header or HTML meta-tag
      (see the example below).

Please note that this is a _lower-bound estimate_, e.g., because not all web
browsers enforce CT logging, and SANs like `*.example.com` may have any number
of subdomains with their own Onion-Location configured sites that we won't
find.

We start by describing the experimental setup and tools used, followed by
results as well as availability of the collected and derived datasets.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
[CT-logged certificates]: https://certificate.transparency.dev/
[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6
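As a concrete example of what step 2 looks for, the header variant can be
spot-checked with `curl`.  This is not part of the pipeline, and it assumes
that `www.torproject.org` (still) announces its onion service this way:

    $ curl -sI https://www.torproject.org | grep -i onion-location
    Onion-Location: http://...onion/

The meta-tag variant instead shows up in the returned HTML as
`<meta http-equiv="onion-location" content="http://...onion/">`.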
## Experimental setup

XXX: system(s) used to run the below.

### ct-sans dataset

We put together a tool named [ct-sans][] that facilitates simple creation of a
dataset composed of the unique SANs in [CT logs recognized by Google Chrome][].

[ct-sans]: XXX
[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto

Install:

    $ go install ...ct-sans@latest
    ...

Make a snapshot of which CT logs and entries to download:

    $ ct-sans snapshot -d $HOME/.config/ct-sans
    ...

Collect, or continue collecting if you decided to shut down prematurely:

    $ ct-sans collect -d $HOME/.config/ct-sans
    ...

The collected data is stored per-log, with each line containing a single SAN:

    $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
    example.org
    www.example.org
    *.example.net

[Motivation: this keeps the dataset easy to maintain, e.g., no special
indexing is needed, and log shards that only contain expired certificates can
be deleted each year with `rm -rf`.]

The final dataset of combined and deduplicated SANs can be created with the
UNIX tool `sort`.  For the exact commands and an associated dataset manifest:

    $ ct-sans package -d $HOME/.config/ct-sans
    sort -u ...
    ...

Note that you may need to tweak the number of CPUs, available memory, and
temporary disk space to be suitable for your own system.
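For illustration, such tuning with GNU coreutils `sort` could look as follows
(the flag values and paths are made-up placeholders, not the exact command
emitted by `ct-sans package`):

    $ sort -u --parallel=8 -S 16G -T /mnt/scratch -o sans.lst logs/*/sans.lst

Here `--parallel` bounds the number of concurrent sort threads, `-S` sets the
main-memory buffer used before spilling to disk, and `-T` points at a
temporary directory with enough free space for the spill files.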
### zgrab2 website visits

[ZGrab2][] is an application-level scanner that (among other things) can visit
HTTPS sites and record the encountered HTTP headers and web pages.  This is
exactly what we need to visit each site in our ct-sans dataset; we only save
the output if it indicates that the site is configured with Onion-Location.

[ZGrab2]: https://github.com/zmap/zgrab2

Install:

    $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
    ...

Run:

    $ zgrab2 ...

XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
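Until then, something along these lines is what we have in mind (the flag
names follow our reading of the zgrab2 documentation and may need adjusting;
`sans.lst` is the packaged ct-sans dataset):

    $ zgrab2 http --port 443 --use-https --input-file sans.lst \
          --output-file - | grep -i onion-location >hits.jsonl

A case-insensitive match on `onion-location` should catch both the HTTP
header and the HTML meta-tag, since zgrab2 records response headers as well
as page bodies in its JSON output.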
### Results and datasets

XXX

### Remarks

  - The ct-sans dataset can be updated by running the snapshot, collect, and
    package commands again.  (The snapshot command downloads the latest list
    of CT logs and their signed tree heads to use as reference while
    collecting.)
  - The entire zgrab2 scan needs to be conducted from scratch for each dataset
    because sites may add or remove Onion-Location configurations at any time.
  - The above does not make any attempts to visit the announced onion sites.

### Sanity check

We will download ~3.4 * 10^9 certificates in total.

We only store SANs, not complete certificates.  Assume that each certificate
has on average 256 bytes of SANs (1/6 of the average certificate size).  Then:

  256 * 3.4 * 10^9 bytes = ~0.8 TiB of SANs.

We will also need temporary disk space for sorting and removing duplicates; a
machine with a 2 TiB disk should probably be more than enough.

The data that needs to be stored after the website visits will be ~negligible.

The more RAM and CPU workers we can get, the better.  The same goes for
bandwidth.  For running this more continuously in the future, a less powerful
machine should do.

XXX: Tobias will request a machine from our department tomorrow, minimum 8
CPUs, 32 GiB RAM, and ~2 TiB disk.  The pitch is easier if we do the website
visits with Mullvad enabled, so we will do that.
