# The Onion-Location world

This document describes how to estimate the size of the [Onion-Location][]
world.  The intuition for obtaining an answer is as follows:

  1. Onion-Location requires HTTPS.  Therefore, a pretty complete list of
     domains that _may_ offer Onion-Location can be determined by downloading
     all [CT-logged certificates][] and checking which [SANs][] are in them.
  2. Visit the encountered SANs over HTTPS without Tor, checking whether the
     web server sets either the Onion-Location HTTP header or the HTML
     meta-tag.

Please note that this is a _lower-bound estimate_, e.g., because not all web
browsers enforce CT logging, and SANs like `*.example.com` may have any number
of subdomains with their own Onion-Location configured sites that we won't
find.

We start by describing the experimental setup and tools used, followed by
results as well as availability of the collected and derived datasets.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
[CT-logged certificates]: https://certificate.transparency.dev/
[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6

## Experimental setup

XXX: system(s) used to run the below.

### ct-sans dataset

We put together a tool named [ct-sans][] that facilitates simple creation of a
data set composed of the unique SANs in [CT logs recognized by Google
Chrome][].

[ct-sans]: XXX
[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto

Install:

    $ go install ...ct-sans@latest
    ...

Make a snapshot of which CT logs and entries to download:

    $ ct-sans snapshot -d $HOME/.config/ct-sans
    ...

Collect, or continue collecting if a previous run was shut down prematurely:

    $ ct-sans collect -d $HOME/.config/ct-sans
    ...

The collected data is per-log, with each line containing a single SAN:

    $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
    example.org
    www.example.org
    *.example.net

[Motivation: this keeps the data set easy to maintain, e.g., no special
indexing is needed, and log shards that only contain expired certificates can
be deleted yearly with `rm -rf`.]

The final data set of combined and non-duplicate SANs can be created with the
UNIX tool `sort`.  For the exact commands and an associated dataset manifest:

    $ ct-sans package -d $HOME/.config/ct-sans
    sort -u ...
    ...

Note that you may need to tweak the number of CPUs, available memory, and
temporary disk space to be suitable for your own system.

### zgrab2 website visits

[ZGrab2][] is an application-level scanner that (among other things) can visit
HTTPS sites to record encountered HTTP headers and web pages.  This is exactly
what we need: visit each site in our ct-sans dataset, but only save the output
if it indicates that the site is configured with Onion-Location.

Install:

    $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
    ...

Run:

    $ zgrab2 ...

XXX: describe the grep pattern for filtering, and/or wrap in a bash script;
see the sketch below.

[ZGrab2]: https://github.com/zmap/zgrab2
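As a starting point for that script, here is a minimal sketch of what the
wrapper could look like.  It assumes zgrab2's `http` module reading targets on
stdin and writing JSON results to stdout, and it matches `onion-location`
case-insensitively so that both the HTTP header and the HTML meta-tag are
caught; the exact flag names, paths, and output format are assumptions that
should be verified against the installed zgrab2 version.

    #!/bin/sh
    #
    # Sketch: visit every SAN over HTTPS with zgrab2 and keep only results
    # that mention Onion-Location, either as an HTTP response header
    # ("Onion-Location: http://<addr>.onion/...") or as an HTML meta-tag
    # (<meta http-equiv="onion-location" content="http://<addr>.onion/...">).
    #
    # Paths, flag values, and the grep pattern are assumptions.
    set -eu

    sans=$HOME/.config/ct-sans/sans.lst   # hypothetical output of ct-sans package
    out=onion-location.jsonl

    # Wildcard SANs such as *.example.net are not resolvable host names and
    # cannot be visited directly, so drop them before scanning.
    grep -v '^\*\.' "$sans" |
        zgrab2 http --port 443 --use-https --max-redirects 2 |
        grep -iE 'onion[-_]location' >"$out"

Each kept line should be the full JSON record zgrab2 produced for one site, so
separating header-based from meta-tag-based announcements can be left to
post-processing.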
### Results and data sets

XXX

### Remarks

  - The ct-sans dataset can be updated by running the snapshot, collect, and
    package commands again.  (The snapshot command downloads the latest list
    of CT logs and their signed tree heads to use as reference while
    collecting.)
  - The entire zgrab2 scan needs to be conducted from scratch for each dataset
    because sites may add or remove Onion-Location configurations at any time.
  - The above does not make any attempt to visit the announced onion sites.

### Sanity check

We will download ~3.4 * 10^9 certificates in total.  We only store SANs, not
complete certificates.  Assume that each certificate has on average 256 bytes
of SANs (1/6 of the average certificate size).  Then: 256 * 3.4 * 10^9 bytes
~= 0.8 TiB of SANs.

We will also need temporary disk space for sorting and removing duplicates; so
if we can get a machine with a 2 TiB disk, that should probably be more than
enough.

The data that needs to be stored after the website visits will be ~negligible.

The more RAM and CPU workers we can get, the better.  Same with bandwidth.

For running this more continuously in the future, a less powerful machine
should do.

XXX: Tobias will request a machine from our department tomorrow, minimum 8
CPUs, 32 GiB RAM, and ~2 TiB disk.  The pitch is easier if we do the website
visits with Mullvad enabled, so we will do that.
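As a rough illustration of how these resources map onto the deduplication
step, the `sort -u` from the ct-sans section could be tuned along the
following lines.  This is a sketch only: the paths and numbers are assumptions
based on the estimates above, and `ct-sans package` prints the exact commands
to run.

    $ # Merge and deduplicate all per-log SAN lists.  LC_ALL=C gives fast
    $ # byte-wise collation; --parallel roughly matches the CPU count, -S
    $ # bounds the in-memory sort buffer, and -T points at a disk with
    $ # around 1 TiB free for temporary spill files.
    $ LC_ALL=C sort -u --parallel=8 -S 16G -T /var/tmp/ct-sans \
          $HOME/.config/ct-sans/logs/*/sans.lst >sans.lst

With ~0.8 TiB of input and temporary files of comparable size, the ~2 TiB disk
above should leave reasonable headroom.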