# The Onion-Location world

This document describes how to estimate the size of the [Onion-Location][]
world.  The intuition for obtaining an answer is as follows:

  1.  Onion-Location requires HTTPS.  Therefore, a fairly complete list of
      domains that _may_ offer Onion-Location can be determined by downloading
      all [CT-logged certificates][] and extracting the [SANs][] in them.
  2.  Visit the encountered SANs over HTTPS without Tor, checking whether the
      web server sets either the Onion-Location HTTP header or the equivalent
      HTML meta tag (examples below).
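
For reference, these are the two ways in which a site can announce its onion
service counterpart (`<onion-address>` is a placeholder, not a real address):

    Onion-Location: http://<onion-address>.onion/

or, equivalently, as a meta tag in the returned HTML:

    <meta http-equiv="onion-location" content="http://<onion-address>.onion/" />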

Please note that this yields a _lower-bound estimate_, e.g., because not all
web browsers enforce CT logging, and because a wildcard SAN like
`*.example.com` may cover any number of subdomains with their own
Onion-Location configured sites that we won't find.

We start by describing the experimental setup and tools used, followed by the
results and the availability of the collected and derived datasets.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/
[CT-logged certificates]: https://certificate.transparency.dev/
[SANs]: https://www.rfc-editor.org/rfc/rfc5280#section-4.2.1.6

## Experimental setup

XXX: system(s) used to run the below.

### ct-sans dataset

We put together a tool named [ct-sans][] that makes it easy to create a
dataset of the unique SANs appearing in [CT logs recognized by Google Chrome][].

[ct-sans]: https://git.cs.kau.se/rasmoste/ct-sans
[CT logs recognized by Google Chrome]: https://groups.google.com/a/chromium.org/g/ct-policy/c/IdbrdAcDQto

Install:

    $ go install git.cs.kau.se/rasmoste/ct-sans@latest
    ...

Make a snapshot of which CT logs and entries to download:

    $ ct-sans snapshot -d $HOME/.config/ct-sans
    ...

Collect, or continue collecting if you previously shut down prematurely:

    $ ct-sans collect -d $HOME/.config/ct-sans
    ...

The collected data is per-log, with each line containing a single SAN:

    $ tail -n3 $HOME/.config/ct-sans/logs/xxx/sans.lst
    example.org
    www.example.org
    *.example.net

[Motivation: this keeps the dataset easy to maintain, e.g., no special
indexing is needed, and each year the log shards that only contain expired
certificates can be deleted with `rm -rf`, as sketched below.]
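
For example, assuming the per-log directory layout shown above (the shard name
below is hypothetical), the yearly cleanup could look like:

    $ rm -rf $HOME/.config/ct-sans/logs/<some-expired-2023-shard>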

The final dataset of combined and de-duplicated SANs can be created with the
UNIX tool `sort`.  For the exact commands and an associated dataset manifest:

    $ ct-sans package -d $HOME/.config/ct-sans
    sort -u ...
    ...

    sort -Vuo sans.lst --buffer-size=1024K --temporary-directory=/tmp/t --parallel=2 a.lst b.lst

Note that you may need to tweak the number of CPUs, the available memory, and
the temporary disk space to suit your own system; see the sketch below.
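
For example, on a machine with more resources, the same flags (all standard
GNU `sort` options) might be bumped along these lines (the values here are
illustrative, not recommendations):

    $ sort -Vuo sans.lst --buffer-size=8G --parallel=8 \
          --temporary-directory=/mnt/big-disk/tmp a.lst b.lst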

### zgrab2 website visits

[ZGrab2][] is an application-layer scanner that (among other things) can visit
HTTPS sites and record the encountered HTTP headers and web pages.  This is
exactly what we need to visit each site in our ct-sans dataset; we only save
the output if it indicates that the site is configured with Onion-Location.

Install:

    $ go install github.com/zmap/zgrab2/cmd/zgrab2@latest
    ...

Run:

    $ zgrab2 ...

XXX: describe the grep pattern for filtering, and/or wrap in a bash script.
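
Until that is documented, a minimal sketch of what the run and filter could
look like, using zgrab2's http module (treat the exact flags and the grep
pattern as assumptions; the pattern matches the header name or the meta tag
anywhere in zgrab2's JSON output):

    $ zgrab2 http --input-file sans.lst --port 443 --use-https \
          --output-file results.jsonl
    $ grep -i 'onion-location' results.jsonl >onion-location.jsonl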

[ZGrab2]: https://github.com/zmap/zgrab2

### Results and data sets

XXX

### Remarks

  - The ct-sans dataset can be updated by running the snapshot, collect, and
    package commands again.  (The snapshot command downloads the latest list of
    CT logs and their signed tree heads to use as reference while collecting.)
  - The entire zgrab2 scan needs to be conducted from scratch for each dataset
    because sites may add or remove Onion-Location configurations at any time.
  - The above does not make any attempt to visit the announced onion sites.

### Sanity check

We will download ~3.4 * 10^9 certificates in total (roughly the sum of the
tree sizes in the snapshot below).

We only store SANs, not complete certificates.  Assume that each certificate
has on average 256 bytes of SANs (1/6 of the average certificate size).  Then:

  256 B * 3.4 * 10^9 ≈ 870 GB ≈ 0.8 TiB of SANs.

We will also need temporary disk space for sorting and removing duplicates; so
a machine with a 2 TiB disk should probably be more than enough.

The data that needs to be stored after the website visits will be ~negligible.

The more RAM and CPU workers we can get, the better.  Same with bandwidth.  For
running this more continuously in the future, a less powerful machine should do.

XXX: Tobias will request a machine from our department tomorrow: minimum 8
CPUs, 32 GiB RAM, and ~2 TiB disk.  The pitch is easier if we do the website
visits with Mullvad enabled, so we will do that.

### Notes from starting ct-sans on our machine

    $ go install git.cs.kau.se/rasmoste/ct-sans@v0.0.1
    $ ct-sans snapshot >snapshot.stdout
    $ cat snapshot.stdout
    2023/03/18 20:05:30 cmd_snapshot.go:30: INFO: updating metadata file
    2023/03/18 20:05:30 cmd_snapshot.go:47: INFO: updating signed tree heads
    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2023' log at tree size 841710936
    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Argon2024' log at tree size 52689060
    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2023' log at tree size 966608751
    2023/03/18 20:05:30 cmd_snapshot.go:82: INFO: bootstrapped Google 'Xenon2024' log at tree size 63025768
    2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2023' Log at tree size 513025681
    2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped Cloudflare 'Nimbus2024' Log at tree size 31749516
    2023/03/18 20:05:31 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2024 Log at tree size 36293063
    2023/03/18 20:05:32 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Yeti2025 Log at tree size 687
    2023/03/18 20:05:33 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2023 Log at tree size 198950777
    2023/03/18 20:05:33 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2024 Log at tree size 37773373
    2023/03/18 20:05:34 cmd_snapshot.go:82: INFO: bootstrapped DigiCert Nessie2025 Log at tree size 694
    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Sectigo 'Sabre' CT log at tree size 228782818
    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2023' log at tree size 457590228
    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H1' log at tree size 32863293
    2023/03/18 20:05:35 cmd_snapshot.go:82: INFO: bootstrapped Let's Encrypt 'Oak2024H2' log at tree size 13281
    2023/03/18 20:05:36 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2023 at tree size 379756
    2023/03/18 20:05:38 cmd_snapshot.go:82: INFO: bootstrapped Trust Asia Log2024-2 at tree size 110270
    $ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr

Tail in a tmux pane:

    $ tail -f collect.stdout
    ...
    $ head -n1 collect.stdout
    2023/03/18 20:11:52 cmd_collect.go:150: INFO: collect is up-and-running, ctrl+C to exit

Note that it is safe to press ctrl+C; state is written to disk before exit.
Just run `ct-sans collect` again to continue from where we left off.

Tail in another tmux pane:

    $ tail -f collect.stderr
    ...

With the above settings, it takes 60m before the first status report is printed.