author Rasmus Dahlberg <rasmus@rgdd.se> 2023-04-07 21:34:13 +0200
committer Rasmus Dahlberg <rasmus@rgdd.se> 2023-04-07 21:34:13 +0200
commit 8e0fa61c06fd12c502ea171bee65f5fd63ccb158
tree 6038b470ba372f1864a019527535c7887d4e5abb
parent d313584dd121271bdaedc6c600e9c49813ce2425
Add docs
-rw-r--r-- README.md          203
-rw-r--r-- docs/notes.md       24
-rw-r--r-- docs/operations.md   3
-rw-r--r-- docs/setup.md       19
4 files changed, 130 insertions, 119 deletions
diff --git a/README.md b/README.md
index d20caa3..4073b4b 100644
--- a/README.md
+++ b/README.md
@@ -7,14 +7,16 @@ A tool that visits a list of domains over HTTPS to see if they have
**Warning:** research prototype. The source code may also be moved.
-**TODO:** update and clean-up README.
-
## Quickstart
-You will need a Go compiler on the local system:
+### Install
+
+You will need a [Go compiler][] on the local system:
$ which go >/dev/null || echo "Go compiler is not in PATH"
+[Go compiler]: https://go.dev/doc/install
+
Install `onion-grab`:
$ go install git.cs.kau.se/rasmoste/onion-grab@latest
@@ -25,137 +27,100 @@ List all options:
### Basic usage
-Store domains in a file; one domain per line:
+Store one domain per line in a file:
$ cat domains.lst
www.eff.org
www.qubes-os.org
www.torproject.org
+
+Run onion-grab with default parameters:
+
$ onion-grab -i domains.lst
- 2023/03/25 17:43:30 INFO: starting await handler, ctrl+C to exit
- 2023/03/25 17:43:30 INFO: starting 2 workers
- 2023/03/25 17:43:30 INFO: starting work aggregator
- 2023/03/25 17:43:30 INFO: generating work
+ 2023/04/07 20:29:45 INFO: ctrl+C to exit prematurely
+ 2023/04/07 20:29:45 INFO: starting 128 workers with limit 64/s
+ 2023/04/07 20:29:45 INFO: starting work receiver
+ 2023/04/07 20:29:45 INFO: starting work generator
www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
- 2023/03/25 17:43:40 INFO: about to exit, reading remaining answers
- 2023/03/25 17:43:50 SUMMARY: 3/3 connected, 2 sites configured Onion-Location
+ 2023/04/07 20:29:50 INFO: metrics@receiver:
+
+ Processed: 3
+ Success: 3 (Onion-Location:2)
+ Failure: 0 (See breakdown below)
+ Req: 0 (Before sending request)
+ DNS: 0 (NotFound:0 Timeout:0 Other:0)
+ TCP: 0 (Timeout:0 Syscall:0)
+ TLS: 0 (Cert:0 Other:0)
+ 3xx: 0 (Too many redirects)
+ EOF: 0 (Unclear meaning)
+ CTX: 0 (Deadline exceeded)
+ ???: 0 (Other errors)
+
+ 2023/04/07 20:29:51 INFO: about to exit in at most 11s, reading remaining answers
+ 2023/04/07 20:29:57 INFO: metrics@receiver: summary:
+
+ Processed: 3
+ Success: 3 (Onion-Location:2)
+ Failure: 0 (See breakdown below)
+ Req: 0 (Before sending request)
+ DNS: 0 (NotFound:0 Timeout:0 Other:0)
+ TCP: 0 (Timeout:0 Syscall:0)
+ TLS: 0 (Cert:0 Other:0)
+ 3xx: 0 (Too many redirects)
+ EOF: 0 (Unclear meaning)
+ CTX: 0 (Deadline exceeded)
+ ???: 0 (Other errors)
+
+ 2023/04/07 20:29:57 INFO: measurement duration was 12s
Sites with Onion-Location are printed to stdout, here showing that
`www.torproject.org` configures it with an HTTP header while `www.qubes-os.org`
does it with an HTML attribute. All three sites connected successfully.
-### Working with a larger list
-
-Below the [Tranco top-1M][] list is used as an example; 100 workers are
-specified, metrics are printed every 15s, and sanity-checks against a site with
-Onion-Location which should be reachable are carried out every 60s.
-
- $ cut -d ',' -f2 top-1m.csv > top-1m.lst
- $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se
- 2023/03/25 17:44:20 INFO: starting await handler, ctrl+C to exit
- 2023/03/25 17:44:20 INFO: starting checker
- 2023/03/25 17:44:20 INFO: starting 100 workers
- 2023/03/25 17:44:20 INFO: starting work aggregator
- 2023/03/25 17:44:20 INFO: generating work
- nytimes.com header=https://www.nytimesn7cgmftshazwhfgzm37qxb44r64ytbb2dj3x62d2lljsciiyd.onion/ attribute=
- twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
- theguardian.com header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
- 2023/03/25 17:44:22 Transport: unhandled response frame type *http.http2UnknownFrame
- 2023/03/25 17:44:31 Transport: unhandled response frame type *http.http2UnknownFrame
- dw.com header=https://www.dwnewsgngmhlplxy6o2twtfgjnrnjxbegbwqx6wnotdhkzt562tszfid.onion/ attribute=
- brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
- 2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start
- guardian.co.uk header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
- proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute=
- voanews.com header=https://www.voanews5aitmne6gs2btokcacixclgfl43cv27sirgbauyyjylwpdtqd.onion/ attribute=
- 2023/03/25 17:44:50 INFO: currently 64.3 sites/s, 68.3 sites/s since start
- ^C2023/03/25 17:44:51 INFO: about to exit, reading remaining answers
- 2023/03/25 17:44:51 NOTICE: only read up until line 2089
- 2023/03/25 17:45:01 SUMMARY: 1488/2089 connected, 8 sites configured Onion-Location
+If errors occur, most are classified by type in the breakdown; relatively few end up as `???`.
-[Tranco top-1M]: https://tranco-list.eu/latest_list
+### Scripts
-Note that `ctrl+C` can be used to exit early as shown above. To continue from
-where you left off (line `2089`), specify the `-n` option on the next run:
-
- $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se -n 2089
- 2023/03/25 17:45:57 INFO: starting await handler, ctrl+C to exit
- 2023/03/25 17:45:57 INFO: starting checker
- 2023/03/25 17:45:57 INFO: starting 100 workers
- 2023/03/25 17:45:57 INFO: starting work aggregator
- 2023/03/25 17:45:57 INFO: generating work
- cia.gov header= attribute=http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion
- 2023/03/25 17:46:12 INFO: currently 79.9 sites/s, 79.9 sites/s since start
- propublica.org header=http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ attribute=
- theintercept.com header=https://54dus3ggt7uxz7wjvhkia2ntxmz5lkhbvgohrwur43trt3d6vrcvfmqd.onion/ attribute=
- torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
- 2023/03/25 17:46:27 INFO: currently 75.1 sites/s, 77.5 sites/s since start
- ^C2023/03/25 17:46:28 INFO: about to exit, reading remaining answers
- 2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start)
- 2023/03/25 17:46:38 SUMMARY: 1609/2399 connected, 4 sites configured Onion-Location
-
-## Known issues
-
-### Too many parallel workers
-
-Here's what would happen if the local system cannot handle the number of workers:
-
- $ onion-grab -i top-1m.lst -w 1000 -m 15s -C 60s -c rgdd.se
- 2023/03/25 17:47:36 INFO: starting await handler, ctrl+C to exit
- 2023/03/25 17:47:36 INFO: starting checker
- 2023/03/25 17:47:36 INFO: starting 1000 workers
- 2023/03/25 17:47:36 INFO: starting work aggregator
- 2023/03/25 17:47:36 INFO: generating work
- 2023/03/25 17:47:36 Transport: unhandled response frame type *http.http2UnknownFrame
- twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
- 2023/03/25 17:47:37 Transport: unhandled response frame type *http.http2UnknownFrame
- brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
- 2023/03/25 17:47:51 INFO: currently 151.3 sites/s, 151.3 sites/s since start
- 2023/03/25 17:48:06 INFO: currently 78.1 sites/s, 114.7 sites/s since start
- 2023/03/25 17:48:21 INFO: currently 121.9 sites/s, 117.1 sites/s since start
- 2023/03/25 17:48:36 INFO: currently 78.1 sites/s, 107.4 sites/s since start
- 2023/03/25 17:48:46 ERROR: checker expected onion for {Domain:rgdd.se OK:false HTTP: HTML:}
- 2023/03/25 17:48:46 NOTICE: only read up until line 7442
- 2023/03/25 17:48:46 INFO: about to exit, reading remaining answers
- 2023/03/25 17:48:56 SUMMARY: 232/7442 connected, 2 sites configured Onion-Location
-
-This is most likely an OS problem; not an onion-grab problem. Debug hints:
-
- - Stop and disable `systemd-resolved`, then specify a recursive resolver that
- can handle the expected load.
- - You may need to tinker with kernel tunables, see `ulimit -a` and `sysctl -a`
- for what can be configured. For example, if you find that the error is
- caused by too many open files, try increasing the value of `ulimit -n`.
-
-**Credit:** Björn Töpel helped debugging this issue.
-
-**Note:** domains with Onion-Location are likely to be missed if `-n 7442` is
-used here in a subsequent run. For example, with `-C 60s` and an average of 100
-domains/s, it would be wise to roll-back _at least_ 6000 lines. This should be
-a last-resort option, and is mainly here to sanity-check long measurements.
-
-### Misc notes
-
-We use the default `net.Dial` function, which in turn uses
-[goLookupIPCNAMEOrder][] for DNS lookups with the recursive name servers in
-`/etc/resolve.conf`. For example, with
-
- $ cat /etc/resolve.conf
- nameserver 8.8.8.8
- nameserver 8.8.4.4
-
-the query will first be directed to `8.8.8.8`; then `8.8.4.4` if no valid answer
-is available yet ([lines 663-778][]).
-
-[goLookupIPCNAMEOrder]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L595-L804
-[lines 663-778]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L663-L778
-
-Also note that the default settings makes us follow redirects (at most 10). And
-that 10MiB for MaxResponseHeaderBytes should be [conservative][]; so it seems
-like we should not be too strict if setting this to 32MiB.
-
-[conservative]: https://go-review.googlesource.com/c/go/+/21329/2/src/net/http/transport.go
+Digest the results, here stored as `onion-grab.stdout`:
+
+ $ cat onion-grab.stdout
+ www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
+ www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
+ $ ./scripts/digest.py -i onion-grab.stdout
+ digest.py:25 INFO: found 1 HTTP headers with Onion-Location
+ digest.py:26 INFO: found 1 HTML meta attributes with Onion-Location
+ digest.py:27 INFO: found 2 unqiue domain names that set Onion-Location
+ digest.py:28 INFO: found 2 unique two-label onion addresses in the process
+ digest.py:30 INFO: storing domains with valid Onion-Location configurations in domains.txt
+ digest.py:35 INFO: storing two-label onion addresses that domains referenced in onions.txt
+ $ cat domains.txt
+ www.qubes-os.org http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
+ www.torproject.org http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html
+ $ cat onions.txt
+ qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion www.qubes-os.org
+ 2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion www.torproject.org
+
+In other words, the digest script prints some information and writes two files:
+
+ - `domains.txt`: domains that configured valid Onion-Location headers. The
+ listed Onion-Location values are de-duplicated and space-separated.
+ - `onions.txt`: two-label `.onion` addresses that were discovered. The listed
+ domains referenced this address in their Onion-Location configuration,
+ possibly with subdomains, paths, etc., that were removed. Such pruning of
+ the set Onion-Location values is useful to estimate the number of onions.
+
+See [scripts/test.sh](./scripts/test.sh) if you are looking to test different
+`onion-grab` configurations. You may also find
+[scripts/measure.sh](scripts/measure.sh) to be a useful measurement script.
+
+## Running a larger measurement
+
+See [docs/operations.md](TODO)
+for measurements of [Tranco top-1M][] and [ct-sans][].
+
+[Tranco top-1M]: https://tranco-list.eu/latest_list
+[ct-sans]: https://git.cs.kau.se/rasmoste/ct-sans/-/blob/main/docs/operations.md
## Contact
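The pruning that the README hunk above describes for `onions.txt` (dropping subdomains, paths, and so on to get a two-label `.onion` address) can be sketched in Go as follows. This is a minimal illustration, not the logic of `scripts/digest.py`: the function name `twoLabelOnion` and the exact validation rules are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// twoLabelOnion returns the last two labels of the host found in an
// Onion-Location value, e.g. "abc.onion" for "http://www.abc.onion/x".
// Hypothetical helper: digest.py may validate and prune differently.
func twoLabelOnion(location string) (string, error) {
	u, err := url.Parse(location)
	if err != nil {
		return "", err
	}
	// Hostname strips any port; lower-case to de-duplicate reliably.
	labels := strings.Split(strings.ToLower(u.Hostname()), ".")
	if len(labels) < 2 || labels[len(labels)-1] != "onion" {
		return "", fmt.Errorf("not an onion address: %q", location)
	}
	return strings.Join(labels[len(labels)-2:], "."), nil
}

func main() {
	values := []string{
		"http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/",
		"http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html",
	}
	for _, v := range values {
		onion, err := twoLabelOnion(v)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(onion)
	}
}
```

Collecting the returned values in a set gives the kind of estimate of distinct onions that the README mentions.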
diff --git a/docs/notes.md b/docs/notes.md
new file mode 100644
index 0000000..d95c6f0
--- /dev/null
+++ b/docs/notes.md
@@ -0,0 +1,24 @@
+# Notes
+
+`onion-grab` uses the default `net.Dial` function, which in turn uses
+[goLookupIPCNAMEOrder][] for DNS lookups with the recursive name servers in
+`/etc/resolv.conf`. For example, with
+
+ $ cat /etc/resolv.conf
+ nameserver 8.8.8.8
+ nameserver 8.8.4.4
+
+[goLookupIPCNAMEOrder]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L595-L804
+
+the query will first be directed to `8.8.8.8`; then `8.8.4.4` if no valid answer
+is available yet ([lines 663-778][]). If you are running `onion-grab` with
+[Mullvad VPN][], specify custom DNS: `mullvad dns set custom 8.8.8.8 8.8.4.4`.
+
+[lines 663-778]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L663-L778
+[Mullvad VPN]: https://www.mullvad.net/
+
+Further, default settings are used to follow at most 10 HTTP 3XX redirects. A
+[conservative][] value for the `MaxResponseHeaderBytes` option is 10MiB; the
+`onion-grab` default is 16MiB and our measurements bumped this up to 64MiB.
+
+[conservative]: https://go-review.googlesource.com/c/go/+/21329/2/src/net/http/transport.go
diff --git a/docs/operations.md b/docs/operations.md
new file mode 100644
index 0000000..1528c32
--- /dev/null
+++ b/docs/operations.md
@@ -0,0 +1,3 @@
+# Operations
+
+Placeholder.
diff --git a/docs/setup.md b/docs/setup.md
new file mode 100644
index 0000000..7e4bdb8
--- /dev/null
+++ b/docs/setup.md
@@ -0,0 +1,19 @@
+# Setup
+
+`onion-grab` has been tested on Ubuntu/Debian based systems. If you are running
+a large measurement, you may run into issues that are **OS related**.
+
+## Hints
+
+ - We disabled and stopped `systemd-resolved`, since it eventually causes some
+ or all DNS requests to be blocked when running with many concurrent workers.
+ - We used Google's `8.8.8.8` and `8.8.4.4`, which [support 1500qps][] per IP.
+ - You may need to tinker with `ulimit` and `sysctl`, e.g., if observing that
+ there are too many open file descriptors or similar. See for example the
+ value of `ulimit -n` and `sysctl net.ipv4.ip_local_port_range`.
+
+[support 1500qps]: https://developers.google.com/speed/public-dns/docs/isp
+
+## Credit
+
+Björn Töpel helped us debug some of these OS-related issues.
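The `ulimit -n` hint in `docs/setup.md` above can also be checked from inside a Go program before workers are started. This is a rough sketch for Unix-like systems only. The helper names and the headroom of 64 descriptors are assumptions for illustration, and one descriptor per worker is a simplification rather than a measured figure for `onion-grab`.

```go
package main

import (
	"fmt"
	"syscall"
)

// fdHeadroom is a hypothetical allowance for stdin/stdout, the input
// file, DNS sockets, and similar.
const fdHeadroom = 64

// openFileLimit returns the soft and hard RLIMIT_NOFILE values of the
// current process.
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

// limitSufficient assumes at least one descriptor per worker plus headroom.
func limitSufficient(soft uint64, workers int) bool {
	return soft >= uint64(workers)+fdHeadroom
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
	if !limitSufficient(soft, 128) {
		fmt.Println("consider raising `ulimit -n` before starting 128 workers")
	}
}
```

Note that Go 1.19 and later raise the soft limit towards the hard limit at startup for programs that import `os`, so the value seen here can differ from what `ulimit -n` prints in the shell.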