From 8e0fa61c06fd12c502ea171bee65f5fd63ccb158 Mon Sep 17 00:00:00 2001 From: Rasmus Dahlberg Date: Fri, 7 Apr 2023 21:34:13 +0200 Subject: Add docs --- README.md | 203 ++++++++++++++++++++++------------------------------- docs/notes.md | 24 +++++++ docs/operations.md | 3 + docs/setup.md | 19 +++++ 4 files changed, 130 insertions(+), 119 deletions(-) create mode 100644 docs/notes.md create mode 100644 docs/operations.md create mode 100644 docs/setup.md diff --git a/README.md b/README.md index d20caa3..4073b4b 100644 --- a/README.md +++ b/README.md @@ -7,14 +7,16 @@ A tool that visits a list of domains over HTTPS to see if they have **Warning:** research prototype. The source code may also be moved. -**TODO:** update and clean-up README. - ## Quickstart -You will need a Go compiler on the local system: +### Install + +You will need a [Go compiler][] on the local system: $ which go >/dev/null || echo "Go compiler is not in PATH" +[Go compiler]: https://go.dev/doc/install + Install `onion-grab`: $ go install git.cs.kau.se/rasmoste/onion-grab@latest @@ -25,137 +27,100 @@ List all options: ### Basic usage -Store domains in a file; one domain per line: +Store one domain per line in a file: $ cat domains.lst www.eff.org www.qubes-os.org www.torproject.org + +Run onion-grab with default parameters: + $ onion-grab -i domains.lst - 2023/03/25 17:43:30 INFO: starting await handler, ctrl+C to exit - 2023/03/25 17:43:30 INFO: starting 2 workers - 2023/03/25 17:43:30 INFO: starting work aggregator - 2023/03/25 17:43:30 INFO: generating work + 2023/04/07 20:29:45 INFO: ctrl+C to exit prematurely + 2023/04/07 20:29:45 INFO: starting 128 workers with limit 64/s + 2023/04/07 20:29:45 INFO: starting work receiver + 2023/04/07 20:29:45 INFO: starting work generator www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/ www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute= 
- 2023/03/25 17:43:40 INFO: about to exit, reading remaining answers - 2023/03/25 17:43:50 SUMMARY: 3/3 connected, 2 sites configured Onion-Location + 2023/04/07 20:29:50 INFO: metrics@receiver: + + Processed: 3 + Success: 3 (Onion-Location:2) + Failure: 0 (See breakdown below) + Req: 0 (Before sending request) + DNS: 0 (NotFound:0 Timeout:0 Other:0) + TCP: 0 (Timeout:0 Syscall:0) + TLS: 0 (Cert:0 Other:0) + 3xx: 0 (Too many redirects) + EOF: 0 (Unclear meaning) + CTX: 0 (Deadline exceeded) + ???: 0 (Other errors) + + 2023/04/07 20:29:51 INFO: about to exit in at most 11s, reading remaining answers + 2023/04/07 20:29:57 INFO: metrics@receiver: summary: + + Processed: 3 + Success: 3 (Onion-Location:2) + Failure: 0 (See breakdown below) + Req: 0 (Before sending request) + DNS: 0 (NotFound:0 Timeout:0 Other:0) + TCP: 0 (Timeout:0 Syscall:0) + TLS: 0 (Cert:0 Other:0) + 3xx: 0 (Too many redirects) + EOF: 0 (Unclear meaning) + CTX: 0 (Deadline exceeded) + ???: 0 (Other errors) + + 2023/04/07 20:29:57 INFO: measurement duration was 12s Sites with Onion-Location are printed to stdout, here showing that `www.torproject.org` configures it with an HTTP header while `www.qubes-os.org` does it with an HTML attribute. All three sites connected successfully. -### Working with a larger list - -Below the [Tranco top-1M][] list is used as an example; 100 workers are -specified, metrics are printed every 15s, and sanity-checks against a site with -Onion-Location which should be reachable are carried out every 60s. 
- - $ cut -d ',' -f2 top-1m.csv > top-1m.lst - $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se - 2023/03/25 17:44:20 INFO: starting await handler, ctrl+C to exit - 2023/03/25 17:44:20 INFO: starting checker - 2023/03/25 17:44:20 INFO: starting 100 workers - 2023/03/25 17:44:20 INFO: starting work aggregator - 2023/03/25 17:44:20 INFO: generating work - nytimes.com header=https://www.nytimesn7cgmftshazwhfgzm37qxb44r64ytbb2dj3x62d2lljsciiyd.onion/ attribute= - twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/ - theguardian.com header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute= - 2023/03/25 17:44:22 Transport: unhandled response frame type *http.http2UnknownFrame - 2023/03/25 17:44:31 Transport: unhandled response frame type *http.http2UnknownFrame - dw.com header=https://www.dwnewsgngmhlplxy6o2twtfgjnrnjxbegbwqx6wnotdhkzt562tszfid.onion/ attribute= - brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute= - 2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start - guardian.co.uk header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute= - proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute= - voanews.com header=https://www.voanews5aitmne6gs2btokcacixclgfl43cv27sirgbauyyjylwpdtqd.onion/ attribute= - 2023/03/25 17:44:50 INFO: currently 64.3 sites/s, 68.3 sites/s since start - ^C2023/03/25 17:44:51 INFO: about to exit, reading remaining answers - 2023/03/25 17:44:51 NOTICE: only read up until line 2089 - 2023/03/25 17:45:01 SUMMARY: 1488/2089 connected, 8 sites configured Onion-Location +In case of errors, the type of error is identified with relatively few `???`. 
-[Tranco top-1M]: https://tranco-list.eu/latest_list +### Scripts -Note that `ctrl+C` can be used to exit early as shown above. To continue from -where you left off (line `2089`), specify the `-n` option on the next run: - - $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se -n 2089 - 2023/03/25 17:45:57 INFO: starting await handler, ctrl+C to exit - 2023/03/25 17:45:57 INFO: starting checker - 2023/03/25 17:45:57 INFO: starting 100 workers - 2023/03/25 17:45:57 INFO: starting work aggregator - 2023/03/25 17:45:57 INFO: generating work - cia.gov header= attribute=http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion - 2023/03/25 17:46:12 INFO: currently 79.9 sites/s, 79.9 sites/s since start - propublica.org header=http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ attribute= - theintercept.com header=https://54dus3ggt7uxz7wjvhkia2ntxmz5lkhbvgohrwur43trt3d6vrcvfmqd.onion/ attribute= - torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute= - 2023/03/25 17:46:27 INFO: currently 75.1 sites/s, 77.5 sites/s since start - ^C2023/03/25 17:46:28 INFO: about to exit, reading remaining answers - 2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start) - 2023/03/25 17:46:38 SUMMARY: 1609/2399 connected, 4 sites configured Onion-Location - -## Known issues - -### Too many parallel workers - -Here's what would happen if the local system cannot handle the number of workers: - - $ onion-grab -i top-1m.lst -w 1000 -m 15s -C 60s -c rgdd.se - 2023/03/25 17:47:36 INFO: starting await handler, ctrl+C to exit - 2023/03/25 17:47:36 INFO: starting checker - 2023/03/25 17:47:36 INFO: starting 1000 workers - 2023/03/25 17:47:36 INFO: starting work aggregator - 2023/03/25 17:47:36 INFO: generating work - 2023/03/25 17:47:36 Transport: unhandled response frame type *http.http2UnknownFrame - twitter.com header= 
attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/ - 2023/03/25 17:47:37 Transport: unhandled response frame type *http.http2UnknownFrame - brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute= - 2023/03/25 17:47:51 INFO: currently 151.3 sites/s, 151.3 sites/s since start - 2023/03/25 17:48:06 INFO: currently 78.1 sites/s, 114.7 sites/s since start - 2023/03/25 17:48:21 INFO: currently 121.9 sites/s, 117.1 sites/s since start - 2023/03/25 17:48:36 INFO: currently 78.1 sites/s, 107.4 sites/s since start - 2023/03/25 17:48:46 ERROR: checker expected onion for {Domain:rgdd.se OK:false HTTP: HTML:} - 2023/03/25 17:48:46 NOTICE: only read up until line 7442 - 2023/03/25 17:48:46 INFO: about to exit, reading remaining answers - 2023/03/25 17:48:56 SUMMARY: 232/7442 connected, 2 sites configured Onion-Location - -This is most likely an OS problem; not an onion-grab problem. Debug hints: - - - Stop and disable `systemd-resolved`, then specify a recursive resolver that - can handle the expected load. - - You may need to tinker with kernel tunables, see `ulimit -a` and `sysctl -a` - for what can be configured. For example, if you find that the error is - caused by too many open files, try increasing the value of `ulimit -n`. - -**Credit:** Björn Töpel helped debugging this issue. - -**Note:** domains with Onion-Location are likely to be missed if `-n 7442` is -used here in a subsequent run. For example, with `-C 60s` and an average of 100 -domains/s, it would be wise to roll-back _at least_ 6000 lines. This should be -a last-resort option, and is mainly here to sanity-check long measurements. - -### Misc notes - -We use the default `net.Dial` function, which in turn uses -[goLookupIPCNAMEOrder][] for DNS lookups with the recursive name servers in -`/etc/resolve.conf`. 
For example, with - - $ cat /etc/resolve.conf - nameserver 8.8.8.8 - nameserver 8.8.4.4 - -the query will first be directed to `8.8.8.8`; then `8.8.4.4` if no valid answer -is available yet ([lines 663-778][]). - -[goLookupIPCNAMEOrder]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L595-L804 -[lines 663-778]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L663-L778 - -Also note that the default settings makes us follow redirects (at most 10). And -that 10MiB for MaxResponseHeaderBytes should be [conservative][]; so it seems -like we should not be too strict if setting this to 32MiB. - -[conservative]: https://go-review.googlesource.com/c/go/+/21329/2/src/net/http/transport.go +Digest the results, here stored as `onion-grab.stdout`: + + $ cat onion-grab.stdout + www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/ + www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute= + $ ./scripts/digest.py -i onion-grab.stdout + digest.py:25 INFO: found 1 HTTP headers with Onion-Location + digest.py:26 INFO: found 1 HTML meta attributes with Onion-Location + digest.py:27 INFO: found 2 unqiue domain names that set Onion-Location + digest.py:28 INFO: found 2 unique two-label onion addresses in the process + digest.py:30 INFO: storing domains with valid Onion-Location configurations in domains.txt + digest.py:35 INFO: storing two-label onion addresses that domains referenced in onions.txt + $ cat domains.txt + www.qubes-os.org http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/ + www.torproject.org http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html + $ cat onions.txt + qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion www.qubes-os.org + 2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion 
www.torproject.org + +In other words, the digest script prints some information and writes two files: + + - `domains.txt`: domains that configured valid Onion-Location headers. The + listed Onion-Location values are de-duplicated and space-separated. + - `onions.txt`: two-label `.onion` addresses that were discovered. The listed + domains referenced this address in their Onion-Location configuration, + possibly with subdomains, paths, etc., that were removed. Such pruning of + the set Onion-Location values is useful to estimate the number of onions. + +See [scripts/test.sh](./scripts/test.sh) if you are looking to test +different `onion-grab` configurations. You may find +[scripts/measure.sh](scripts/measure.sh) to be a useful measurement script. + +## Running a larger measurement + +See [docs/operations.md](TODO) +for measurements of [Tranco top-1M][] and [ct-sans][]. + +[Tranco top-1M]: https://tranco-list.eu/latest_list +[ct-sans]: https://git.cs.kau.se/rasmoste/ct-sans/-/blob/main/docs/operations.md ## Contact diff --git a/docs/notes.md b/docs/notes.md new file mode 100644 index 0000000..d95c6f0 --- /dev/null +++ b/docs/notes.md @@ -0,0 +1,24 @@ +# Notes + +`onion-grab` uses the default `net.Dial` function, which in turn uses +[goLookupIPCNAMEOrder][] for DNS lookups with the recursive name servers in +`/etc/resolv.conf`. For example, with + + $ cat /etc/resolv.conf + nameserver 8.8.8.8 + nameserver 8.8.4.4 + +[goLookupIPCNAMEOrder]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L595-L804 + +the query will first be directed to `8.8.8.8`; then `8.8.4.4` if no valid answer +is available yet ([lines 663-778][]). If you are running `onion-grab` with +[Mullvad VPN][], specify custom DNS: `mullvad dns set custom 8.8.8.8 8.8.4.4`. 
+ +[lines 663-778]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L663-L778 +[Mullvad VPN]: https://www.mullvad.net/ + +Further, default settings are used to follow at most 10 HTTP 3XX redirects. A +[conservative][] value for the `MaxResponseHeaderBytes` option is 10MiB; the +`onion-grab` default is 16MiB and our measurements bumped this up to 64MiB. + +[conservative]: https://go-review.googlesource.com/c/go/+/21329/2/src/net/http/transport.go diff --git a/docs/operations.md b/docs/operations.md new file mode 100644 index 0000000..1528c32 --- /dev/null +++ b/docs/operations.md @@ -0,0 +1,3 @@ +# Operations + +Placeholder. diff --git a/docs/setup.md b/docs/setup.md new file mode 100644 index 0000000..7e4bdb8 --- /dev/null +++ b/docs/setup.md @@ -0,0 +1,19 @@ +# Setup + +`onion-grab` has been tested on Ubuntu/Debian based systems. If you are running +a large measurement, you may run into issues that are **OS related**. + +## Hints + + - We disabled and stopped `systemd-resolved`, which eventually causes some or + all DNS requests to be blocked when running with many concurrent workers. + - We used Google's `8.8.8.8` and `8.8.4.4`, which [supports 1500qps][] per IP. + - You may need to tinker with `ulimit` and `sysctl`, e.g., if observing that + there are too many open file descriptors or similar. See for example the + value of `ulimit -n` and `sysctl net.ipv4.ip_local_port_range` + +[supports 1500qps]: https://developers.google.com/speed/public-dns/docs/isp + +## Credit + +Björn Töpel helped us debug some of these OS-related issues. -- cgit v1.2.3