| -rw-r--r-- | README.md | 203 |
| -rw-r--r-- | docs/notes.md | 24 |
| -rw-r--r-- | docs/operations.md | 3 |
| -rw-r--r-- | docs/setup.md | 19 |
4 files changed, 130 insertions, 119 deletions
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -7,14 +7,16 @@ A tool that visits a list of domains over HTTPS to see if they have
 
 **Warning:** research prototype.  The source code may also be moved.
 
-**TODO:** update and clean-up README.
-
 ## Quickstart
 
-You will need a Go compiler on the local system:
+### Install
+
+You will need a [Go compiler][] on the local system:
 
     $ which go >/dev/null || echo "Go compiler is not in PATH"
 
+[Go compiler]: https://go.dev/doc/install
+
 Install `onion-grab`:
 
     $ go install git.cs.kau.se/rasmoste/onion-grab@latest
@@ -25,137 +27,100 @@ List all options:
 
 ### Basic usage
 
-Store domains in a file; one domain per line:
+Store one domain per line in a file:
 
     $ cat domains.lst
     www.eff.org
     www.qubes-os.org
     www.torproject.org
+
+Run onion-grab with default parameters:
+
     $ onion-grab -i domains.lst
-    2023/03/25 17:43:30 INFO: starting await handler, ctrl+C to exit
-    2023/03/25 17:43:30 INFO: starting 2 workers
-    2023/03/25 17:43:30 INFO: starting work aggregator
-    2023/03/25 17:43:30 INFO: generating work
+    2023/04/07 20:29:45 INFO: ctrl+C to exit prematurely
+    2023/04/07 20:29:45 INFO: starting 128 workers with limit 64/s
+    2023/04/07 20:29:45 INFO: starting work receiver
+    2023/04/07 20:29:45 INFO: starting work generator
     www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
     www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
-    2023/03/25 17:43:40 INFO: about to exit, reading remaining answers
-    2023/03/25 17:43:50 SUMMARY: 3/3 connected, 2 sites configured Onion-Location
+    2023/04/07 20:29:50 INFO: metrics@receiver:
+    
+      Processed: 3
+        Success: 3 (Onion-Location:2)
+        Failure: 0 (See breakdown below)
+            Req: 0 (Before sending request)
+            DNS: 0 (NotFound:0 Timeout:0 Other:0)
+            TCP: 0 (Timeout:0 Syscall:0)
+            TLS: 0 (Cert:0 Other:0)
+            3xx: 0 (Too many redirects)
+            EOF: 0 (Unclear meaning)
+            CTX: 0 (Deadline exceeded)
+            ???: 0 (Other errors)
+    
+    2023/04/07 20:29:51 INFO: about to exit in at most 11s, reading remaining answers
+    2023/04/07 20:29:57 INFO: metrics@receiver: summary:
+    
+      Processed: 3
+        Success: 3 (Onion-Location:2)
+        Failure: 0 (See breakdown below)
+            Req: 0 (Before sending request)
+            DNS: 0 (NotFound:0 Timeout:0 Other:0)
+            TCP: 0 (Timeout:0 Syscall:0)
+            TLS: 0 (Cert:0 Other:0)
+            3xx: 0 (Too many redirects)
+            EOF: 0 (Unclear meaning)
+            CTX: 0 (Deadline exceeded)
+            ???: 0 (Other errors)
+    
+    2023/04/07 20:29:57 INFO: measurement duration was 12s
 
 Sites with Onion-Location are printed to stdout, here showing that
 `www.torproject.org` configures it with an HTTP header while `www.qubes-os.org`
 does it with an HTML attribute.  All three sites connected successfully.
 
-### Working with a larger list
-
-Below the [Tranco top-1M][] list is used as an example; 100 workers are
-specified, metrics are printed every 15s, and sanity-checks against a site with
-Onion-Location which should be reachable are carried out every 60s.
-
-    $ cut -d ',' -f2 top-1m.csv > top-1m.lst
-    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se
-    2023/03/25 17:44:20 INFO: starting await handler, ctrl+C to exit
-    2023/03/25 17:44:20 INFO: starting checker
-    2023/03/25 17:44:20 INFO: starting 100 workers
-    2023/03/25 17:44:20 INFO: starting work aggregator
-    2023/03/25 17:44:20 INFO: generating work
-    nytimes.com header=https://www.nytimesn7cgmftshazwhfgzm37qxb44r64ytbb2dj3x62d2lljsciiyd.onion/ attribute=
-    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
-    theguardian.com header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
-    2023/03/25 17:44:22 Transport: unhandled response frame type *http.http2UnknownFrame
-    2023/03/25 17:44:31 Transport: unhandled response frame type *http.http2UnknownFrame
-    dw.com header=https://www.dwnewsgngmhlplxy6o2twtfgjnrnjxbegbwqx6wnotdhkzt562tszfid.onion/ attribute=
-    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
-    2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start
-    guardian.co.uk header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
-    proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute=
-    voanews.com header=https://www.voanews5aitmne6gs2btokcacixclgfl43cv27sirgbauyyjylwpdtqd.onion/ attribute=
-    2023/03/25 17:44:50 INFO: currently 64.3 sites/s, 68.3 sites/s since start
-    ^C2023/03/25 17:44:51 INFO: about to exit, reading remaining answers
-    2023/03/25 17:44:51 NOTICE: only read up until line 2089
-    2023/03/25 17:45:01 SUMMARY: 1488/2089 connected, 8 sites configured Onion-Location
+In case of errors, the type of error is identified with relatively few `???`.
 
-[Tranco top-1M]: https://tranco-list.eu/latest_list
+### Scripts
 
-Note that `ctrl+C` can be used to exit early as shown above.  To continue from
-where you left off (line `2089`), specify the `-n` option on the next run:
-
-    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se -n 2089
-    2023/03/25 17:45:57 INFO: starting await handler, ctrl+C to exit
-    2023/03/25 17:45:57 INFO: starting checker
-    2023/03/25 17:45:57 INFO: starting 100 workers
-    2023/03/25 17:45:57 INFO: starting work aggregator
-    2023/03/25 17:45:57 INFO: generating work
-    cia.gov header= attribute=http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion
-    2023/03/25 17:46:12 INFO: currently 79.9 sites/s, 79.9 sites/s since start
-    propublica.org header=http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ attribute=
-    theintercept.com header=https://54dus3ggt7uxz7wjvhkia2ntxmz5lkhbvgohrwur43trt3d6vrcvfmqd.onion/ attribute=
-    torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
-    2023/03/25 17:46:27 INFO: currently 75.1 sites/s, 77.5 sites/s since start
-    ^C2023/03/25 17:46:28 INFO: about to exit, reading remaining answers
-    2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start)
-    2023/03/25 17:46:38 SUMMARY: 1609/2399 connected, 4 sites configured Onion-Location
-
-## Known issues
-
-### Too many parallel workers
-
-Here's what would happen if the local system cannot handle the number of workers:
-
-    $ onion-grab -i top-1m.lst -w 1000 -m 15s -C 60s -c rgdd.se
-    2023/03/25 17:47:36 INFO: starting await handler, ctrl+C to exit
-    2023/03/25 17:47:36 INFO: starting checker
-    2023/03/25 17:47:36 INFO: starting 1000 workers
-    2023/03/25 17:47:36 INFO: starting work aggregator
-    2023/03/25 17:47:36 INFO: generating work
-    2023/03/25 17:47:36 Transport: unhandled response frame type *http.http2UnknownFrame
-    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
-    2023/03/25 17:47:37 Transport: unhandled response frame type *http.http2UnknownFrame
-    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
-    2023/03/25 17:47:51 INFO: currently 151.3 sites/s, 151.3 sites/s since start
-    2023/03/25 17:48:06 INFO: currently 78.1 sites/s, 114.7 sites/s since start
-    2023/03/25 17:48:21 INFO: currently 121.9 sites/s, 117.1 sites/s since start
-    2023/03/25 17:48:36 INFO: currently 78.1 sites/s, 107.4 sites/s since start
-    2023/03/25 17:48:46 ERROR: checker expected onion for {Domain:rgdd.se OK:false HTTP: HTML:}
-    2023/03/25 17:48:46 NOTICE: only read up until line 7442
-    2023/03/25 17:48:46 INFO: about to exit, reading remaining answers
-    2023/03/25 17:48:56 SUMMARY: 232/7442 connected, 2 sites configured Onion-Location
-
-This is most likely an OS problem; not an onion-grab problem.  Debug hints:
-
-  - Stop and disable `systemd-resolved`, then specify a recursive resolver that
-    can handle the expected load.
-  - You may need to tinker with kernel tunables, see `ulimit -a` and `sysctl -a`
-    for what can be configured.  For example, if you find that the error is
-    caused by too many open files, try increasing the value of `ulimit -n`.
-
-**Credit:** Björn Töpel helped debugging this issue.
-
-**Note:** domains with Onion-Location are likely to be missed if `-n 7442` is
-used here in a subsequent run.  For example, with `-C 60s` and an average of 100
-domains/s, it would be wise to roll-back _at least_ 6000 lines.  This should be
-a last-resort option, and is mainly here to sanity-check long measurements.
-
-### Misc notes
-
-We use the default `net.Dial` function, which in turn uses
-[goLookupIPCNAMEOrder][] for DNS lookups with the recursive name servers in
-`/etc/resolve.conf`.  For example, with
-
-    $ cat /etc/resolve.conf
-    nameserver 8.8.8.8
-    nameserver 8.8.4.4
-
-the query will first be directed to `8.8.8.8`; then `8.8.4.4` if no valid answer
-is available yet ([lines 663-778][]).
-
-[goLookupIPCNAMEOrder]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L595-L804
-[lines 663-778]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L663-L778
-
-Also note that the default settings makes us follow redirects (at most 10).  And
-that 10MiB for MaxResponseHeaderBytes should be [conservative][]; so it seems
-like we should not be too strict if setting this to 32MiB.
-
-[conservative]: https://go-review.googlesource.com/c/go/+/21329/2/src/net/http/transport.go
+Digest the results, here stored as `onion-grab.stdout`:
+
+    $ cat onion-grab.stdout
+    www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
+    www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
+    $ ./scripts/digest.py -i onion-grab.stdout
+    digest.py:25 INFO: found 1 HTTP headers with Onion-Location
+    digest.py:26 INFO: found 1 HTML meta attributes with Onion-Location
+    digest.py:27 INFO: found 2 unqiue domain names that set Onion-Location
+    digest.py:28 INFO: found 2 unique two-label onion addresses in the process
+    digest.py:30 INFO: storing domains with valid Onion-Location configurations in domains.txt
+    digest.py:35 INFO: storing two-label onion addresses that domains referenced in onions.txt
+    $ cat domains.txt
+    www.qubes-os.org http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
+    www.torproject.org http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html
+    $ cat onions.txt
+    qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion www.qubes-os.org
+    2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion www.torproject.org
+
+In other words, the digest script prints some information and writes two files:
+
+  - `domains.txt`: domains that configured valid Onion-Location headers.  The
+    listed Onion-Location values are de-duplicated and space-separated.
+  - `onions.txt`: two-label `.onion` addresses that were discovered.  The listed
+    domains referenced this address in their Onion-Location configuration,
+    possibly with subdomains, paths, etc., that were removed.  Such pruning of
+    the set Onion-Location values is useful to estimate the number of onions.
+
+See [scripts/test.sh](./scripts/test.sh) and if you are looking to test
+different `onion-grab` configuration.  You may find
+[scripts/measure.sh](scripts/measure.sh) to be a useful measurement script.
+
+## Running a larger measurement
+
+See [docs/operations.md](TODO)
+for measurements of [Tranco top-1M][] and [ct-sans][].
+
+[Tranco top-1M]: https://tranco-list.eu/latest_list
+[ct-sans]: https://git.cs.kau.se/rasmoste/ct-sans/-/blob/main/docs/operations.md
 
 ## Contact
 
diff --git a/docs/notes.md b/docs/notes.md
new file mode 100644
index 0000000..d95c6f0
--- /dev/null
+++ b/docs/notes.md
@@ -0,0 +1,24 @@
+# Notes
+
+`onion-grab` uses use the default `net.Dial` function, which in turn uses
+[goLookupIPCNAMEOrder][] for DNS lookups with the recursive name servers in
+`/etc/resolve.conf`.  For example, with
+
+    $ cat /etc/resolve.conf
+    nameserver 8.8.8.8
+    nameserver 8.8.4.4
+
+[goLookupIPCNAMEOrder]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L595-L804
+
+the query will first be directed to `8.8.8.8`; then `8.8.4.4` if no valid answer
+is available yet ([lines 663-778][]).  If you are running `onion-grab` with
+[Mullvad VPN][], specify custom DNS: `mullvad dns set custom 8.8.8.8 8.8.4.4`.
+
+[lines 663-778]: https://github.com/golang/go/blob/8edcdddb23c6d3f786b465c43b49e8d9a0015082/src/net/dnsclient_unix.go#L663-L778
+[Mullvad VPN]: https://www.mullvad.net/
+
+Further, default settings are used to follow at most 10 HTTP 3XX redirects.  A
+[conservative][] value for the `MaxResponseHeaderBytes` option is 10MiB; the
+`onion-grab` default is 16MiB and our measurements bumped this up to 64MiB.
+
+[conservative]: https://go-review.googlesource.com/c/go/+/21329/2/src/net/http/transport.go
diff --git a/docs/operations.md b/docs/operations.md
new file mode 100644
index 0000000..1528c32
--- /dev/null
+++ b/docs/operations.md
@@ -0,0 +1,3 @@
+# Operations
+
+Placeholder.
diff --git a/docs/setup.md b/docs/setup.md
new file mode 100644
index 0000000..7e4bdb8
--- /dev/null
+++ b/docs/setup.md
@@ -0,0 +1,19 @@
+# Setup
+
+`onion-grab` has been tested on Ubuntu/Debian based systems.  If you are running
+a large measurement, you may run into issues that are **OS related**.
+
+## Hints
+
+  - We disabled and stopped `systemd-resolved`, which eventually causes some or
+    all DNS requests to be blocked when running with many concurrent workers.
+  - We used Google's `8.8.8.8` and `8.8.4.4`, which [supports 1500qps][] per IP.
+  - You may need to tinker with `ulimit` and `sysctl`, e.g., if observing that
+    there are too many open file descriptors or similar.  See for example the
+    value of `ulimit -n` and `sysctl net.ipv4.ip_local_port_range`
+
+[supports 1500qps]: https://developers.google.com/speed/public-dns/docs/isp
+
+## Credit
+
+Björn Töpel helped us debug some of these OS-related issues.
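The README text added in this commit describes the digest step as pruning Onion-Location values down to two-label `.onion` addresses. A minimal sketch of that pruning is given below for illustration. It is not the contents of `scripts/digest.py`: the function names `parse_line`, `two_label_onion`, and `digest` are hypothetical, the sketch assumes every input line has the `domain header=... attribute=...` shape shown in the sample output, and it does not check that an onion address is otherwise well-formed.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def parse_line(line):
    """Return (domain, [Onion-Location values]) for one onion-grab stdout line.

    Assumes the "domain header=<url> attribute=<url>" shape from the README.
    """
    domain, rest = line.strip().split(" ", 1)
    header_part, _, attribute = rest.partition(" attribute=")
    header = header_part[len("header="):]
    return domain, [value for value in (header, attribute) if value]


def two_label_onion(url):
    """Drop scheme, subdomains, port, and path; return None for non-onion hosts."""
    host = urlsplit(url).hostname
    if host is None or not host.endswith(".onion"):
        return None
    return ".".join(host.split(".")[-2:])


def digest(lines):
    """Map each two-label onion address to the sorted domains that referenced it."""
    onions = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        domain, values = parse_line(line)
        for value in values:
            onion = two_label_onion(value)
            if onion is not None:
                onions[onion].add(domain)
    return {onion: sorted(domains) for onion, domains in onions.items()}


if __name__ == "__main__":
    sample = [
        "www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/",
        "www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=",
    ]
    for onion, domains in digest(sample).items():
        print(onion, " ".join(domains))
```

Grouping by the two-label address rather than the full URL means that `www.`-prefixed onion hosts and differing paths collapse to one entry, which is the property the README relies on when estimating the number of onions.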
