# onion-grab

A tool that visits a list of domains over HTTPS to see if they have
[Onion-Location][] configured.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/

**Warning:** research prototype.  The source code may also be moved.

## Quickstart

You will need a Go compiler on the local system:

    $ which go >/dev/null || echo "Go compiler is not in PATH"

Install `onion-grab`:

    $ go install git.cs.kau.se/rasmoste/onion-grab@latest

List all options:

    $ onion-grab -h

### Basic usage

Store domains in a file, one domain per line:

    $ cat domains.lst
    www.eff.org
    www.qubes-os.org
    www.torproject.org
    $ onion-grab -i domains.lst
    2023/03/25 17:43:30 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:43:30 INFO: starting 2 workers
    2023/03/25 17:43:30 INFO: starting work aggregator
    2023/03/25 17:43:30 INFO: generating work
    www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
    www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
    2023/03/25 17:43:40 INFO: about to exit, reading remaining answers
    2023/03/25 17:43:50 SUMMARY: 3/3 connected, 2 sites configured Onion-Location

Sites with Onion-Location are printed to stdout, here showing that
`www.torproject.org` configures it with an HTTP header while `www.qubes-os.org`
does it with an HTML attribute.  All three sites connected successfully.
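
Each result line has the form `<domain> header=<URL> attribute=<URL>`, which
makes the output easy to post-process.  As a minimal sketch, the helper below
labels each domain by how it announces Onion-Location.  The function name
`onion_methods` is hypothetical and not part of onion-grab; it assumes the line
format shown above and that the values contain no spaces.

```shell
# onion_methods: hypothetical helper, not part of onion-grab.
# Reads result lines on stdin and prints "<domain> <method>", where method
# is "header", "attribute", or "both".  Lines that lack the two fields
# (such as log lines) are skipped.
onion_methods() {
    awk '
        $2 ~ /^header=/ && $3 ~ /^attribute=/ {
            h = ($2 != "header=")      # 1 if the HTTP header carried a value
            a = ($3 != "attribute=")   # 1 if the HTML attribute carried a value
            if (h && a)  print $1, "both"
            else if (h)  print $1, "header"
            else if (a)  print $1, "attribute"
        }
    '
}
```

One way to use it would be `onion-grab -i domains.lst | onion_methods`.  Any
log lines that end up in the same stream are ignored by the pattern.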

### Working with a larger list

The example below uses the [Tranco top-1M][] list with 100 workers (`-w`),
metrics printed every 15s (`-m`), and a sanity check every 60s (`-C`) against a
site (`-c`) that configures Onion-Location and should be reachable.

    $ cut -d ',' -f2 top-1m.csv > top-1m.lst
    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se
    2023/03/25 17:44:20 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:44:20 INFO: starting checker
    2023/03/25 17:44:20 INFO: starting 100 workers
    2023/03/25 17:44:20 INFO: starting work aggregator
    2023/03/25 17:44:20 INFO: generating work
    nytimes.com header=https://www.nytimesn7cgmftshazwhfgzm37qxb44r64ytbb2dj3x62d2lljsciiyd.onion/ attribute=
    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
    theguardian.com header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
    2023/03/25 17:44:22 Transport: unhandled response frame type *http.http2UnknownFrame
    2023/03/25 17:44:31 Transport: unhandled response frame type *http.http2UnknownFrame
    dw.com header=https://www.dwnewsgngmhlplxy6o2twtfgjnrnjxbegbwqx6wnotdhkzt562tszfid.onion/ attribute=
    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
    2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start
    guardian.co.uk header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
    proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute=
    voanews.com header=https://www.voanews5aitmne6gs2btokcacixclgfl43cv27sirgbauyyjylwpdtqd.onion/ attribute=
    2023/03/25 17:44:50 INFO: currently 64.3 sites/s, 68.3 sites/s since start
    ^C2023/03/25 17:44:51 INFO: about to exit, reading remaining answers
    2023/03/25 17:44:51 NOTICE: only read up until line 2089
    2023/03/25 17:45:01 SUMMARY: 1488/2089 connected, 8 sites configured Onion-Location

[Tranco top-1M]: https://tranco-list.eu/latest_list

Note that `ctrl+C` can be used to exit early as shown above.  To continue from
where you left off (line `2089`), specify the `-n` option on the next run:

    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se -n 2089
    2023/03/25 17:45:57 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:45:57 INFO: starting checker
    2023/03/25 17:45:57 INFO: starting 100 workers
    2023/03/25 17:45:57 INFO: starting work aggregator
    2023/03/25 17:45:57 INFO: generating work
    cia.gov header= attribute=http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion
    2023/03/25 17:46:12 INFO: currently 79.9 sites/s, 79.9 sites/s since start
    propublica.org header=http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ attribute=
    theintercept.com header=https://54dus3ggt7uxz7wjvhkia2ntxmz5lkhbvgohrwur43trt3d6vrcvfmqd.onion/ attribute=
    torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
    2023/03/25 17:46:27 INFO: currently 75.1 sites/s, 77.5 sites/s since start
    ^C2023/03/25 17:46:28 INFO: about to exit, reading remaining answers
    2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start)
    2023/03/25 17:46:38 SUMMARY: 1609/2399 connected, 4 sites configured Onion-Location
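
The value for `-n` can also be pulled out of a saved log instead of being
copied by hand.  The sketch below assumes that the log output of the previous
run was saved to a file; the file name `run.log` and the function name
`resume_line` are hypothetical, and how the log is captured depends on your
setup.

```shell
# resume_line: hypothetical helper, not part of onion-grab.
# Reads saved log output on stdin and prints the absolute line number from
# the last "NOTICE: only read up until line N" message, if there is one.
resume_line() {
    sed -n 's/.*NOTICE: only read up until line \([0-9][0-9]*\).*/\1/p' | tail -n 1
}
```

For example, `onion-grab -i top-1m.lst -n "$(resume_line < run.log)"` would
resume from the line reported by the previous run.  When the message also
contains a relative line number, as in the second run above, only the absolute
one is extracted.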

## Known issues

### Too many parallel workers

Here is what happens when the local system cannot handle the number of workers:

    $ onion-grab -i top-1m.lst -w 1000 -m 15s -C 60s -c rgdd.se
    2023/03/25 17:47:36 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:47:36 INFO: starting checker
    2023/03/25 17:47:36 INFO: starting 1000 workers
    2023/03/25 17:47:36 INFO: starting work aggregator
    2023/03/25 17:47:36 INFO: generating work
    2023/03/25 17:47:36 Transport: unhandled response frame type *http.http2UnknownFrame
    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
    2023/03/25 17:47:37 Transport: unhandled response frame type *http.http2UnknownFrame
    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
    2023/03/25 17:47:51 INFO: currently 151.3 sites/s, 151.3 sites/s since start
    2023/03/25 17:48:06 INFO: currently 78.1 sites/s, 114.7 sites/s since start
    2023/03/25 17:48:21 INFO: currently 121.9 sites/s, 117.1 sites/s since start
    2023/03/25 17:48:36 INFO: currently 78.1 sites/s, 107.4 sites/s since start
    2023/03/25 17:48:46 ERROR: checker expected onion for {Domain:rgdd.se OK:false HTTP: HTML:}
    2023/03/25 17:48:46 NOTICE: only read up until line 7442
    2023/03/25 17:48:46 INFO: about to exit, reading remaining answers
    2023/03/25 17:48:56 SUMMARY: 232/7442 connected, 2 sites configured Onion-Location

This is most likely an OS problem, not an onion-grab problem.  Debug hints:

  - Stop and disable `systemd-resolved`, then specify a recursive resolver that
    can handle the expected load.
  - You may need to tinker with kernel tunables, see `ulimit -a` and `sysctl -a`
    for what can be configured.  For example, if you find that the error is
    caused by too many open files, try increasing the value of `ulimit -n`.

**Credit:** Björn Töpel helped debug this issue.

**Note:** domains with Onion-Location are likely to be missed if `-n 7442` is
used here in a subsequent run, because the failures may have started any time
after the last successful check.  For example, with `-C 60s` and an average of
100 domains/s, it would be wise to roll back _at least_ 6000 lines (60s times
100 domains/s).  This should be a last-resort option, and is mainly here to
sanity-check long measurements.
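
The roll-back arithmetic from the note above can be sketched as a small shell
function.  The name `rollback_start` is hypothetical and not part of
onion-grab.  It computes only the minimum roll-back of one full check interval
and expects whole numbers.

```shell
# rollback_start: hypothetical helper, not part of onion-grab.
# Usage: rollback_start <stopped-line> <check-interval-seconds> <sites-per-second>
# Prints the stopped line rolled back by one check interval's worth of
# lines, never going below 0.
rollback_start() {
    stopped=$1 interval=$2 rate=$3
    start=$(( stopped - interval * rate ))
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    echo "$start"
}
```

With the numbers from the failed run, `rollback_start 7442 60 100` prints
`1442`.  A result of `0` means that the roll-back reaches past the start of the
list, so the whole list should be measured again.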


## Contact

  - rasmus (at) rgdd (dot) se

## Licence

BSD 2-Clause License