# onion-grab

A tool that visits a list of domains over HTTPS to see if they have
[Onion-Location][] configured.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/

**Warning:** this is a research prototype.  The source code may also be moved.

## Quickstart

You will need a Go compiler on the local system:

    $ which go >/dev/null || echo "Go compiler is not in PATH"

Install `onion-grab`:

    $ go install git.cs.kau.se/rasmoste/onion-grab@latest

List all options:

    $ onion-grab -h

### Basic usage

Store domains in a file; one domain per line:

    $ cat domains.lst
    www.eff.org
    www.qubes-os.org
    www.torproject.org
    $ onion-grab -i domains.lst
    2023/03/25 17:43:30 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:43:30 INFO: starting 2 workers
    2023/03/25 17:43:30 INFO: starting work aggregator
    2023/03/25 17:43:30 INFO: generating work
    www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
    www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
    2023/03/25 17:43:40 INFO: about to exit, reading remaining answers
    2023/03/25 17:43:50 SUMMARY: 3/3 connected, 2 sites configured Onion-Location

Sites with Onion-Location are printed to stdout.  Here, `www.torproject.org`
configures it with an HTTP header while `www.qubes-os.org` does it with an HTML
attribute.  All three sites connected successfully; `www.eff.org` is not listed
because no Onion-Location was found for it.
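
The two configuration styles can be told apart with a few lines of shell.  The
sketch below is not how `onion-grab` itself is implemented: the function names
`onion_header` and `onion_attribute` are made up for illustration, and the
pattern matching assumes a double-quoted `<meta>` tag that fits on one line.

```shell
# Hypothetical helpers, not part of onion-grab.

# Print the value of the first Onion-Location header found in raw HTTP
# response headers on stdin (header names are matched case-insensitively).
onion_header() {
    tr -d '\r' | grep -i '^onion-location:' |
        sed 's/^[^:]*:[[:space:]]*//' | head -n 1
}

# Print the content of the first <meta http-equiv="onion-location"> tag
# found in HTML on stdin.
onion_attribute() {
    tr -d '\r' | grep -i -o '<meta[^>]*http-equiv="onion-location"[^>]*>' |
        sed -n 's/.*content="\([^"]*\)".*/\1/p' | head -n 1
}

# Example input modelled on the two sites from the output above.
printf 'HTTP/1.1 200 OK\r\nOnion-Location: http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html\r\n\r\n' |
    onion_header
printf '<head><meta http-equiv="onion-location" content="http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/"></head>\n' |
    onion_attribute
```

Against a live site, something like `curl -sI https://www.torproject.org | onion_header`
should work for the header case, assuming `curl` is installed.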

### Working with a larger list

Below, the [Tranco top-1M][] list is used as an example.  The run uses 100
workers (`-w 100`), prints metrics every 15s (`-m 15s`), and every 60s
(`-C 60s`) sanity-checks that a site known to configure Onion-Location
(`-c rgdd.se`) is still reachable.

    $ cut -d ',' -f2 top-1m.csv > top-1m.lst
    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se
    2023/03/25 17:44:20 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:44:20 INFO: starting checker
    2023/03/25 17:44:20 INFO: starting 100 workers
    2023/03/25 17:44:20 INFO: starting work aggregator
    2023/03/25 17:44:20 INFO: generating work
    nytimes.com header=https://www.nytimesn7cgmftshazwhfgzm37qxb44r64ytbb2dj3x62d2lljsciiyd.onion/ attribute=
    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
    theguardian.com header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
    2023/03/25 17:44:22 Transport: unhandled response frame type *http.http2UnknownFrame
    2023/03/25 17:44:31 Transport: unhandled response frame type *http.http2UnknownFrame
    dw.com header=https://www.dwnewsgngmhlplxy6o2twtfgjnrnjxbegbwqx6wnotdhkzt562tszfid.onion/ attribute=
    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
    2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start
    guardian.co.uk header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
    proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute=
    voanews.com header=https://www.voanews5aitmne6gs2btokcacixclgfl43cv27sirgbauyyjylwpdtqd.onion/ attribute=
    2023/03/25 17:44:50 INFO: currently 64.3 sites/s, 68.3 sites/s since start
    ^C2023/03/25 17:44:51 INFO: about to exit, reading remaining answers
    2023/03/25 17:44:51 NOTICE: only read up until line 2089
    2023/03/25 17:45:01 SUMMARY: 1488/2089 connected, 8 sites configured Onion-Location

[Tranco top-1M]: https://tranco-list.eu/latest_list
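
For larger runs it can help to turn the result lines into a table.  The sketch
below assumes the output was saved with log lines and result lines mixed
together, as in the listing above; `results_to_tsv` is a hypothetical helper
name, not something `onion-grab` provides.

```shell
# Hypothetical helper, not part of onion-grab.

# Read onion-grab output on stdin and print one tab-separated row per
# result line: domain, header value, attribute value.  Log lines are
# skipped because their second field is a timestamp, not "header=...".
results_to_tsv() {
    awk 'NF == 3 && $2 ~ /^header=/ && $3 ~ /^attribute=/ {
        sub(/^header=/, "", $2)
        sub(/^attribute=/, "", $3)
        printf "%s\t%s\t%s\n", $1, $2, $3
    }'
}

# Example input copied from the run above.
results_to_tsv <<'EOF'
2023/03/25 17:44:20 INFO: generating work
twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start
proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute=
EOF
```

An empty second or third column means that the site did not use that
configuration style.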

Note that `ctrl+C` can be used to exit early as shown above.  To continue from
where you left off (line `2089`), specify the `-n` option on the next run:

    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se -n 2089
    2023/03/25 17:45:57 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:45:57 INFO: starting checker
    2023/03/25 17:45:57 INFO: starting 100 workers
    2023/03/25 17:45:57 INFO: starting work aggregator
    2023/03/25 17:45:57 INFO: generating work
    cia.gov header= attribute=http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion
    2023/03/25 17:46:12 INFO: currently 79.9 sites/s, 79.9 sites/s since start
    propublica.org header=http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ attribute=
    theintercept.com header=https://54dus3ggt7uxz7wjvhkia2ntxmz5lkhbvgohrwur43trt3d6vrcvfmqd.onion/ attribute=
    torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
    2023/03/25 17:46:27 INFO: currently 75.1 sites/s, 77.5 sites/s since start
    ^C2023/03/25 17:46:28 INFO: about to exit, reading remaining answers
    2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start)
    2023/03/25 17:46:38 SUMMARY: 1609/2399 connected, 4 sites configured Onion-Location
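
The `NOTICE` line reports the absolute line number (here 2089 + 2398 = 4487),
so it can be passed to `-n` as-is on the next run.  A minimal sketch for
pulling that number out of saved output follows; `last_line_read` is a
hypothetical helper name, and it assumes the log lines were captured.

```shell
# Hypothetical helper, not part of onion-grab.

# Print the absolute line number from the last "only read up until line N"
# notice found on stdin.
last_line_read() {
    sed -n 's/.*NOTICE: only read up until line \([0-9][0-9]*\).*/\1/p' |
        tail -n 1
}

# Example input copied from the run above.
echo '2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start)' |
    last_line_read   # prints 4487
```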

## Known issues

### Too many parallel workers

Here is what happens when the local system cannot handle the number of workers:

    $ onion-grab -i top-1m.lst -w 1000 -m 15s -C 60s -c rgdd.se
    2023/03/25 17:47:36 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:47:36 INFO: starting checker
    2023/03/25 17:47:36 INFO: starting 1000 workers
    2023/03/25 17:47:36 INFO: starting work aggregator
    2023/03/25 17:47:36 INFO: generating work
    2023/03/25 17:47:36 Transport: unhandled response frame type *http.http2UnknownFrame
    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
    2023/03/25 17:47:37 Transport: unhandled response frame type *http.http2UnknownFrame
    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
    2023/03/25 17:47:51 INFO: currently 151.3 sites/s, 151.3 sites/s since start
    2023/03/25 17:48:06 INFO: currently 78.1 sites/s, 114.7 sites/s since start
    2023/03/25 17:48:21 INFO: currently 121.9 sites/s, 117.1 sites/s since start
    2023/03/25 17:48:36 INFO: currently 78.1 sites/s, 107.4 sites/s since start
    2023/03/25 17:48:46 ERROR: checker expected onion for {Domain:rgdd.se OK:false HTTP: HTML:}
    2023/03/25 17:48:46 NOTICE: only read up until line 7442
    2023/03/25 17:48:46 INFO: about to exit, reading remaining answers
    2023/03/25 17:48:56 SUMMARY: 232/7442 connected, 2 sites configured Onion-Location

On a Debian system, it appears that all future HTTP GET requests made by
`onion-grab` will fail if a worker overload happens.  The exact cause is
unclear.  Other programs may be affected too, e.g., `curl` and `Firefox`.

To get back into a normal state, try:

    # systemctl restart systemd-resolved

**Note:** domains with Onion-Location are likely to be missed if `-n 7442` is
used in a subsequent run, because requests may have started failing well before
the checker noticed.  For example, with `-C 60s` and an average of 100
domains/s, it would be wise to roll back _at least_ 6000 lines.
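
The roll-back arithmetic above can be written down as a small helper.  This is
only a sketch: `safe_resume_line` is a hypothetical name, the rate must be
given as a whole number (round up), and the result is a lower bound on how far
to roll back rather than a guarantee.

```shell
# Hypothetical helper, not part of onion-grab.

# Usage: safe_resume_line LAST_LINE CHECK_INTERVAL_SECONDS DOMAINS_PER_SECOND
# Prints a value for -n that rolls back one full checker interval of work,
# never going below line 0.
safe_resume_line() {
    n=$(( $1 - $2 * $3 ))
    if [ "$n" -lt 0 ]; then
        n=0
    fi
    echo "$n"
}

# Line 7442, checker every 60s, about 100 domains/s: 7442 - 60*100.
safe_resume_line 7442 60 100   # prints 1442
```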

Get in touch if you know a fix, e.g., based on `ulimit` and `sysctl` tinkering.

## Contact

  - rasmus (at) rgdd (dot) se