# onion-grab

A tool that visits a list of domains over HTTPS to see if they have
[Onion-Location][] configured.

[Onion-Location]: https://community.torproject.org/onion-services/advanced/onion-location/

**Warning:** research prototype. The source code may also be moved.

## Quickstart

You will need a Go compiler on the local system:

    $ which go >/dev/null || echo "Go compiler is not in PATH"

Install `onion-grab`:

    $ go install git.cs.kau.se/rasmoste/onion-grab@latest

List all options:

    $ onion-grab -h

### Basic usage

Store domains in a file; one domain per line:

    $ cat domains.lst
    www.eff.org
    www.qubes-os.org
    www.torproject.org
    $ onion-grab -i domains.lst
    2023/03/25 17:43:30 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:43:30 INFO: starting 2 workers
    2023/03/25 17:43:30 INFO: starting work aggregator
    2023/03/25 17:43:30 INFO: generating work
    www.qubes-os.org header= attribute=http://qubesosfasa4zl44o4tws22di6kepyzfeqv3tg4e3ztknltfxqrymdad.onion/
    www.torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
    2023/03/25 17:43:40 INFO: about to exit, reading remaining answers
    2023/03/25 17:43:50 SUMMARY: 3/3 connected, 2 sites configured Onion-Location

Sites with Onion-Location are printed to stdout, here showing that
`www.torproject.org` configures it with an HTTP header while `www.qubes-os.org`
does it with an HTML attribute. All three sites connected successfully.

### Working with a larger list

The example below uses the [Tranco top-1M][] list with 100 workers (`-w`).
Metrics are printed every 15s (`-m`), and every 60s (`-C`) a sanity check is
run against a site (`-c`) that configures Onion-Location and should be
reachable.

    $ cut -d ',' -f2 top-1m.csv > top-1m.lst
    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se
    2023/03/25 17:44:20 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:44:20 INFO: starting checker
    2023/03/25 17:44:20 INFO: starting 100 workers
    2023/03/25 17:44:20 INFO: starting work aggregator
    2023/03/25 17:44:20 INFO: generating work
    nytimes.com header=https://www.nytimesn7cgmftshazwhfgzm37qxb44r64ytbb2dj3x62d2lljsciiyd.onion/ attribute=
    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
    theguardian.com header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
    2023/03/25 17:44:22 Transport: unhandled response frame type *http.http2UnknownFrame
    2023/03/25 17:44:31 Transport: unhandled response frame type *http.http2UnknownFrame
    dw.com header=https://www.dwnewsgngmhlplxy6o2twtfgjnrnjxbegbwqx6wnotdhkzt562tszfid.onion/ attribute=
    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
    2023/03/25 17:44:35 INFO: currently 72.3 sites/s, 72.3 sites/s since start
    guardian.co.uk header=https://www.guardian2zotagl6tmjucg3lrhxdk4dw3lhbqnkvvkywawy3oqfoprid.onion/international attribute=
    proton.me header=https://protonmailrmez3lotccipshtkleegetolb73fuirgj7r4o4vfu7ozyd.onion/ attribute=
    voanews.com header=https://www.voanews5aitmne6gs2btokcacixclgfl43cv27sirgbauyyjylwpdtqd.onion/ attribute=
    2023/03/25 17:44:50 INFO: currently 64.3 sites/s, 68.3 sites/s since start
    ^C2023/03/25 17:44:51 INFO: about to exit, reading remaining answers
    2023/03/25 17:44:51 NOTICE: only read up until line 2089
    2023/03/25 17:45:01 SUMMARY: 1488/2089 connected, 8 sites configured Onion-Location

[Tranco top-1M]: https://tranco-list.eu/latest_list

Note that `ctrl+C` can be used to exit early as shown above. To continue from
where you left off (line `2089`), specify the `-n` option on the next run:

    $ onion-grab -i top-1m.lst -w 100 -m 15s -C 60s -c rgdd.se -n 2089
    2023/03/25 17:45:57 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:45:57 INFO: starting checker
    2023/03/25 17:45:57 INFO: starting 100 workers
    2023/03/25 17:45:57 INFO: starting work aggregator
    2023/03/25 17:45:57 INFO: generating work
    cia.gov header= attribute=http://ciadotgov4sjwlzihbbgxnqg3xiyrg7so2r2o3lt5wz5ypk4sxyjstad.onion
    2023/03/25 17:46:12 INFO: currently 79.9 sites/s, 79.9 sites/s since start
    propublica.org header=http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion/ attribute=
    theintercept.com header=https://54dus3ggt7uxz7wjvhkia2ntxmz5lkhbvgohrwur43trt3d6vrcvfmqd.onion/ attribute=
    torproject.org header=http://2gzyxa5ihm7nsggfxnu52rck2vv4rvmdlkiu3zzui5du4xyclen53wid.onion/index.html attribute=
    2023/03/25 17:46:27 INFO: currently 75.1 sites/s, 77.5 sites/s since start
    ^C2023/03/25 17:46:28 INFO: about to exit, reading remaining answers
    2023/03/25 17:46:28 NOTICE: only read up until line 4487 (line 2398 relative to start)
    2023/03/25 17:46:38 SUMMARY: 1609/2399 connected, 8 sites configured Onion-Location

## Known issues

### Too many parallel workers

Here's what would happen if the local system cannot handle the number of workers:

    $ onion-grab -i top-1m.lst -w 1000 -m 15s -C 60s -c rgdd.se
    2023/03/25 17:47:36 INFO: starting await handler, ctrl+C to exit
    2023/03/25 17:47:36 INFO: starting checker
    2023/03/25 17:47:36 INFO: starting 1000 workers
    2023/03/25 17:47:36 INFO: starting work aggregator
    2023/03/25 17:47:36 INFO: generating work
    2023/03/25 17:47:36 Transport: unhandled response frame type *http.http2UnknownFrame
    twitter.com header= attribute=https://twitter3e4tixl4xyajtrzo62zg5vztmjuricljdp2c5kshju4avyoid.onion/
    2023/03/25 17:47:37 Transport: unhandled response frame type *http.http2UnknownFrame
    brave.com header=https://brave4u7jddbv7cyviptqjc7jusxh72uik7zt6adtckl5f4nwy2v72qd.onion/index.html attribute=
    2023/03/25 17:47:51 INFO: currently 151.3 sites/s, 151.3 sites/s since start
    2023/03/25 17:48:06 INFO: currently 78.1 sites/s, 114.7 sites/s since start
    2023/03/25 17:48:21 INFO: currently 121.9 sites/s, 117.1 sites/s since start
    2023/03/25 17:48:36 INFO: currently 78.1 sites/s, 107.4 sites/s since start
    2023/03/25 17:48:46 ERROR: checker expected onion for {Domain:rgdd.se OK:false HTTP: HTML:}
    2023/03/25 17:48:46 NOTICE: only read up until line 7442
    2023/03/25 17:48:46 INFO: about to exit, reading remaining answers
    2023/03/25 17:48:56 SUMMARY: 232/7442 connected, 2 sites configured Onion-Location

This is most likely an OS problem, not an onion-grab problem. Debug hints:

  - Stop and disable `systemd-resolved`, then specify a recursive resolver that
    can handle the expected load.
  - You may need to tinker with kernel tunables; see `ulimit -a` and `sysctl -a`
    for what can be configured. For example, if you find that the error is
    caused by too many open files, try increasing the value of `ulimit -n`.

**Credit:** Björn Töpel helped debug this issue.

**Note:** domains with Onion-Location are likely to be missed if `-n 7442` is
used here in a subsequent run, because connections may have been failing for
up to one check interval before the error was detected. For example, with
`-C 60s` and an average of 100 domains/s, it would be wise to roll back
_at least_ 6000 lines. This should be a last-resort option, and is mainly here
to sanity-check long measurements.

## Contact

  - rasmus (at) rgdd (dot) se

## Licence

BSD 2-Clause License