
onion-grab dataset

This document describes our onion-grab data collection: information about the local systems and a timeline of our operations leading up to the results for the Tranco top-1m list and for SANs in CT logs during April 2023.

Summary

The time to conduct the initial tests against the Tranco top-1m list was ~1 day. 207 unique two-label .onion domains were found across 285 sites that set Onion-Location.

The time to conduct the full measurement for SANs in CT logs was ~10 days. 3330 unique two-label .onion domains were found across 26937 unique sites that set Onion-Location. 13956 of those "unique sites" had the same Onion-Location configuration as Twitter. At first this was surprising, but it was eventually explained by onion-grab following redirects without attributing the configured Onion-Location to the redirect destination, see [20] towards the bottom of the timeline.

The collected data sets are available here:

For further information about system configurations and operations, read on.

Local systems

We have three mostly identical Ubuntu VMs:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

VM-1 is configured with 62.9GiB RAM, one CPU core with 32 CPU threads, and a ~2TiB SSD:

$ grep MemTotal /proc/meminfo
MemTotal:       65948412 kB
$ grep -c processor /proc/cpuinfo
32
$ grep 'cpu cores' /proc/cpuinfo | uniq
cpu cores       : 1
$ df -BG /home
Filesystem                        1G-blocks  Used Available Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv     2077G  220G     1772G  12% /

VM-2 and VM-3 are configured with 62.9GiB RAM, one CPU core with 16 CPU threads, and a ~60GiB SSD (each):

$ grep MemTotal /proc/meminfo
MemTotal:       65822508 kB
$ grep -c processor /proc/cpuinfo
16
$ grep 'cpu cores' /proc/cpuinfo | uniq
cpu cores       : 1
$ df -BG /home
Filesystem                        1G-blocks  Used Available Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv       61G   11G       48G  18% /

These VMs share a 1x10Gbps link with other VMs on the network that we have no control over. We installed vnstat to track our bandwidth usage over time:

# apt install vnstat
# systemctl enable vnstat.service
# systemctl start vnstat.service

We also installed Go version 1.20, see install instructions:

$ go version
go version go1.20.2 linux/amd64

We stopped and disabled systemd-resolved, populating /etc/resolv.conf with

$ cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4

which gives us a setup that supports 1500 DNS look-ups per second per VM.
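The resolver switch-over amounts to a few commands. This sketch writes to a scratch path so it can be tried without root; on the VMs the target is /etc/resolv.conf, after first running systemctl disable --now systemd-resolved and removing the stub symlink:

```shell
# Sketch of the resolver setup. On the real VMs, /etc/resolv.conf is usually a
# symlink owned by systemd-resolved, so stop/disable that service and remove
# the symlink before writing the file.
resolv=./resolv.conf.demo            # on the VMs: /etc/resolv.conf
printf 'nameserver 8.8.8.8\nnameserver 8.8.4.4\n' > "$resolv"
cat "$resolv"
```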

We set

$ ulimit -Sn 100000
# sysctl -w net.ipv4.ip_local_port_range="1024 65535"

before running onion-grab. The complete outputs of ulimit -a and sysctl -a are available in our dataset. The versions of onion-grab are listed below.

Finally, we installed Mullvad VPN so that our onion-grab measurements can run from Melbourne (VM-1), New York (VM-2), and Stockholm (VM-3). Remember to set the same DNS resolvers as above (mullvad dns set custom 8.8.8.8 8.8.4.4).

In the full measurement, we had to replace Stockholm with Frankfurt (see notes).

Timeline

date time (UTC) event notes
2023/04/02 23:26:27 test run with tranco top-1m to estimate reasonable repetition count [1]
2023/04/03 12:47:43 test run with tranco top-1m to estimate reasonable repetition count [1]
2023/04/03 17:20:00 shuffle ct-sans dataset deterministic per-VM seed, 15m/shuffle [2]
2023/04/03 18:18:47 test run with tranco top-1m to estimate reasonable repetition count [1]
2023/04/03 20:03 transfer shuffled dataset from VM-1 to VM-2 (1-3MB/s, painfully slow)
2023/04/03 20:03 transfer shuffled dataset from VM-1 to VM-3 (1-3MB/s, painfully slow)
2023/04/03 22:36:06 start onion-grab (au mel) checkout v0.0.2, set measure.sh params [3]
2023/04/03 22:35:36 start onion-grab (us nyc) checkout v0.0.2, set measure.sh params [4]
2023/04/03 22:35:38 start onion-grab (se sto) checkout v0.0.2, set measure.sh params [5]
2023/04/04 15:30 se sto relay bw drop store vnstat -h stats w/ daily cron job [6]
2023/04/05 06:30 kill onion-grab (se sto) all Stockholm relays are very slow [7]
2023/04/05 07:02:13 start onion-grab (de fra) all Swedish relays are very slow [8]
2023/04/11 04:26:26 us nyc completed minor exit bug [9]
2023/04/11 04:30:28 au mel completed minor exit bug [9]
2023/04/11 20:25:50 de fra stopped ran out of memory for unknown reason [10]
2023/04/11 22:36:25 de fra started again use start line we know is processed [10,11]
2023/04/11 23:43:19 de fra stopped ran out of memory for unknown reason [12]
2023/04/12 08:42:30 de fra started again use start line we know is processed [12,13]
2023/04/12 11:50 prepare dataset (au mel) only moving files on VM-1 [14]
2023/04/12 14:00 prepare dataset (us nyc) moving files on VM-2, transfer to VM-1 [15]
2023/04/12 16:50 prepare dataset (se sto) moving files on VM-3, transfer to VM-1 [16]
2023/04/12 17:00 save bandwidths at VM-{1,2} forgot to move them earlier [17]
2023/04/13 00:35:38 de fra completed minor exit bug [18]
2023/04/13 05:40 prepare dataset (de fra) moving files on VM-3, transfer to VM-1 [19]
2023/04/13 05:50 experiment is completed datasets are ready, zipped, and documented
2023/07/06 move source to tpo gitlab git.cs.kau.se/rasmoste is not a stable home
2024/07/16 onion-grab bug report wrt. how redirects are followed [20]

Notes

1

We downloaded Tranco top-1m, permalink Z2XKG (2023-04-03):

$ sha256sum tranco_Z2XKG-1m.csv.zip
3e078a84e9aae7dbaf1207aac000038f1e51e20e8ccc35563da8b175d38a39dd  tranco_Z2XKG-1m.csv.zip 
$ unzip tranco_Z2XKG-1m.csv.zip
$ cut -d',' -f2 top-1m.csv > top-1m.lst

This gives us a list of 1M domains to perform test-runs on. The idea:

  1. Make visits at a wanted rate (1450/s, below the 1500 DNS lookup limit)
  2. Make visits at several slower rates (100/s, ..., 1400/s)
  3. Repeat this from three locations (Stockholm, New York, Melbourne)
  4. Hypothesis: the same number of Onion-Location setups are discovered when running at the most rapid rate from the three locations as when running at a lower rate from the same three locations; and the error rates are roughly the same regardless of whether we use a lower or higher rate.

We used onion-grab's scripts/test.sh to perform the above experiment from VM-1. The link for downloading the data is listed above in the summary. You should see 3 subdirectories with results from 28 different measurements.
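As a sanity check on that count, the per-relay limits (taken from the error-rate tables further down in this note) tally up to 28:

```shell
# Limits per relay, as listed in the error-rate tables; Stockholm got extra
# probes around 1200 and 1450, including two repeated limits.
us="100 500 1000 1200 1400 1450"
au="100 500 1000 1200 1400 1450"
se="100 500 1000 1200 1200 1201 1202 1203 1204 1400 1446 1447 1448 1449 1450 1450"
echo $us $au $se | wc -w    # 6 + 6 + 16 = 28 measurements
```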

Let's look at the results in more detail: the error rates that are printed in the stderr.txt files, as well as the parsed output using scripts/digest.py.

Scan: Stockholm with limit 1450/s

$ digest.py -i 20230402-232627/se17-wireguard-l1450.txt 2>&1 |
tail -n6 | head -n4
digest.py:25 INFO: found 245 HTTP headers with Onion-Location
digest.py:26 INFO: found 42 HTML meta attributes with Onion-Location
digest.py:27 INFO: found 283 unique domain names that set Onion-Location
digest.py:28 INFO: found 205 unique two-label onion addresses in the process

Scan: Stockholm, New York, Melbourne with limit 1450/s (combined)

$ digest.py -i 20230402-232627/*l1450.txt 2>&1 | tail -n4 | head -n2
digest.py:27 INFO: found 285 unique domain names that set Onion-Location
digest.py:28 INFO: found 207 unique two-label onion addresses in the process

Note that we found more Onion-Location setups here with the combined scan.

Scan: Stockholm, New York, Melbourne with limits 100, 500, 1450 (combined)

$ cat 20230402-232627/stderr.txt | tail -n5 | head -n2
digest.py:27 INFO: found 285 unique domain names that set Onion-Location
digest.py:28 INFO: found 207 unique two-label onion addresses in the process

Note that we did not find more Onion-Location setups with these 9 measurements combined. This observation holds when scripts/digest.py is run on all 28 outputs:

$ ./scripts/digest.py -i\
          20230402-232627/*-*-*\
          20230403-124743/*-*-*\
          20230403-181847/*-*-* 2>&1 | tail -n4 | head -n2
digest.py:27 INFO: found 285 unique domain names that set Onion-Location
digest.py:28 INFO: found 207 unique two-label onion addresses in the process

Error rates

Below is some pretty-printed output of the error rates shown in the respective stderr.txt files, ordered by the relay and limit that we set. The maximum number of connects is 1M; all columns after "connected" provide info about failed connection attempts. E.g., the first row has 82814 DNS lookup errors.

relay hostname limit/s rate/s onions connected dns (NotFound/Timeout/Other) tcp (Timeout/Syscall) tls (Cert/Other) 3xx eof ctx ???
us18-wireguard 100 100.0 287 711816 82814 (72767 843 9204) 51543 (21279 30264) 87147 (77235 9912) 2042 5449 58932 257
us18-wireguard 500 500.3 285 711373 83333 (72811 1304 9218) 54058 (24064 29994) 86728 (76803 9925) 2160 5414 56689 245
us18-wireguard 1000 1001.0 286 711081 82882 (72804 852 9226) 54763 (24599 30164) 86840 (77011 9829) 1760 5086 57333 255
us18-wireguard 1200 1201.5 286 711741 82841 (72800 855 9186) 53041 (22654 30387) 86885 (77111 9774) 1803 4955 58485 249
us18-wireguard 1400 1402.1 287 710481 82894 (72805 1468 8621) 59711 (29489 30222) 86597 (76897 9700) 1638 4975 53450 254
us18-wireguard 1450 1452.2 287 708649 82866 (72820 1272 8774) 60294 (30460 29834) 86506 (76602 9904) 1887 5233 54298 267
relay hostname limit/s rate/s onions connected dns (NotFound/Timeout/Other) tcp (Timeout/Syscall) tls (Cert/Other) 3xx eof ctx ???
au-syd-wg-002 100 100.0 285 723854 83319 (72800 1317 9202) 48693 (14767 33926) 91658 (81324 10334) 1810 5235 45149 282
au-syd-wg-002 500 500.3 285 723410 83119 (72791 1119 9209) 51229 (16767 34462) 91585 (81208 10377) 1830 4680 43876 271
au-syd-wg-002 1000 1001.0 285 724144 83052 (72771 1075 9206) 50697 (16591 34106) 91678 (81442 10236) 1491 4922 43733 283
au-syd-wg-002 1200 1192.3 286 723169 83090 (72820 1122 9148) 51408 (16685 34723) 91571 (81354 10217) 1413 5024 44052 273
au-syd-wg-002 1400 1391.8 286 721119 83305 (72796 1906 8603) 55236 (21640 33596) 91339 (81197 10142) 842 5752 42124 283
au-syd-wg-002 1450 1431.3 285 720439 83182 (72793 1498 8891) 56817 (23193 33624) 91376 (81049 10327) 1100 5486 41334 266
relay hostname limit/s rate/s onions connected dns (NotFound/Timeout/Other) tcp (Timeout/Syscall) tls (Cert/Other) 3xx eof ctx ???
se17-wireguard 100 100.0 286 724643 83146 (72400 954 9792) 48497 (14711 33786) 92230 (81881 10349) 2081 5815 43325 263
se17-wireguard 500 500.3 288 723176 84208 (72453 1367 10388) 48685 (15239 33446) 91664 (81341 10323) 2073 5513 44416 265
se17-wireguard 1000 1001.0 289 723834 83156 (72427 962 9767) 49559 (16347 33212) 91847 (81572 10275) 1852 5638 43856 258
se17-wireguard 1200 1201.5 289 724093 83078 (72450 905 9723) 48780 (15597 33183) 91868 (81656 10212) 1823 5708 44389 261
se17-wireguard 1200 1201.5 289 723788 83081 (72397 950 9734) 49070 (15848 33222) 91745 (81595 10150) 1790 5670 44589 267
se17-wireguard 1201 1202.5 288 723642 83063 (72413 909 9741) 48923 (15769 33154) 92120 (81575 10545) 1823 5322 44839 268
se17-wireguard 1202 1202.1 290 723846 83055 (72452 912 9691) 48999 (15916 33083) 91860 (81519 10341) 1813 5497 44669 261
se17-wireguard 1203 1204.5 289 723772 83051 (72479 882 9690) 48926 (15775 33151) 91945 (81630 10315) 1825 5502 44716 263
se17-wireguard 1204 1205.5 290 723816 83109 (72462 902 9745) 49256 (16161 33095) 92015 (81551 10464) 1762 5364 44420 258
se17-wireguard 1400 1402.1 288 721902 83808 (72426 1341 10041) 51820 (18732 33088) 91409 (81308 10101) 1727 5725 43345 264
se17-wireguard 1446 1448.2 290 720637 83037 (72463 924 9650) 49421 (16422 32999) 91416 (81132 10284) 1801 5517 47903 268
se17-wireguard 1447 1449.2 286 720927 83038 (72480 930 9628) 49361 (16463 32898) 91630 (81243 10387) 1807 5399 47580 258
se17-wireguard 1448 1450.2 288 720841 83016 (72492 933 9591) 49251 (16209 33042) 91636 (81236 10400) 1803 5410 47783 260
se17-wireguard 1449 1449.4 288 720456 83065 (72459 922 9684) 49513 (16554 32959) 91479 (81171 10308) 1786 5459 47981 261
se17-wireguard 1450 1450.3 288 720684 83036 (72476 915 9645) 49348 (16266 33082) 91608 (81238 10370) 1734 5404 47932 254
se17-wireguard 1450 1450.0 287 719193 83193 (72428 1319 9446) 53567 (20562 33005) 91390 (81135 10255) 1956 5775 44641 285

From the looks of it, the number of successful connections decreases somewhat as we approach the 1450/s limit. Comparing the most and least successful runs with regard to the number of connects, we get per location:

  • Melbourne: 3705
  • New York: 3167
  • Stockholm: 5450

These differences are mostly due to more TCP timeouts and context deadlines.
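These deltas can be recomputed directly from the "connected" column in the tables above (best run minus worst run per relay):

```shell
# Max minus min of the "connected" column per relay, from the tables above.
mel=$((724144 - 720439))   # au-syd-wg-002: 1000/s run vs 1450/s run
nyc=$((711816 - 708649))   # us18-wireguard: 100/s run vs 1450/s run
sto=$((724643 - 719193))   # se17-wireguard: 100/s run vs second 1450/s run
echo "mel=$mel nyc=$nyc sto=$sto"    # mel=3705 nyc=3167 sto=5450
```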

What does this mean?

Running from three different locations at limit 1450/s finds the same number of Onion-Location setups as all 28 measurements combined. That's what we wanted.

Connect errors (mainly TCP timeouts and context deadline errors) increase slightly as we use the higher limits. This is not what we wanted. However, the increase in connect errors per 1M sites is only 0.3-0.5%. These errors are transient, and should mostly be compensated for by each domain being tried 3x (once per location).

(Each scan is running with a shuffled list, similar to our full measurement.)

Conclusion: scanning from three different locations at limit 1450/s strikes a good balance between found Onion-Locations, errors, and timeliness of results.

2

The ct-sans dataset that we will onion-grab in the full measurement was collected and assembled on 2023-04-03. It contains 0.91B unique SANs.

To avoid biases, like encountering the same errors at all VMs due to the order in which the sites are visited, the dataset is shuffled separately for each VM before use.

We did all shuffling on VM-1 because it has the most disk available.

Prepare shuffled dataset for VM-1:

$ seed="2023-04-03-vm-1"
$ time shuf\
      --random-source <(openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt </dev/zero 2>/dev/null)\
  -o vm-1.lst 2023-04-03-ct-sans/sans.lst

real    13m40.637s
user    10m30.368s
sys     2m28.062s
$ time sha256sum vm-1.lst
4bf4b2701e01dd7886757987a2a9f2750aff677c2bd9f3e28d6ca8a1b7c25a3b  vm-1.lst

real    2m51.630s
user    2m33.246s
sys     0m11.460s

Prepare shuffled dataset for VM-2:

$ seed="2023-04-03-vm-2"
$ time shuf\
      --random-source <(openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt </dev/zero 2>/dev/null)\
  -o vm-2.lst 2023-04-03-ct-sans/sans.lst

real    14m35.500s
user    11m31.577s
sys     2m31.447s
$ time sha256sum vm-2.lst
46f6c4af1e215f2d2cfb3ee302c8e3d02f43d4d918eb42f300a818e68f73f7ff  vm-2.lst

real    3m7.084s
user    2m36.416s
sys     0m19.012s

Prepare shuffled dataset for VM-3:

$ seed="2023-04-03-vm-3"
$ time shuf\
      --random-source <(openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt </dev/zero 2>/dev/null)\
  -o vm-3.lst 2023-04-03-ct-sans/sans.lst

real    14m37.878s
user    11m37.963s
sys     2m20.373s
$ time sha256sum vm-3.lst
c2df53320c1e7ab21355c9ebc1e53b1a8f564c9e7a2bd3e24f2cc8fca8b9eaf6  vm-3.lst

real    3m6.324s
user    2m36.804s
sys     0m17.056s

Double-check that we have the expected number of lines:

$ time wc -l vm-?.lst 2023-04-03-ct-sans/sans.lst
   907332515 vm-1.lst
   907332515 vm-2.lst
   907332515 vm-3.lst
   907332515 2023-04-03-ct-sans/sans.lst
  3629330060 total

real    7m54.915s
user    0m59.213s
sys     1m25.353s

Note: shuf is memory-hungry and needs ~2x the size of the input file. So, anything less than ~60GiB memory will be insufficient for a 25GiB dataset.
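Because the keystream produced by openssl enc with a fixed passphrase is deterministic, the same seed always yields the same permutation. A quick way to convince yourself on a toy input (assuming GNU shuf and openssl are installed):

```shell
# Generate a reusable chunk of the seed-derived keystream, then shuffle the
# same toy list twice with it; identical output demonstrates determinism.
seed="2023-04-03-vm-1"
openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt </dev/zero 2>/dev/null \
    | head -c 1048576 > seed.bin
printf 'a\nb\nc\nd\ne\n' > tiny.lst
shuf --random-source=seed.bin -o run1.lst tiny.lst
shuf --random-source=seed.bin -o run2.lst tiny.lst
cmp -s run1.lst run2.lst && echo "deterministic"
```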

3

$ ulimit -Sn 100000
$ ulimit -a >ulimit.txt
# sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# sysctl -a >sysctl.txt
$ go install git.cs.kau.se/rasmoste/onion-grab@v0.0.2
$ git log | head -n1
commit abce43c4ad9000e0c5c83d31c2185986ab8a54c9
$ cd scripts
$ sha256sum vm-1.lst
4bf4b2701e01dd7886757987a2a9f2750aff677c2bd9f3e28d6ca8a1b7c25a3b  vm-1.lst
$ git diff
diff --git a/scripts/measure.sh b/scripts/measure.sh
index a520c6d..269b5ad 100755
--- a/scripts/measure.sh
+++ b/scripts/measure.sh
@@ -8,11 +8,11 @@
 # lists 8.8.8.8 and 8.8.4.4, output of uname -a and sysctl -a is ..., etc.
 #

-relay_country=se
-relay_city=sto
+relay_country=au
+relay_city=mel
 limit=1450
 num_workers=10000
-input_file=example.lst
+input_file=vm-1.lst
 timeout_s=30
 response_max_mib=64
 metrics_interval=1h

So, we selected Melbourne relays.

$ ./measure.sh 2>measure.stderr

4

$ ulimit -Sn 100000
$ ulimit -a >ulimit.txt
# sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# sysctl -a >sysctl.txt
$ go install git.cs.kau.se/rasmoste/onion-grab@v0.0.2
$ git log | head -n1
commit abce43c4ad9000e0c5c83d31c2185986ab8a54c9
$ cd scripts
$ sha256sum vm-2.lst
46f6c4af1e215f2d2cfb3ee302c8e3d02f43d4d918eb42f300a818e68f73f7ff  vm-2.lst
$ git diff
diff --git a/scripts/measure.sh b/scripts/measure.sh
index a520c6d..31b2f9e 100755
--- a/scripts/measure.sh
+++ b/scripts/measure.sh
@@ -8,11 +8,11 @@
 # lists 8.8.8.8 and 8.8.4.4, output of uname -a and sysctl -a is ..., etc.
 #

-relay_country=se
-relay_city=sto
+relay_country=us
+relay_city=nyc
 limit=1450
 num_workers=10000
-input_file=example.lst
+input_file=vm-2.lst
 timeout_s=30
 response_max_mib=64
 metrics_interval=1h

So, we selected New York relays.

$ ./measure.sh 2>measure.stderr

5

$ ulimit -Sn 100000
$ ulimit -a >ulimit.txt
# sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# sysctl -a >sysctl.txt
$ go install git.cs.kau.se/rasmoste/onion-grab@v0.0.2
$ git log | head -n1
commit abce43c4ad9000e0c5c83d31c2185986ab8a54c9
$ cd scripts
$ sha256sum vm-3.lst
c2df53320c1e7ab21355c9ebc1e53b1a8f564c9e7a2bd3e24f2cc8fca8b9eaf6  vm-3.lst
$ git diff
diff --git a/scripts/measure.sh b/scripts/measure.sh
index a520c6d..4cc0913 100755
--- a/scripts/measure.sh
+++ b/scripts/measure.sh
@@ -12,7 +12,7 @@ relay_country=se
 relay_city=sto
 limit=1450
 num_workers=10000
-input_file=example.lst
+input_file=vm-3.lst
 timeout_s=30
 response_max_mib=64
 metrics_interval=1h

So, we selected Stockholm relays (default).

$ ./measure.sh 2>measure.stderr

6

Notice that Stockholm relays are "slow". Bandwidth appears to have dropped to 1/10 of what it was during the initial part of the measurement. It is unclear yet whether there are more errors, and whether this will sort itself out. We are adding a cron job that stores hourly bandwidth stats every day at 23:59 for more fine-grained data:

$ mkdir /home/rasmoste/vnstat
$ crontab -e

And add at the end of the file:

59 23 * * * vnstat -h >"/home/rasmoste/vnstat/$(date)"

(Added this on all three VMs.)

7

(In VM-3)

Bandwidth stats:

$ cat Tue\ Apr\ \ 4\ 11\:59\:01\ PM\ UTC\ 2023

 ens160  /  hourly

         hour        rx      |     tx      |    total    |   avg. rate
     ------------------------+-------------+-------------+---------------
     2023-04-04
         00:00     82.61 GiB |   12.78 GiB |   95.39 GiB |  227.61 Mbit/s
         01:00     80.93 GiB |   12.70 GiB |   93.63 GiB |  223.41 Mbit/s
         02:00     80.90 GiB |   12.68 GiB |   93.58 GiB |  223.30 Mbit/s
         03:00     81.13 GiB |   12.63 GiB |   93.77 GiB |  223.74 Mbit/s
         04:00     88.59 GiB |   12.97 GiB |  101.57 GiB |  242.35 Mbit/s
         05:00     85.10 GiB |   12.93 GiB |   98.04 GiB |  233.92 Mbit/s
         06:00     82.97 GiB |   12.84 GiB |   95.81 GiB |  228.61 Mbit/s
         07:00     79.05 GiB |   12.62 GiB |   91.67 GiB |  218.72 Mbit/s
         08:00     87.83 GiB |   12.81 GiB |  100.64 GiB |  240.13 Mbit/s
         09:00     81.22 GiB |   12.62 GiB |   93.84 GiB |  223.91 Mbit/s
         10:00     79.26 GiB |   12.57 GiB |   91.83 GiB |  219.12 Mbit/s
         11:00     81.70 GiB |   12.67 GiB |   94.37 GiB |  225.17 Mbit/s
         12:00     97.83 GiB |   13.21 GiB |  111.04 GiB |  264.94 Mbit/s
         13:00     82.47 GiB |   12.59 GiB |   95.06 GiB |  226.83 Mbit/s
         14:00     78.42 GiB |   11.46 GiB |   89.88 GiB |  214.45 Mbit/s
         15:00     27.42 GiB |    5.95 GiB |   33.37 GiB |   79.62 Mbit/s
         16:00     23.30 GiB |    5.37 GiB |   28.67 GiB |   68.42 Mbit/s
         17:00     28.12 GiB |    6.03 GiB |   34.15 GiB |   81.48 Mbit/s
         18:00     48.01 GiB |    8.76 GiB |   56.77 GiB |  135.46 Mbit/s
         19:00     40.23 GiB |    7.73 GiB |   47.97 GiB |  114.46 Mbit/s
         20:00     55.55 GiB |    9.63 GiB |   65.18 GiB |  155.52 Mbit/s
         21:00     35.10 GiB |    7.06 GiB |   42.16 GiB |  100.60 Mbit/s
         22:00     20.94 GiB |    5.00 GiB |   25.94 GiB |   61.91 Mbit/s
         23:00     21.19 GiB |    4.95 GiB |   26.14 GiB |   68.03 Mbit/s
     ------------------------+-------------+-------------+---------------

We were hoping that this was a transient error, but all relays in Stockholm appear to underperform. The rate has dropped as a result, and so has the number of successes. See separate data and log files in our dataset (se-sto/).

It will be faster, and give more accurate results, to start from a new location.

Kill: pidof onion-grab, kill <PID>.

Move measure.stderr to the data dir to not overwrite it when we restart.

8

(In VM-3.)

We experienced the same "slowness" with both Gothenburg and Malmö relays. After moving our measurement to Frankfurt, we observed good bandwidth again.

diff --git a/scripts/measure.sh b/scripts/measure.sh
index a520c6d..d46f9c1 100755
--- a/scripts/measure.sh
+++ b/scripts/measure.sh
@@ -8,11 +8,11 @@
 # lists 8.8.8.8 and 8.8.4.4, output of uname -a and sysctl -a is ..., etc.
 #

-relay_country=se
-relay_city=sto
+relay_country=de
+relay_city=fra
 limit=1450
 num_workers=10000
-input_file=example.lst
+input_file=vm-3.lst
 timeout_s=30
 response_max_mib=64
 metrics_interval=1h

So, we selected Frankfurt relays.

Without any other restarts in the same tmux pane as before:

$ ./measure.sh 2>measure.stderr

9

The summary prints (which mean that the Go receiver routine waited for an answer for at least one timeout and then shut down) are shown in onion-grab's stderr output. However, onion-grab hangs after that, so the measure.sh script doesn't exit.

  • VM-1 (au mel) processed up until: 907330676
  • VM-2 (us nyc) processed up until: 907330662

To be compared with the number of entries in the ct-sans dataset: 907332515.

$ python3 -c "print(f'{907332515 - 907330676}')"
1839
$ python3 -c "print(f'{907332515 - 907330662}')"
1853

So, it appears that ~1800 workers were unable to deliver their final answers (most likely timeouts) before the receiver routine shut down. This also explains why onion-grab hangs: there are still workers waiting to send their answers to a receiver that is no longer reading.

Note also that, in addition to the outstanding answers most likely being timeouts, they are not the same ~1800 answers on all machines, since each VM's dataset was shuffled separately.

Action: ctrl+C the measurement script that is waiting for onion-grab to complete; we already have the onion-grab output that we want stored on disk.

10

The latest onion-grab stderr print was at 2023/04/11 20:25:50; onion-grab then died due to running out of memory. The latest progress print was:

2023/04/11 20:02:33 INFO: metrics@receiver:

  Processed: 819368251

So, we can safely continue, without missing any sites that have Onion-Location configured, by starting a new measurement from line ~819368251.

$ python3 -c "print(f'{907332515 - 819368251}')"
87964264
$ tail -n87964264 vm-3.lst > vm-3-remaining.lst
$ wc -l vm-3-remaining.lst
87964264 vm-3-remaining.lst
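The same tail-based resume arithmetic is needed again later, so it can be captured as a small reusable helper (hypothetical, not part of onion-grab): keep the last total-minus-processed lines of the input.

```shell
# resume <input.lst> <processed-count> <output.lst>: write the not-yet-processed
# tail of the input. Re-processing a few already-finished lines is harmless, so
# rounding the progress counter down is safe.
resume() {
  total=$(wc -l < "$1")
  tail -n "$(( total - $2 ))" "$1" > "$3"
}
```

E.g., resume vm-3.lst 819368251 vm-3-remaining.lst reproduces the commands above.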

11

Restart onion-grab from VM-3 with the final domain names to visit.

$ git diff
diff --git a/scripts/measure.sh b/scripts/measure.sh
index a520c6d..6d77c66 100755
--- a/scripts/measure.sh
+++ b/scripts/measure.sh
@@ -8,11 +8,11 @@
 # lists 8.8.8.8 and 8.8.4.4, output of uname -a and sysctl -a is ..., etc.
 #

-relay_country=se
-relay_city=sto
+relay_country=de
+relay_city=fra
 limit=1450
 num_workers=10000
-input_file=example.lst
+input_file=vm-3-remaining.lst
 timeout_s=30
 response_max_mib=64
 metrics_interval=1h
$ ./measure.sh 2>measure-remaining.stderr

(onion-grab results are written to a separate directory that is timestamped, so there is no risk that the above command will overwrite any collected data.)

12

The latest onion-grab stderr print was at 2023/04/11 23:43:19; onion-grab then died due to running out of memory. The latest progress print was:

2023/04/11 23:36:31 INFO: metrics@receiver:

  Processed: 5217381

So, we can safely continue, without missing any sites that have Onion-Location configured, by starting a new measurement from line ~5217381.

$ python3 -c "print(f'{87964264 - 5217381}')"
82746883
$ tail -n82746883 vm-3-remaining.lst > vm-3-remaining-2.lst
$ wc -l vm-3-remaining-2.lst
82746883 vm-3-remaining-2.lst

13

Restart onion-grab from VM-3 with the final domain names to visit, this time reducing the number of workers to see if that keeps the memory usage in check. If this doesn't work, we will have to bump the amount of memory in our VM.

(A large number of workers is not necessary anyway when latency is low.)

$ git diff
diff --git a/scripts/measure.sh b/scripts/measure.sh
index a520c6d..3b2e54b 100755
--- a/scripts/measure.sh
+++ b/scripts/measure.sh
@@ -8,11 +8,11 @@
 # lists 8.8.8.8 and 8.8.4.4, output of uname -a and sysctl -a is ..., etc.
 #

-relay_country=se
-relay_city=sto
+relay_country=de
+relay_city=fra
 limit=1450
-num_workers=10000
-input_file=example.lst
+num_workers=4000
+input_file=vm-3-remaining-2.lst
 timeout_s=30
 response_max_mib=64
 metrics_interval=1h
$ ./measure.sh 2>measure-remaining-2.stderr

14

Renaming and moving output in VM-1:

$ mv data/20230403-223517 au-mel
$ rmdir data 
$ mv au-mel/au-mel-l1450.stderr au-mel/onion-grab.stderr
$ mv au-mel/au-mel-l1450.stdout au-mel/onion-grab.stdout
$ mv sysctl.txt au-mel/
$ mv ulimit.txt au-mel/
$ mv measure.stderr au-mel/
$ ls -l au-mel/
total 6992
-rw-rw-r-- 1 rasmoste rasmoste     800 Apr  3 22:36 measure.stderr
-rw-rw-r-- 1 rasmoste rasmoste 3749490 Apr 11 08:21 onion-grab.stderr
-rw-rw-r-- 1 rasmoste rasmoste 3346026 Apr 11 04:29 onion-grab.stdout
-rw-rw-r-- 1 rasmoste rasmoste   42500 Apr  3 22:11 sysctl.txt
-rw-rw-r-- 1 rasmoste rasmoste     823 Apr  3 22:11 ulimit.txt
$ mv au-mel ~/exp/onion-grab/data/2023-04-03-ct-sans/

15

Renaming and moving output in VM-2:

$ mv data/20230403-223519 us-nyc
$ rmdir data
$ mv us-nyc/us-nyc-l1450.stdout us-nyc/onion-grab.stdout
$ mv us-nyc/us-nyc-l1450.stderr us-nyc/onion-grab.stderr
$ mv sysctl.txt us-nyc/
$ mv ulimit.txt us-nyc/
$ mv measure.stderr us-nyc/
$ ls -l us-nyc
total 6784
-rw-rw-r-- 1 rasmoste rasmoste     800 Apr  3 22:35 measure.stderr
-rw-rw-r-- 1 rasmoste rasmoste 3553624 Apr 11 08:21 onion-grab.stderr
-rw-rw-r-- 1 rasmoste rasmoste 3326545 Apr 11 04:25 onion-grab.stdout
-rw-rw-r-- 1 rasmoste rasmoste   42531 Apr  3 22:12 sysctl.txt
-rw-rw-r-- 1 rasmoste rasmoste     823 Apr  3 22:11 ulimit.txt

Zip and checksum before moving to VM-1:

$ zip -r us-nyc.zip us-nyc/
$ sha256sum us-nyc.zip
8759b8e7192390cc8f125a795c55b55ad9ecadb27344ce88004998ca89b7c4be  us-nyc.zip

Transfer to VM-1, check that checksum is OK then unzip.

16

Renaming and moving output in VM-3:

$ mv data/20230403-223521 se-sto
$ mv se-sto/se-sto-l1450.stderr se-sto/onion-grab.stderr
$ mv se-sto/se-sto-l1450.stdout se-sto/onion-grab.stdout
$ cp ulimit.txt se-sto/
$ cp sysctl.txt se-sto/
$ mkdir se-sto/bw
$ cp ~/vnstat/"Tue Apr  4 11:59:01 PM UTC 2023" se-sto/bw
$ cp ~/vnstat/"Wed Apr  5 11:59:01 PM UTC 2023" se-sto/bw
$ ls -l se-sto
total 912
drwxrwxr-x 2 rasmoste rasmoste   4096 Apr 12 16:55 bw
-rw-rw-r-- 1 rasmoste rasmoste    801 Apr  3 22:35 measure.stderr
-rw-rw-r-- 1 rasmoste rasmoste 441711 Apr  5 06:36 onion-grab.stderr
-rw-rw-r-- 1 rasmoste rasmoste 424925 Apr  5 06:27 onion-grab.stdout
-rw-rw-r-- 1 rasmoste rasmoste  42529 Apr 12 16:54 sysctl.txt
-rw-rw-r-- 1 rasmoste rasmoste    823 Apr 12 16:54 ulimit.txt
$ zip -r se-sto.zip se-sto/
$ sha256sum se-sto.zip
6fcd5640b1022828d19f3585b2a9c9488ce5c681a81a61c22b1bd4cbbe326b49  se-sto.zip

Move to VM-1, check checksum and unzip.

17

VM-1:

$ mv ~/vnstat au-mel/bw

Then stop the cronjob that creates bw output (crontab -e).

VM-2:

$ mv ~/vnstat bw
$ zip -r bw.zip bw/
$ sha256sum bw.zip
c4753326fcdb4dd136af81c1359cfe37fe6756726c497f39d3c33f799fc975f3  bw.zip

Transfer to VM-1, check checksum, unzip and put in us-nyc directory. Then stop the cronjob that creates bw output in VM-2 as well.

18

onion-grab hangs on shutdown, similar to VM-1 and VM-2 [9]. The final summary print shows processing up until 82746708, which should be compared to the 82746883 lines of vm-3-remaining-2.lst. I.e., 175 missing workers/answers.

Same action as in [9], ctrl+C measurement script.

19

Renaming and moving in VM-3, first run:

$ mv data/20230405-070154 de-fra
$ mv de-fra/de-fra-l1450.stderr de-fra/onion-grab.stderr
$ mv de-fra/de-fra-l1450.stdout de-fra/onion-grab.stdout
$ mv measure.stderr de-fra/measure.stderr
$ mv ulimit.txt de-fra/
$ mv sysctl.txt de-fra/

Second run:

$ mv data/20230411-223623/de-fra-l1450.stderr de-fra/onion-grab-2.stderr
$ mv data/20230411-223623/de-fra-l1450.stdout de-fra/onion-grab-2.stdout
$ rmdir data/20230411-223623
$ mv measure-remaining.stderr de-fra/measure-2.stderr

Third run:

$ mv data/20230412-084228/de-fra-l1450.stderr de-fra/onion-grab-3.stderr
$ mv data/20230412-084228/de-fra-l1450.stdout de-fra/onion-grab-3.stdout
$ rmdir data/20230412-084228
$ mv measure-remaining-2.stderr de-fra/measure-3.stderr

Grab bandwidths; exclude the output from April 4 since this measurement started on April 5:

$ rm ~/vnstat/"Tue Apr  4 11:59:01 PM UTC 2023"
$ vnstat -h >"/home/rasmoste/vnstat/$(date)"
$ mv ~/vnstat de-fra/bw

Overview:

$ ls -l de-fra
total 6768
drwxrwxr-x 2 rasmoste rasmoste    4096 Apr 13 05:39 bw
-rw-rw-r-- 1 rasmoste rasmoste    1019 Apr 11 23:43 measure-2.stderr
-rw-rw-r-- 1 rasmoste rasmoste     810 Apr 12 08:42 measure-3.stderr
-rw-rw-r-- 1 rasmoste rasmoste    1009 Apr 11 20:25 measure.stderr
-rw-rw-r-- 1 rasmoste rasmoste   24004 Apr 11 23:43 onion-grab-2.stderr
-rw-rw-r-- 1 rasmoste rasmoste   23002 Apr 11 23:42 onion-grab-2.stdout
-rw-rw-r-- 1 rasmoste rasmoste  318627 Apr 13 05:38 onion-grab-3.stderr
-rw-rw-r-- 1 rasmoste rasmoste  312774 Apr 13 00:34 onion-grab-3.stdout
-rw-rw-r-- 1 rasmoste rasmoste 3117995 Apr 11 20:25 onion-grab.stderr
-rw-rw-r-- 1 rasmoste rasmoste 3034130 Apr 11 20:25 onion-grab.stdout
-rw-rw-r-- 1 rasmoste rasmoste   42529 Apr  3 22:12 sysctl.txt
-rw-rw-r-- 1 rasmoste rasmoste     823 Apr  3 22:11 ulimit.txt

Then stop the cronjob that creates bw outputs (crontab -e).

Zip, checksum, and transfer to VM-1:

$ zip -r de-fra.zip de-fra/
$ sha256sum de-fra.zip
2ea1f053decea3915b29bc60c2f954da55ea48f6d8ab9f47112caddf3a2e2f7f  de-fra.zip

20

Pier found that onion-grab follows redirects without correctly attributing the Onion-Location configuration to the destination it was redirected to, see:

This explained an anomaly where it looked like a lot of sites were, e.g., configuring Twitter/X's Onion-Location when in fact they were redirecting.
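The misattribution can be sketched without any network access; resp and scan below are hypothetical stand-ins (and the domains and onion address are made up), but the attribution logic mirrors the buggy behavior: the original domain gets credited with whatever Onion-Location the final redirect hop serves.

```shell
# resp maps a domain to its (fake) HTTP response: either a redirect or an
# Onion-Location header value.
resp() {
  case "$1" in
    big-site.example)  echo "onion-location: http://bigsiteexampleonion.onion/" ;;
    some-blog.example) echo "REDIRECT big-site.example" ;;
  esac
}

# Buggy scanner: follows redirects, but $1 (the original domain) never changes
# while hops are chased, so the header is attributed to the wrong site.
scan() {
  r=$(resp "$1")
  while [ "${r%% *}" = "REDIRECT" ]; do
    r=$(resp "${r#REDIRECT }")
  done
  echo "$1 sets ${r#onion-location: }"
}

scan some-blog.example    # credits the blog with big-site's onion address
```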

Use scripts/digest2.py to get a cleaner picture of the distribution of sites that use HTTP and HTML for configuring Onion-Location. Since this was found before the camera-ready deadline, we were able to update §4.1.2 accordingly.

To avoid this bug in the future, onion-grab was patched on 2025-05-11 to (still) follow redirects but then associate Onion-Location with the final destination.