

This document describes our ct-sans data collection, including information about the local system and a timeline leading up to assembling the 2023-04-03 dataset.


Downloading the current CT logs initially took about 11 days (March 2023). Assembling the final dataset of 0.91B unique SANs (25.2GiB) took just under 7 hours.

The assembled data set is available here:

  • https://dart.cse.kau.se/ct-sans/2023-04-03-ct-sans.zip

Local system

We're running Ubuntu in a VM:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.2 LTS
Release:        22.04
Codename:       jammy

Our VM is configured with 62.9GiB RAM, one CPU core with 32 CPU threads, and a ~2TiB SSD:

$ grep MemTotal /proc/meminfo
$ grep -c processor /proc/cpuinfo
32
$ grep 'cpu cores' /proc/cpuinfo | uniq
cpu cores       : 1
$ df -BG /home
Filesystem                        1G-blocks  Used Available Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv     2077G  220G     1772G  12% /

This VM shares a 1x10Gbps link with other VMs on the network that we have no control over. We installed vnstat to track our own bandwidth usage over time:

# apt install vnstat
# systemctl enable vnstat.service
# systemctl start vnstat.service

We also installed Go version 1.20, see install instructions:

$ go version
go version go1.20.2 linux/amd64

The versions of git.cs.kau.se/rasmoste/ct-sans@VERSION are listed below.


date       | time (UTC) | event                      | notes
2023/03/18 | 20:05:30   | snapshot and start collect | running v0.0.1, see command notes [1]
2023/03/27 | 14:53:59   | stop collect, bump version | install v0.0.2, see migrate notes [2]
2023/03/27 | 15:03:12   | start collect again        | mainly waiting for Argon2023 now [3]
2023/03/29 | 10:22:24   | collect completed          |
2023/03/29 | 15:46:44   | snapshot and collect again | download backlog from last 10 days
2023/03/30 | 05:52:38   | collect completed          |
2023/03/30 | 08:58:50   | snapshot and collect again | download backlog from last ~16 hours
2023/03/30 | 09:53:34   | collect completed          | bandwidth usage statistics [4]
2023/03/30 | 10:05:40   | start assemble             | still running v0.0.2 [5]
2023/03/30 | 16:06:39   | assemble done              | 0.9B SANs (25GiB, 7GiB zipped in 15m)
2023/04/02 | 23:31:37   | snapshot and collect again | download backlog, again
2023/04/03 | 03:54:18   | collect completed          |
2023/04/03 | 08:52:28   | snapshot and collect again | final before assembling for real use
2023/04/03 | 09:22:22   | collect completed          |
2023/04/03 | 09:30:00   | start assemble             | [5]
2023/04/03 | 16:12:38   | assemble done              | 0.91B SANs (25.2GiB) from 3.74B certs
2024/02/10 | 09:10:20   | snapshot and start collect | still running v0.0.2 [6]
2024/02/12 | 03:54:13   | abort collection           | not needed for our paper contribs [7]
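
The durations implied by the timeline can be recomputed from its timestamps. The following is a minimal sketch; the helper name elapsed is ours and not part of ct-sans. Note that the initial collect interval includes the short stop for the v0.0.2 migration.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S"  # timestamp layout used in the timeline (UTC)


def elapsed(start: str, stop: str):
    """Return the time between two timeline timestamps as a timedelta."""
    return datetime.strptime(stop, FMT) - datetime.strptime(start, FMT)


# Timestamps copied from the timeline: first "snapshot and start collect"
# until the first "collect completed", and the final assemble run.
initial_collect = elapsed("2023/03/18 20:05:30", "2023/03/29 10:22:24")
final_assemble = elapsed("2023/04/03 09:30:00", "2023/04/03 16:12:38")

print("initial collect:", initial_collect)
print("final assemble: ", final_assemble)
```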



[1]

$ ct-sans snapshot >snapshot.stdout
$ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr


[2]

In addition to adding the assemble command, v0.0.2 automatically stores notice.txt files in each log's directory. This means the output in stdout can be discarded rather than stored and managed manually in the long run (e.g., to grep for NOTICE prints when assembling data sets).

Commit ad9fb49670e28414637761bac4b8e8940e2d6770 includes a Go program that transforms an existing collect.stderr file to notice.txt files.

Steps to migrate:

  • [x] Stop (ctrl+c, wait)
  • [x] Move collect.{stdout,stderr} to data/notes/
  • [x] grep NOTICE data/notes/collect.stdout | wc -l gives 6919 lines
  • [x] Run the program in the above commit with the appropriate directory and noticeFile paths; see the output below.
  • [x] wc -l $(find . -name notice.txt) reports a total of 6919 lines
  • [x] go install git.cs.kau.se/rasmoste/ct-sans@latest, which downloaded v0.0.2
  • [x] Run the same collect command as in note [1]; this will not overwrite the previous collect files because they have been moved to data/notes/. In the future we will not need to store any of this, but we keep it for now in case something goes wrong.
  • [x] The only two logs that had entries left to download resumed

Output from the migrate program and sanity check:

$ go run .
2023/03/27 14:57:41 Google 'Argon2023' log: 608 notices
2023/03/27 14:57:41 Google 'Argon2024' log: 101 notices
2023/03/27 14:57:41 Google 'Xenon2023' log: 2119 notices
2023/03/27 14:57:41 Google 'Xenon2024' log: 170 notices
2023/03/27 14:57:41 Cloudflare 'Nimbus2023' Log: 2194 notices
2023/03/27 14:57:41 Cloudflare 'Nimbus2024' Log: 164 notices
2023/03/27 14:57:41 DigiCert Yeti2024 Log: 17 notices
2023/03/27 14:57:41 DigiCert Yeti2025 Log: no notices
2023/03/27 14:57:41 DigiCert Nessie2023 Log: 155 notices
2023/03/27 14:57:41 DigiCert Nessie2024 Log: 19 notices
2023/03/27 14:57:41 DigiCert Nessie2025 Log: no notices
2023/03/27 14:57:41 Sectigo 'Sabre' CT log: 1140 notices
2023/03/27 14:57:41 Let's Encrypt 'Oak2023' log: 156 notices
2023/03/27 14:57:41 Let's Encrypt 'Oak2024H1' log: 14 notices
2023/03/27 14:57:41 Let's Encrypt 'Oak2024H2' log: no notices
2023/03/27 14:57:41 Trust Asia Log2023: 62 notices
2023/03/27 14:57:41 Trust Asia Log2024-2: no notices
$ wc -l $(find . -name notice.txt)
101 ./data/logs/eecdd064d5db1acec55cb79db4cd13a23287467cbcecdec351485946711fb59b/notice.txt
 14 ./data/logs/3b5377753e2db9804e8b305b06fe403b67d84fc3f4c7bd000d2d726fe1fad417/notice.txt
155 ./data/logs/b3737707e18450f86386d605a9dc11094a792db1670c0b87dcf0030e7936a59a/notice.txt
 62 ./data/logs/e87ea7660bc26cf6002ef5725d3fe0e331b9393bb92fbf58eb3b9049daf5435a/notice.txt
164 ./data/logs/dab6bf6b3fb5b6229f9bc2bb5c6be87091716cbb51848534bda43d3048d7fbab/notice.txt
608 ./data/logs/e83ed0da3ef5063532e75728bc896bc903d3cbd1116beceb69e1777d6d06bd6e/notice.txt
156 ./data/logs/b73efb24df9c4dba75f239c5ba58f46c5dfc42cf7a9f35c49e1d098125edb499/notice.txt
1140 ./data/logs/5581d4c2169036014aea0b9b573c53f0c0e43878702508172fa3aa1d0713d30c/notice.txt
 19 ./data/logs/73d99e891b4c9678a0207d479de6b2c61cd0515e71192a8c6b80107ac17772b5/notice.txt
170 ./data/logs/76ff883f0ab6fb9551c261ccf587ba34b4a4cdbb29dc68420a9fe6674c5a3a74/notice.txt
2119 ./data/logs/adf7befa7cff10c88b9d3d9c1e3e186ab467295dcfb10c24ca858634ebdc828a/notice.txt
2194 ./data/logs/7a328c54d8b72db620ea38e0521ee98416703213854d3bd22bc13a57a352eb52/notice.txt
 17 ./data/logs/48b0e36bdaa647340fe56a02fa9d30eb1c5201cb56dd2c81d9bbbfab39d88473/notice.txt
6919 total
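
The sanity check above amounts to comparing the per-log notice counts with the number of NOTICE lines found by grep. A minimal sketch of that comparison follows; the counts are copied from the migrate program's output, and the variable names are ours.

```python
# Notice counts per log, copied from the migrate program's output.
notices = {
    "Google 'Argon2023' log": 608,
    "Google 'Argon2024' log": 101,
    "Google 'Xenon2023' log": 2119,
    "Google 'Xenon2024' log": 170,
    "Cloudflare 'Nimbus2023' Log": 2194,
    "Cloudflare 'Nimbus2024' Log": 164,
    "DigiCert Yeti2024 Log": 17,
    "DigiCert Yeti2025 Log": 0,
    "DigiCert Nessie2023 Log": 155,
    "DigiCert Nessie2024 Log": 19,
    "DigiCert Nessie2025 Log": 0,
    "Sectigo 'Sabre' CT log": 1140,
    "Let's Encrypt 'Oak2023' log": 156,
    "Let's Encrypt 'Oak2024H1' log": 14,
    "Let's Encrypt 'Oak2024H2' log": 0,
    "Trust Asia Log2023": 62,
    "Trust Asia Log2024-2": 0,
}

expected_total = 6919  # number of NOTICE lines reported by grep | wc -l

total = sum(notices.values())
logs_with_notices = sum(1 for count in notices.values() if count > 0)

print(f"{total} notices in {logs_with_notices} of {len(notices)} logs")
assert total == expected_total, "notice.txt files do not add up to the grep count"
```

Only logs with at least one notice appear in the wc output, which is why it lists fewer files than there are logs.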


[3]

For some reason Nimbus2023 is stuck at


while trying to fetch until


These tree heads are not inconsistent, and a restart should resolve the problem.

(There is likely a corner case somewhere that made the fetcher exit or halt. We should debug this further at some point, but it has not happened more than once.)


[4]

Quick overview:

$ vnstat -d
 ens160  /  daily

          day        rx      |     tx      |    total    |   avg. rate
     2023-03-18     1.49 TiB |   17.07 GiB |    1.51 TiB |  153.44 Mbit/s
     2023-03-19     3.77 TiB |   41.21 GiB |    3.81 TiB |  387.83 Mbit/s
     2023-03-20     3.09 TiB |   36.67 GiB |    3.13 TiB |  318.26 Mbit/s
     2023-03-21     3.11 TiB |   32.24 GiB |    3.14 TiB |  319.61 Mbit/s
     2023-03-22     2.08 TiB |   25.98 GiB |    2.10 TiB |  213.89 Mbit/s
     2023-03-23     1.16 TiB |   15.59 GiB |    1.18 TiB |  119.97 Mbit/s
     2023-03-24     1.17 TiB |   15.44 GiB |    1.18 TiB |  120.44 Mbit/s
     2023-03-25     1.18 TiB |   15.72 GiB |    1.19 TiB |  121.55 Mbit/s
     2023-03-26   707.47 GiB |    9.64 GiB |  717.11 GiB |   71.30 Mbit/s
     2023-03-27   448.80 GiB |    6.43 GiB |  455.23 GiB |   45.26 Mbit/s
     2023-03-28   451.49 GiB |    6.49 GiB |  457.98 GiB |   45.53 Mbit/s
     2023-03-29     1.01 TiB |   12.73 GiB |    1.03 TiB |  104.45 Mbit/s
     2023-03-30   256.75 GiB |    3.40 GiB |  260.15 GiB |   59.59 Mbit/s
     estimated    591.55 GiB |    7.84 GiB |  599.39 GiB |
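
To get a rough total from the daily figures, the rx column can be summed after converting everything to the same unit. This is a minimal sketch, assuming vnstat reports TiB/GiB as binary units and Mbit/s as a decimal unit (which matches the table to within rounding); the helper name avg_mbit_per_s is ours.

```python
TIB = 2**40  # bytes per TiB
GIB = 2**30  # bytes per GiB

# Daily rx values copied from the vnstat output (2023-03-18 to 2023-03-30);
# the "estimated" row is excluded.
rx_tib = [1.49, 3.77, 3.09, 3.11, 2.08, 1.16, 1.17, 1.18, 1.01]
rx_gib = [707.47, 448.80, 451.49, 256.75]

total_rx_bytes = sum(v * TIB for v in rx_tib) + sum(v * GIB for v in rx_gib)
total_rx_tib = total_rx_bytes / TIB


def avg_mbit_per_s(total_tib: float, seconds: int = 86400) -> float:
    """Average rate in Mbit/s for total_tib transferred over the given seconds."""
    return total_tib * TIB * 8 / seconds / 1e6


# Busiest day (2023-03-19): 3.81 TiB in total over 24 hours.
peak_day_rate = avg_mbit_per_s(3.81)

print(f"total rx: {total_rx_tib:.2f} TiB")
print(f"2023-03-19 average rate: {peak_day_rate:.0f} Mbit/s")
```

Because the inputs are already rounded to two decimals, the recomputed rate differs slightly from vnstat's own avg. rate column.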


[5]

We use at most 58GiB of RAM for sorting and 8 parallel sort workers. According to the GNU sort manual, more workers than this do not noticeably improve performance. We also set LC_ALL=C to ensure a consistent sort order (see man sort).

$ export LC_ALL=C
$ ct-sans assemble -b 58 -p 8 >assemble.stdout

(We don't need to change the default directories, because the collected data is stored in ./data and /tmp is a fine place to put things on our system.)
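
To illustrate what LC_ALL=C buys us: in the C locale, sort compares raw byte values, so the result does not depend on the system's language settings. The following minimal sketch mimics that ordering in Python; it is not how ct-sans assemble is implemented, and the function name and sample names are hypothetical.

```python
def sort_unique_bytewise(lines):
    """Deduplicate byte strings and sort them by raw byte value.

    This mirrors the ordering of `LC_ALL=C sort -u`, since Python compares
    bytes objects byte by byte.
    """
    return sorted(set(lines))


# Hypothetical sample input with a duplicate and mixed case.
sans = [
    b"b.example.org",
    b"A.example.org",
    b"a.example.org",
    b"b.example.org",
    b"-x.example.org",
]

result = sort_unique_bytewise(sans)
for name in result:
    print(name.decode())
```

In byte order, "-" sorts before uppercase letters, which sort before lowercase letters. A locale-aware sort may interleave these differently, which is why the locale is pinned before sorting.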


There are 0.91B unique SANs in the 25.2GiB dataset (6.1GiB compressed):

$ du -shb data/archive/2023-04-03-ct-sans
27050799992     data/archive/2023-04-03-ct-sans
$ python3 -c "print(f'{27050799992 / 1024**3:.1f}GiB')"
25.2GiB
$ du -shb data/archive/2023-04-03-ct-sans.zip
6526876407      data/archive/2023-04-03-ct-sans.zip
$ python3 -c "print(f'{6526876407 / 1024**3:.1f}GiB')"
6.1GiB
$ wc -l data/archive/2023-04-03-ct-sans/sans.lst
907332515 data/archive/2023-04-03-ct-sans/sans.lst
$ python3 -c "print(f'{907332515 / 1000**3:.2f}B')"
0.91B

These SANs were found in 3.74B certificates from 17 CT logs:

$ grep "In total," data/archive/2023-04-03-ct-sans/README.md
In total, 3743244652 certificates were downloaded from 17 CT logs;
$ python3 -c "print(f'{3743244652 / 1000**3:.2f}B')"
3.74B


[6]

$ ct-sans snapshot >snapshot.stdout
$ ct-sans collect --workers 40 --batch-disk 131072 --batch-req 2048 --metrics 60m >collect.stdout 2>collect.stderr


[7]

We decided to abort another round of ct-sans (and subsequent onion-grab) measurements because it is not strictly needed to achieve our goals. If we want to make more measurements for the sake of keeping the ct-sans data set available, we should automate the process rather than do it manually as in this timeline.