\section{Sources of Website Oracles} \label{cat:sec:sources}
There are a wide range of potential sources of WOs. Table~\ref{cat:table:sources}
summarizes a selection of sources that are more thoroughly detailed in
Appendix~\ref{cat:app:sources}. The table shows the availability of the source,
i.e., if the attacker needs to query the source in near real-time as a website
visit occurs or if it can be accessed retroactively, e.g., through a legal
request. We also estimate qualitatively the false positive rate of the source,
its coverage of websites it can monitor (or fraction of Tor network traffic,
depending on source), as well as the estimated effort to access the source.
Finally, the table gives an example of an actor with access to the source. 

Next we focus on a number of sources of WOs that we find particularly relevant:
several due to DNS in Section~\ref{cat:sec:sources:dns}, the DHT of Tor onion
directory services in Section~\ref{cat:sec:sources:dht}, and real-time bidding
platforms in Section~\ref{cat:sec:sources:rtb}. 

\begin{sidewaystable}
	\caption{Comparison of a number of WO sources based on their \emph{estimated} time of availability (when attacker likely has to collect data, i.e., retroactively or real-time), False Positive Rate (FPR), coverage of website/network visits, and primary entities with access.}
	\centering
	\label{cat:table:sources}
	\begin{tabular}{l c c c c r}
	Source &  Availability & FPR & Coverage & Effort & Access \\ \midrule
	Dragnet surveillance programmes  & retroactive & negl. & high & high & intelligence agencies \\
	Content Delivery Networks & retroactive & negl. & high & high & operators\\
	Real-time bidding & real-time (retroactive) & negl. & high & modest & customers (operator)\\
	Webserver access logs & retroactive & negl. & high & medium & operators\\
	Middleboxes & retroactive~\cite{bluecoat} & negl. & medium & medium & operators \\
	OCSP & retroactive & low & high & medium & few CAs, plaintext\\
	8.8.8.8 operator & retroactive & low~\cite{DBLP:conf/ndss/GreschbachPRWF17} & 16.8\% of visits & high & Google, plaintext \\
	1.1.1.1 operator & retroactive & low~\cite{DBLP:conf/ndss/GreschbachPRWF17} & 7.4\% of visits & high & Cloudflare, plaintext \\
	Exit relays & real-time & negl. & low & low & operators \\
	Exit relays DNS cache & real-time & medium & high & medium & anyone\\
	Query DNS resolvers & real-time & high & low & low & anyone \\
	Onion v2 (v3) & real-time & negl. & high (low) & low (high) & anyone \\
	% &  &  &  &  & \\
	\end{tabular}
\end{sidewaystable}

\subsection{DNS}
\label{cat:sec:sources:dns}
Before a website visit the corresponding domain name must be resolved to an IP
address. For a user that uses Tor browser, the exit relay of the
current circuit resolves the domain name. If the DNS record of the domain
name is already cached in the DNS cache of the exit relay, then the exit relay
uses that record. Otherwise the domain name is resolved and subsequently cached
using whichever DNS resolution mechanism that the exit relay has configured.
Based on this process we present three sources of WOs that work for unpopular
websites.

\subsubsection{Shared Pending DNS Resolutions}
If an exit relay is asked to resolve a domain name that is uncached it will
create a list of pending connections waiting for the domain resolution to
finish. If another connection asks that the same domain name be resolved, it is
added to the list of pending connections.  When a result is available all
pending connections are informed.  This is the basis of a WO: if a request to
resolve a domain name returns a record \emph{more quickly than previously
measured by the attacker for uncached entries}, the entry was either pending
resolution at the time of the request or already cached. Notably this works
regardless of if exit relays have DNS caches or not. However, the timing
constraints of shared pending connections are significant and thus a practical
hurdle to overcome.

\subsubsection{Tor's DNS Cache at Exit Relays}
If an unpopular website is visited by a user, the resolved domain name will
likely be cached by a \emph{single} relay.  We
performed 411 \texttt{exitmap}~\cite{DBLP:conf/pet/WinterKMHSLW14}
measurements between April 1--10 (2019), collecting on average 3544
(un)cached data points for each exit using a domain under our control
that is not used by anyone else.

Given a labelled data set of (un)cached times for each exit relay, we can
construct distinct \emph{per-relay} classifiers that predict whether a measured
time corresponds to an (un)cached domain name. While there are many different
approaches that could be used to build such a classifier, we decided to use a
simple heuristic that should result in little or no false positives: output
`cached' iff no uncached query has \emph{ever} been this fast before.
Figure~\ref{cat:fig:dns:classifier-idea} shows the idea of this classifier in
greater detail, namely create a \emph{threshold} that is the minimum of the
largest cached time and the smallest uncached time and then say cached iff the
measured time is smaller than the threshold. Regardless of how well this
heuristic performs (see below), it should be possible to construct other
classifiers that exploit the trend of smaller resolve times and less standard
deviation for cached queries (Figure~\ref{cat:fig:dns:dist}). For example, 69.1\% of
all exit relays take at least 50~ms more time to resolve an uncached domain on
average.

\begin{figure}[!t]
	\centering
	\includegraphics[width=.7\columnwidth]{src/cat/img/dns__classifier-idea}
	\caption{The two cases when deciding on a classifier's threshold.}
	\label{cat:fig:dns:classifier-idea}
\end{figure}

\begin{figure}[!t]
	\centering
	\includegraphics[width=.7\columnwidth]{src/cat/img/dns__timing-dist}
	\caption{%
		The difference between (un)cached standard deviation and mean times
		without any absolute values, i.e., a negative value implies
		that the uncached time is smaller than the cached time.
	}
	\label{cat:fig:dns:dist}
\end{figure}


To estimate an \emph{upper bound} on how effective the composite classifier of
all per-relay classifiers could be \emph{without any false positives} using
our heuristic, we applied ten-fold cross-validation to simply exclude every
exit relay that had false positives during any fold and then weighted the
observed bandwidth for the remaining classifiers by the individual true positive
rates.  This gives us an
estimate of how much bandwidth we could predict true positives for without
having any false positives. By comparing it to the total exit bandwidth of the
Tor network, we obtain an estimated upper bound true positive rate for the
composite classifier of $17.3$\%.

When an attacker measures if a domain is cached or not the domain will, after
the measurement, be cached for up to an hour (current highest caching duration
in Tor, independent of TTL) at every exit. However, if a an attacker can cause
an exit to run low on memory, the entire DNS cache will be removed (instead of
only parts of it) due to a bug in the out-of-memory manager of Tor. We have
reported this to the Tor
Project~\cite{bug-report}.
We further discuss in Section~\ref{cat:sec:disc} how frequently an attacker on
average can be expected to query a WO.

\subsubsection{Caching at Recursive DNS Resolvers}
For a website that is unpopular enough, there is a high chance that nobody on
the web visited the website within a given timeframe.  This is the basis of
our next idea for a WO which is \emph{not mutually exclusive} to the Tor
network: wait a couple of seconds after observing a connection, then probe all
recursive DNS resolvers of Tor exits that can be accessed to determine whether
any monitored website was cached approximately at the time of observing the
connection by inspecting TTLs. %the TTL of returned DNS records.

In 2016 Greschbach~\emph{et~al.}~\cite{DBLP:conf/ndss/GreschbachPRWF17} showed
that remote DNS resolvers like Google's \texttt{8.8.8.8} receive a large
portion of all DNS traffic that exit relays generate. To better understand how
the DNS resolver landscape looks today, we repeated their RIPE Atlas experiment
setup for 35 hours in February 17--18 (2019), measuring every 30 minutes. Our
results show that Google (16.8\%) and Cloudflare (7.4\%) are both popular. Many
exits use a same-AS resolver which is presumably the ISP (42.3\%), while other
exits resolve themselves (15.2\%) or use a remote DNS resolver that we did not
identify (18.2\%). Further, we note that there are at least one RIPE Atlas
network measurement probe in the same AS as 53.3\% of all exits, providing
access to many of the same DNS resolvers as used by exits from a similar network
vantage point. 

Instead of using RIPE Atlas nodes we opted for a different approach which is
\emph{strictly worse}: query Google's and Cloudflare's DNS resolvers from VMs in
16 Amazon EC2 regions.  With a simple experiment of first visiting a unique
domain (once again under our control and only used by us) using \texttt{torify
curl} and then querying the DNS resolvers from each Amazon VM to observe TTLs,
we got true positive rates of 2.9\% and 0.9\% for Google and Cloudflare with
1000 repetitions. While this may seem low, the cost for an attacker is at the
time of writing about 2 USD per day using on-demand pricing. Using an identical
setup we were also able to find a subset of monitored websites that yield
alarmingly high true positive rates: 61.4\% (Google) and 8.0\% (Cloudflare).
Presumably this was due to the cached entries being shared over a wider
geographical area for some reason (however, not globally). Regardless, coupled
with the fact that anyone can \emph{globally purge} the DNS caches of
Google~\cite{google-purge} and Cloudflare~\cite{cf-purge} for arbitrary domain
names, this is a noteworthy WO source.

\subsection{Onion Service Directories in Tor}
\label{cat:sec:sources:dht}
To access an onion service a user first obtains the service's \emph{descriptor}
from a Distributed Hash Table (DHT) maintained by \emph{onion service
directories}. From the descriptor the user learns of \emph{introduction points}
selected by the host of the onion service in the Tor network that are used to
establish a connection to the onion service in a couple of more
steps~\cite{onionv2,onionv3} that are irrelevant here. Observing a request for
the descriptor of a monitored onion service is a source for a WO. To observe
visits for a target (known) onion service in the DHT, a relay first has to be
selected as one out of six or eight (depending on version) relays to host the
descriptor in the DHT, and then the victim has to select that relay to retrieve
the descriptor. For v2 of onion services, the location in the DHT is
deterministic~\cite{onionv2} and an attacker can position its relays in such a
way to always be selected for hosting target descriptors. Version 3 of onion
services addresses this issue by randomising the process every 24
hours~\cite{onionv3}, forcing an attacker to host a significant number of relays
to get a WO for onion services with high coverage. At the time of writing, there
are about 3,500 relays operating as onion service directories.

\subsection{Real-Time Bidding}
\label{cat:sec:sources:rtb}
Real-Time Bidding (RTB) is an approach towards online advertisement that allows
a publisher to auction ad space to advertisers on a \emph{per-visit} basis in
real time~\cite{rtb}. Google's Display Network includes more than two million
websites that reach 90\% of all Internet users~\cite{google-dn}, and an
advertiser that uses RTB must respond to submitted bid
requests containing information such as the three first network bytes of an IPv4
address, the second-level domain name of the visited website, and the user agent
string within $\approx$100~ms~\cite{google-bid}. While the exact information
available to the bidder depends on the ad platform and the publisher's
advertisement settings, anonymous modes provide less
revenue~\cite{google-bid-anon}.
Combined with many flavours of pre-targeting such as IP and location
filtering~\cite{DBLP:conf/wpes/VinesRK17}, it is likely that the bidder knows
whether a user used Tor while accessing a monitored website. Vines et
al.~\cite{DBLP:conf/wpes/VinesRK17} further note that ``35\% of the DSPs also
allow arbitrary IP white-and blacklisting (Admedo, AdWords, Bing, BluAgile,
Criteo, Centro, Choozle, Go2Mobi, Simpli.fi)''. Finally, observe that an
attacker need not win a bid to use RTB as a WO.