\section{Sources of Website Oracles} \label{cat:sec:sources} There are a wide range of potential sources of WOs. Table~\ref{cat:table:sources} summarizes a selection of sources that are more thoroughly detailed in Appendix~\ref{cat:app:sources}. The table shows the availability of the source, i.e., if the attacker needs to query the source in near real-time as a website visit occurs or if it can be accessed retroactively, e.g., through a legal request. We also estimate qualitatively the false positive rate of the source, its coverage of websites it can monitor (or fraction of Tor network traffic, depending on source), as well as the estimated effort to access the source. Finally, the table gives an example of an actor with access to the source. Next we focus on a number of sources of WOs that we find particularly relevant: several due to DNS in Section~\ref{cat:sec:sources:dns}, the DHT of Tor onion directory services in Section~\ref{cat:sec:sources:dht}, and real-time bidding platforms in Section~\ref{cat:sec:sources:rtb}. \begin{sidewaystable} \caption{Comparison of a number of WO sources based on their \emph{estimated} time of availability (when attacker likely has to collect data, i.e., retroactively or real-time), False Positive Rate (FPR), coverage of website/network visits, and primary entities with access.} \centering \label{cat:table:sources} \begin{tabular}{l c c c c r} Source & Availability & FPR & Coverage & Effort & Access \\ \midrule Dragnet surveillance programmes & retroactive & negl. & high & high & intelligence agencies \\ Content Delivery Networks & retroactive & negl. & high & high & operators\\ Real-time bidding & real-time (retroactive) & negl. & high & modest & customers (operator)\\ Webserver access logs & retroactive & negl. & high & medium & operators\\ Middleboxes & retroactive~\cite{bluecoat} & negl. & medium & medium & operators \\ OCSP & retroactive & low & high & medium & few CAs, plaintext\\ 8.8.8.8 operator & retroactive & low~\cite{DBLP:conf/ndss/GreschbachPRWF17} & 16.8\% of visits & high & Google, plaintext \\ 1.1.1.1 operator & retroactive & low~\cite{DBLP:conf/ndss/GreschbachPRWF17} & 7.4\% of visits & high & Cloudflare, plaintext \\ Exit relays & real-time & negl. & low & low & operators \\ Exit relays DNS cache & real-time & medium & high & medium & anyone\\ Query DNS resolvers & real-time & high & low & low & anyone \\ Onion v2 (v3) & real-time & negl. & high (low) & low (high) & anyone \\ % & & & & & \\ \end{tabular} \end{sidewaystable} \subsection{DNS} \label{cat:sec:sources:dns} Before a website visit the corresponding domain name must be resolved to an IP address. For a user that uses Tor browser, the exit relay of the current circuit resolves the domain name. If the DNS record of the domain name is already cached in the DNS cache of the exit relay, then the exit relay uses that record. Otherwise the domain name is resolved and subsequently cached using whichever DNS resolution mechanism that the exit relay has configured. Based on this process we present three sources of WOs that work for unpopular websites. \subsubsection{Shared Pending DNS Resolutions} If an exit relay is asked to resolve a domain name that is uncached it will create a list of pending connections waiting for the domain resolution to finish. If another connection asks that the same domain name be resolved, it is added to the list of pending connections. When a result is available all pending connections are informed. This is the basis of a WO: if a request to resolve a domain name returns a record \emph{more quickly than previously measured by the attacker for uncached entries}, the entry was either pending resolution at the time of the request or already cached. Notably this works regardless of if exit relays have DNS caches or not. However, the timing constraints of shared pending connections are significant and thus a practical hurdle to overcome. \subsubsection{Tor's DNS Cache at Exit Relays} If an unpopular website is visited by a user, the resolved domain name will likely be cached by a \emph{single} relay. We performed 411 \texttt{exitmap}~\cite{DBLP:conf/pet/WinterKMHSLW14} measurements between April 1--10 (2019), collecting on average 3544 (un)cached data points for each exit using a domain under our control that is not used by anyone else. Given a labelled data set of (un)cached times for each exit relay, we can construct distinct \emph{per-relay} classifiers that predict whether a measured time corresponds to an (un)cached domain name. While there are many different approaches that could be used to build such a classifier, we decided to use a simple heuristic that should result in little or no false positives: output `cached' iff no uncached query has \emph{ever} been this fast before. Figure~\ref{cat:fig:dns:classifier-idea} shows the idea of this classifier in greater detail, namely create a \emph{threshold} that is the minimum of the largest cached time and the smallest uncached time and then say cached iff the measured time is smaller than the threshold. Regardless of how well this heuristic performs (see below), it should be possible to construct other classifiers that exploit the trend of smaller resolve times and less standard deviation for cached queries (Figure~\ref{cat:fig:dns:dist}). For example, 69.1\% of all exit relays take at least 50~ms more time to resolve an uncached domain on average. \begin{figure}[!t] \centering \includegraphics[width=.7\columnwidth]{src/cat/img/dns__classifier-idea} \caption{The two cases when deciding on a classifier's threshold.} \label{cat:fig:dns:classifier-idea} \end{figure} \begin{figure}[!t] \centering \includegraphics[width=.7\columnwidth]{src/cat/img/dns__timing-dist} \caption{% The difference between (un)cached standard deviation and mean times without any absolute values, i.e., a negative value implies that the uncached time is smaller than the cached time. } \label{cat:fig:dns:dist} \end{figure} To estimate an \emph{upper bound} on how effective the composite classifier of all per-relay classifiers could be \emph{without any false positives} using our heuristic, we applied ten-fold cross-validation to simply exclude every exit relay that had false positives during any fold and then weighted the observed bandwidth for the remaining classifiers by the individual true positive rates. This gives us an estimate of how much bandwidth we could predict true positives for without having any false positives. By comparing it to the total exit bandwidth of the Tor network, we obtain an estimated upper bound true positive rate for the composite classifier of $17.3$\%. When an attacker measures if a domain is cached or not the domain will, after the measurement, be cached for up to an hour (current highest caching duration in Tor, independent of TTL) at every exit. However, if a an attacker can cause an exit to run low on memory, the entire DNS cache will be removed (instead of only parts of it) due to a bug in the out-of-memory manager of Tor. We have reported this to the Tor Project~\cite{bug-report}. We further discuss in Section~\ref{cat:sec:disc} how frequently an attacker on average can be expected to query a WO. \subsubsection{Caching at Recursive DNS Resolvers} For a website that is unpopular enough, there is a high chance that nobody on the web visited the website within a given timeframe. This is the basis of our next idea for a WO which is \emph{not mutually exclusive} to the Tor network: wait a couple of seconds after observing a connection, then probe all recursive DNS resolvers of Tor exits that can be accessed to determine whether any monitored website was cached approximately at the time of observing the connection by inspecting TTLs. %the TTL of returned DNS records. In 2016 Greschbach~\emph{et~al.}~\cite{DBLP:conf/ndss/GreschbachPRWF17} showed that remote DNS resolvers like Google's \texttt{8.8.8.8} receive a large portion of all DNS traffic that exit relays generate. To better understand how the DNS resolver landscape looks today, we repeated their RIPE Atlas experiment setup for 35 hours in February 17--18 (2019), measuring every 30 minutes. Our results show that Google (16.8\%) and Cloudflare (7.4\%) are both popular. Many exits use a same-AS resolver which is presumably the ISP (42.3\%), while other exits resolve themselves (15.2\%) or use a remote DNS resolver that we did not identify (18.2\%). Further, we note that there are at least one RIPE Atlas network measurement probe in the same AS as 53.3\% of all exits, providing access to many of the same DNS resolvers as used by exits from a similar network vantage point. Instead of using RIPE Atlas nodes we opted for a different approach which is \emph{strictly worse}: query Google's and Cloudflare's DNS resolvers from VMs in 16 Amazon EC2 regions. With a simple experiment of first visiting a unique domain (once again under our control and only used by us) using \texttt{torify curl} and then querying the DNS resolvers from each Amazon VM to observe TTLs, we got true positive rates of 2.9\% and 0.9\% for Google and Cloudflare with 1000 repetitions. While this may seem low, the cost for an attacker is at the time of writing about 2 USD per day using on-demand pricing. Using an identical setup we were also able to find a subset of monitored websites that yield alarmingly high true positive rates: 61.4\% (Google) and 8.0\% (Cloudflare). Presumably this was due to the cached entries being shared over a wider geographical area for some reason (however, not globally). Regardless, coupled with the fact that anyone can \emph{globally purge} the DNS caches of Google~\cite{google-purge} and Cloudflare~\cite{cf-purge} for arbitrary domain names, this is a noteworthy WO source. \subsection{Onion Service Directories in Tor} \label{cat:sec:sources:dht} To access an onion service a user first obtains the service's \emph{descriptor} from a Distributed Hash Table (DHT) maintained by \emph{onion service directories}. From the descriptor the user learns of \emph{introduction points} selected by the host of the onion service in the Tor network that are used to establish a connection to the onion service in a couple of more steps~\cite{onionv2,onionv3} that are irrelevant here. Observing a request for the descriptor of a monitored onion service is a source for a WO. To observe visits for a target (known) onion service in the DHT, a relay first has to be selected as one out of six or eight (depending on version) relays to host the descriptor in the DHT, and then the victim has to select that relay to retrieve the descriptor. For v2 of onion services, the location in the DHT is deterministic~\cite{onionv2} and an attacker can position its relays in such a way to always be selected for hosting target descriptors. Version 3 of onion services addresses this issue by randomising the process every 24 hours~\cite{onionv3}, forcing an attacker to host a significant number of relays to get a WO for onion services with high coverage. At the time of writing, there are about 3,500 relays operating as onion service directories. \subsection{Real-Time Bidding} \label{cat:sec:sources:rtb} Real-Time Bidding (RTB) is an approach towards online advertisement that allows a publisher to auction ad space to advertisers on a \emph{per-visit} basis in real time~\cite{rtb}. Google's Display Network includes more than two million websites that reach 90\% of all Internet users~\cite{google-dn}, and an advertiser that uses RTB must respond to submitted bid requests containing information such as the three first network bytes of an IPv4 address, the second-level domain name of the visited website, and the user agent string within $\approx$100~ms~\cite{google-bid}. While the exact information available to the bidder depends on the ad platform and the publisher's advertisement settings, anonymous modes provide less revenue~\cite{google-bid-anon}. Combined with many flavours of pre-targeting such as IP and location filtering~\cite{DBLP:conf/wpes/VinesRK17}, it is likely that the bidder knows whether a user used Tor while accessing a monitored website. Vines et al.~\cite{DBLP:conf/wpes/VinesRK17} further note that ``35\% of the DSPs also allow arbitrary IP white-and blacklisting (Admedo, AdWords, Bing, BluAgile, Criteo, Centro, Choozle, Go2Mobi, Simpli.fi)''. Finally, observe that an attacker need not win a bid to use RTB as a WO.