path: root/summary/src/introduction/main.tex
author    Rasmus Dahlberg <rasmus@rgdd.se>  2024-10-15 16:08:16 +0200
committer Rasmus Dahlberg <rasmus@rgdd.se>  2024-10-15 16:08:16 +0200
commit    385cc92bc91e1a6c3724085c060e76bf40c13ed3 (patch)
tree      26d0a8f81f2caa472830fd40a51844bb202c1355 /summary/src/introduction/main.tex
Import PhD thesis
diff --git a/summary/src/introduction/main.tex b/summary/src/introduction/main.tex
new file mode 100644
index 0000000..da013ab
--- /dev/null
+++ b/summary/src/introduction/main.tex
@@ -0,0 +1,826 @@
+\section{Introduction}
+
+The security posture of the Internet has improved significantly over the
+last decade. For example,
+ the cleaned-up and formally verified TLS 1.3 protocol that underpins HTTPS
+ has been rolled out gradually~\cite{tls-timeline},
+ the certificates that specify which public keys to use when bootstrapping a
+ secure connection can be obtained for free and automatically~\cite{le}, and
+ web browsers have shifted from positive to negative security indicators in
+ favor of security-by-default~\cite{browser-ui}.
+The use of end-to-end encryption has further become the norm with services such
+as
+ DNS-over-HTTPS~\cite{rfc8484},
+ virtual private networks~\cite{wireguard},
+ Tor~\cite{tor}, and
+ secure messaging~\cite{mls}
+gaining traction. In other words, the era of attackers that can passively snoop
+and actively tamper with unencrypted~network~traffic~is~over.
+
+What will remain the same is the incentive for attackers to snoop and tamper
+with network traffic. Therefore, the focus is (and will likely continue to be)
+on circumventing protocols that add security and privacy as they are deployed
+in the real world. For example, there is a long history of certificate
+mis-issuance that allows attackers to impersonate websites and thus insert
+themselves as machines-in-the-middle (``MitM'') without actually breaking
+TLS~\cite{sok-https,sslmate-history}. Or, in the case of encrypted channels
+that are hard to intercept, instead analyzing traffic patterns to infer user
+activity like which website is being
+visited~\cite{cheng98,herrmann09,hintz02,liberatore06,panchenko11,sun02}. The
+bad news is that attackers only need to find one vulnerability in a deployed
+protocol or its software. Sometimes, such vulnerabilities can be purchased by
+zero-day brokers like Zerodium~\cite{zerodium}.
+
+To address an attack vector, it is common to add countermeasures that frustrate
+attackers and/or increase the risk involved while trying to exploit a system.
+A good example is how the certificate authority ecosystem evolved. For
+background, certificate authorities are trusted parties that validate domain
+names before issuing certificates that list their public keys. Web browsers
+are shipped with hundreds of trusted certificate authorities, which means that
+the resulting TLS connections cannot be more secure than the difficulty of
+hijacking the weakest-link certificate authority~\cite{sok-https}. A proposal
+eventually deployed to mitigate this issue is Certificate Transparency: an
+ecosystem of public append-only logs that publishes all certificates so that
+any mis-issuance can be \emph{detected} by monitors~\cite{ct,rfc6962}.
+These logs have a cryptographic foundation that holds them and the issuing
+certificate authorities \emph{accountable}, at least in theory. In practice,
+the logs are essentially trusted parties that must act honestly due to how web
+browsers shape their policies to respect user
+privacy~\cite{apple-log-policy,google-log-policy,sok-sct-auditing,ct-history}.
+
+The first objective of this thesis is to better understand the current limits
+of Certificate Transparency by proposing and evaluating improvements which
+\emph{reduce} the amount of trust that needs to be placed in third-party
+monitors and logs. We make a dent in the problem of Certificate Transparency
+verification both generally and concretely in the context of Tor Browser, which
+unlike Google Chrome and Apple's Safari does not support Certificate
+Transparency yet. For context, Tor Browser is a fork of
+Mozilla's Firefox that (among other things) routes user traffic through the
+low-latency anonymity network Tor~\cite{tor,tb}. As part of our pursuit to
+improve the status quo for Certificate Transparency verification in Tor
+Browser, the second objective of this thesis is to evaluate how the protocols
+used during website visits affect unlinkability between senders (web browsers)
+and receivers (websites). Our evaluation applies to our addition of
+Certificate Transparency and other protocols already in use, e.g.,
+ DNS,
+ real-time bidding~\cite{rtb}, and
+ certificate revocation checking~\cite{ocsp}.
+
+The remainder of the introductory summary is structured as follows.
+Section~\ref{sec:background} introduces background that will help the reader
+ understand the context and preliminaries of the appended papers.
+Section~\ref{sec:rqs} defines our research questions and overall objective.
+Section~\ref{sec:methods} provides an overview of our research methods.
+Section~\ref{sec:contribs} describes our contributions succinctly.
+Section~\ref{sec:appended} summarizes the appended papers that are published in
+ NordSec (Paper~\ref{paper:lwm}),
+ SECURWARE (Paper~\ref{paper:ctga}),
+ PETS (Paper~\ref{paper:ctor} and~\ref{paper:cat}),
+ WPES (Paper~\ref{paper:sauteed}), and
+ USENIX Security (Paper~\ref{paper:tlwo}).
+Section~\ref{sec:related} positions our contributions with regard to related
+work. Section~\ref{sec:concl} concludes and briefly discusses future work.
+
+\section{Background} \label{sec:background}
+
+This section introduces background on Certificate Transparency and Tor.
+
+\subsection{Certificate Transparency}
+
+The web's public-key infrastructure depends on certificate authorities to issue
+certificates that map domain names to public keys. For example, the
+certificate of \texttt{www.example.com} is issued by DigiCert and lists a
+2048-bit RSA key~\cite{crt:www.example.com}. The fact that DigiCert signed
+this certificate means that they claim to have verified that the requesting
+party is really \texttt{www.example.com}, typically by first checking that the
+requester can publish a specified DNS record on request~\cite{ca/b}. If all
+certificate authorities performed these checks correctly and the checks
+themselves were foolproof, a user's browser could be sure that any
+certificate signed by a certificate authority would list a verified public key
+that can be used for authentication when connecting to a website via TLS.
+Unfortunately, there are hundreds of trusted certificate authorities and a long
+history of issues surrounding their operations in
+practice~\cite{bambo-cas,sok-https,sslmate-history}. One of the most famous
+incidents took place in 2011: an attacker managed to mis-issue certificates from
+DigiNotar to intercept traffic towards Google and others in
+Iran~\cite{black-tulip}. The astonishing part is that this incident was first
+detected \emph{seven weeks later}.
+
+Certificate Transparency aims to facilitate \emph{detection} of issued
+certificates, thus holding certificate authorities \emph{accountable} for any
+certificates that they mis-issue~\cite{ct,rfc6962}. The basic idea is shown in
+Figure~\ref{fig:ct-idea}. In addition to regular validation rules, browsers
+ensure certificates are included in a public append-only Certificate
+Transparency \emph{log}.
+This allows anyone to get a concise view of all certificates that users
+may encounter, including domain owners like Google who can then see for
+themselves whether any of the published certificates are mis-issued. The
+parties inspecting the logs are called \emph{monitors}. Some monitors mirror
+all log entries~\cite{crt.sh}, while others discard most of them in pursuit of
+finding matches for pre-defined criteria like
+\texttt{*.example.com}~\cite{certspotter}. Another option is subscribing to
+certificate notifications from a trusted third-party~\cite{ct-monitors}.
+
+\begin{figure}[!t]
+ \centering\includegraphics[width=0.8\textwidth]{src/introduction/img/ct}
+ \caption{%
+ The idea of Certificate Transparency. Certificates encountered by users
+ must be included in a public log so that monitors can detect mis-issuance.
+ }
+ \label{fig:ct-idea}
+\end{figure}
+
+What makes Certificate Transparency a significant improvement compared to the
+certificate authority ecosystem is that the logs stand on a cryptographic
+foundation that can be verified. A log can be viewed as an append-only
+tamper-evident list of certificates. It is efficient\footnote{%
+ Efficient refers to space-time complexity $\mathcal{O}(\log n)$,
+ where $n$ is the number of log entries.
+} to prove cryptographically that a certificate is in the list, and that a
+current version of the list is append-only with regard to a previous version
+(i.e., no tampering or reordering).\footnote{%
+ Interested readers can refer to our Merkle tree and proof technique
+ introduction online~\cite{merkle-intro}.
+} These properties follow from using a Merkle tree structure that supports
+\emph{inclusion} and \emph{consistency}
+proofs~\cite{history-trees,ct-formal,rfc6962,merkle}.
+The reader
+only needs to know that these proofs are used to reconstruct a log's Merkle tree
+head, often referred to as a \emph{root hash}. It is a cryptographic hash
+identifying a list of certificates uniquely in a tree data structure. The logs
+sign root hashes with the number of entries and a timestamp to form \emph{signed
+tree heads}. So, if an inconsistency is discovered, it cannot be denied. Log
+operators are therefore held accountable for maintaining the append-only
+property. A party that verifies the efficient transparency log proofs without
+downloading all the logs is called an \emph{auditor}.
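
To make the Merkle tree mechanics concrete, the following is a minimal sketch (in Python, not the RFC 6962 wire format) of how an auditor could recompute a root hash from a certificate's inclusion proof. The 0x00/0x01 domain-separation prefixes and the split point follow RFC 6962's tree definition; all function names are our own.

```python
import hashlib

def leaf_hash(entry: bytes) -> bytes:
    # RFC 6962 domain-separates leaves (0x00) from interior nodes (0x01).
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def split_point(n: int) -> int:
    # Largest power of two strictly smaller than n (RFC 6962 split point).
    k = 1
    while 2 * k < n:
        k *= 2
    return k

def merkle_root(hashes: list) -> bytes:
    # Root hash over a list of leaf hashes.
    if len(hashes) == 1:
        return hashes[0]
    k = split_point(len(hashes))
    return node_hash(merkle_root(hashes[:k]), merkle_root(hashes[k:]))

def inclusion_proof(hashes: list, index: int) -> list:
    # Audit path: the sibling subtree hashes needed to reach the root.
    if len(hashes) == 1:
        return []
    k = split_point(len(hashes))
    if index < k:
        return inclusion_proof(hashes[:k], index) + [merkle_root(hashes[k:])]
    return inclusion_proof(hashes[k:], index - k) + [merkle_root(hashes[:k])]

def root_from_path(index: int, size: int, leaf: bytes, path: list) -> bytes:
    # Recompute the root from a leaf hash and its audit path
    # (the verification algorithm of RFC 9162, Section 2.1.3.2).
    node, fn, sn = leaf, index, size - 1
    for p in path:
        if fn % 2 == 1 or fn == sn:
            node = node_hash(p, node)
            while fn % 2 == 0 and fn != 0:  # skip levels where the node
                fn //= 2                    # was a right-most, unpaired one
                sn //= 2
        else:
            node = node_hash(node, p)
        fn //= 2
        sn //= 2
    return node

certs = [b"cert-a", b"cert-b", b"cert-c", b"cert-d", b"cert-e"]
leaves = [leaf_hash(c) for c in certs]
root = merkle_root(leaves)
for i, leaf in enumerate(leaves):
    assert root_from_path(i, len(leaves), leaf, inclusion_proof(leaves, i)) == root
```

An auditor holding a signed tree head thus needs only the leaf and its logarithmically sized audit path, not the full list of certificates, to check that the recomputed root matches the signed root hash.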
+
+A log that signs two inconsistent tree heads is said to perform a
+\emph{split-view}. To ensure that everyone observes the same append-only logs,
+all participants of the Certificate Transparency ecosystem must engage in a
+\emph{gossip protocol}~\cite{chuat,nordberg}. In other words, just because
+Alice observes an append-only log, it is not necessarily the \emph{same
+append-only log} that Bob observes. Therefore, Alice and Bob must exchange
+signed tree heads and verify consistency to assert that the log operators play
+by the rules and only append certificates. Without a secure gossip protocol,
+log operators would have to be trusted blindly (much like certificate
+authorities before Certificate Transparency). RFC~6962 defers the specification
+of gossip~\cite{rfc6962}, with little or no meaningful gossip deployed yet.
+
+Rolling out Certificate Transparency without breakage on the web is a
+challenge~\cite{does-ct-break-the-web}. Certificates must be logged, associated
+proofs delivered to end-user software, and more. One solution RFC~6962
+ultimately put forth was the introduction of \emph{signed certificate
+timestamps}. A signed certificate timestamp is a log's \emph{promise} that a
+certificate will be appended to the log within a \emph{maximum merge delay}
+(typically 24 hours). Verifying whether a log holds its promise is
+usually called \emph{auditing}.
+Certificate authorities can obtain signed certificate
+timestamps and embed them in their final certificates by logging a
+\emph{pre-certificate}. As such, there is no added latency from building the
+underlying Merkle tree and no need for server software to be updated (as the
+final certificate contains the information needed). The current policy for
+Google Chrome and Apple's Safari is to reject certificates with fewer than two
+signed certificate timestamps~\cite{apple-log-policy,google-log-policy}. How to
+request an inclusion proof for a promise without leaking the user's browsing
+history to the log is an open problem~\cite{sok-sct-auditing}. In other words,
+asking for an inclusion proof trivially reveals the certificate of interest to
+the log.
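
The promise check itself is simple date arithmetic. The following hedged sketch (hypothetical function names, assuming SCT timestamps in milliseconds since the Unix epoch as in RFC 6962) shows when an auditor may treat a missing inclusion proof as a broken promise:

```python
from datetime import datetime, timedelta, timezone

MAX_MERGE_DELAY = timedelta(hours=24)  # the typical policy value

def promise_deadline(sct_timestamp_ms: int) -> datetime:
    # An SCT's timestamp plus the maximum merge delay: the point in time
    # after which the log must be able to prove inclusion.
    issued = datetime.fromtimestamp(sct_timestamp_ms / 1000, tz=timezone.utc)
    return issued + MAX_MERGE_DELAY

def may_treat_as_broken(sct_timestamp_ms: int, now: datetime) -> bool:
    # Before the deadline, the certificate may legitimately still be
    # waiting in the log's merge queue; only after it can a missing
    # inclusion proof be counted as evidence of a broken promise.
    return now >= promise_deadline(sct_timestamp_ms)
```

Note that this sketch sidesteps the hard part discussed above: \emph{fetching} the inclusion proof without revealing the certificate of interest to the log.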
+
+Besides being embedded in certificates, signed certificate timestamps can also
+be delivered dynamically to end-users in TLS extensions and stapled certificate
+status responses. For example, Cloudflare uses the TLS extension delivery
+method to recover from log incidents without their customers needing to acquire
+new certificates~\cite{cloudflare-scts}. Several log incidents have already
+happened in the past, ranging from
+split-views~\cite{trustasia-err,izenpe-err,venafi-err} to broken promises of
+timely logging~\cite{wosign-err,google-err,digicert-err,starcom-err} and
+potential key compromise~\cite{digicert-kc}. These are all good \emph{scares}
+motivating continued completion of Certificate Transparency in practice.
+
+In summary, the status quo is for web browsers to require at least two signed
+certificate timestamps before accepting a certificate as valid. Merkle tree
+proofs are not verified. Gossip is not deployed. The lack of a reliable
+\emph{gossip-audit model} means that the logs are largely trusted
+parties.\footnote{%
+ Historical remark: the lack of verification led Google to require that all
+ certificates be disclosed in at least one of their logs to
+ validate~\cite{ct-history}. The so-called \emph{one-Google log requirement}
+ was recently replaced. Google Chrome instead interacts with Google's trusted
+ auditor. See Section~\ref{sec:related}.
+} We defer discussion of related work in the area of gossip-audit models until
+Section~\ref{sec:related}.
+
+\subsection{Tor}
+
+The Tor Project is a 501(c)(3) US nonprofit that advances human rights and
+defends privacy online through free software and open networks~\cite{tpo}. Some
+of the maintained and developed components include Tor Browser and Tor's relay
+software. Thousands of volunteers operate relays as part of the Tor network,
+which routes the traffic of millions of daily users with low
+latency~\cite{mani}. This frustrates attackers like Internet service providers
+that may try linking \emph{who is communicating with whom} from their local
+(non-global) vantage points~\cite{tor}.
+
+Usage of Tor involves tunneling the TCP traffic of different destinations (such
+as all flows associated with a website visit to \texttt{example.com}) in
+fixed-size \emph{cells} on independent \emph{circuits}. A circuit is built
+through a guard, a middle, and an exit relay. At each hop of the circuit, one
+layer of symmetric encryption is peeled off. The used keys are ephemeral and
+discarded together with all other circuit state after at most 10 minutes (the
+maximum circuit lifetime). This setup allows guard relays to observe users' IP
+addresses but none of the destination traffic. In contrast, exit relays can
+observe destination traffic but no user IP addresses. The relays used in a
+circuit are determined by Tor's end-user software. Such path selection
+is randomized and bandwidth-weighted but starts with a largely static guard set
+to protect users from \emph{eventually} entering the network from a relay an
+attacker volunteered to run.
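
The layering can be illustrated with a toy cipher (a SHA-256-based keystream standing in for Tor's actual AES-CTR onion encryption; the key strings and cell size below are made up for the example):

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # Toy keystream: stretch the key with SHA-256 in counter mode.
    # NOT Tor's real cryptography; for illustration only.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def layer(data: bytes, key: bytes) -> bytes:
    # XOR with the keystream; applying the same key twice peels the layer.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# The client shares one ephemeral key per hop and applies all layers;
# each relay then peels exactly one as the cell travels the circuit.
hop_keys = [b"guard-key", b"middle-key", b"exit-key"]
cell = b"GET example.com".ljust(498)  # fixed-size payload (toy cell)

onion = cell
for key in reversed(hop_keys):  # innermost layer belongs to the exit
    onion = layer(onion, key)

for key in hop_keys:            # guard, middle, and exit peel in turn
    onion = layer(onion, key)

assert onion == cell
```

Because each relay holds only its own key, the guard sees ciphertext bound for the middle relay, and only the exit recovers the payload.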
+
+Tor's \emph{consensus} lists the relays that make up the network. As the name
+suggests, it is a document agreed upon by a majority of trusted \emph{directory
+authorities}. Five votes are currently needed to reach a consensus. Examples
+of information added to the Tor consensus include tunable network parameters and
+uploaded relay descriptors with relevant metadata, e.g., public key, available
+bandwidth, and exit policy. Each relay in the consensus is also assigned
+different flags based on their configuration and observed performance, e.g.,
+\texttt{Guard}, \texttt{MiddleOnly}, \texttt{Fast}, \texttt{Stable}, and
+\texttt{HSDir}. The latter means that the relay is a \emph{hidden service
+directory}, which refers to being part of a distributed hash table that helps
+users look up \emph{onion service introduction points}.
+
+An onion service is a self-authenticated server identified by its public key.
+Onion services are only reachable through the Tor network. Users that are aware
+of a server's \emph{onion address} can consult the distributed hash table to
+find its introduction points. To establish a connection, a user builds a
+circuit to a \emph{rendezvous point}. A request is then sent to one of the
+current introduction points, which informs the onion service that it may build
+its own circuit to meet the user at their rendezvous point. In total, six
+relays are traversed while interacting with an onion service. This setup allows
+not only the sender but also the receiver to be anonymous. The receiver also
+benefits from a large degree of censorship resistance as the server location may
+be hidden. The main drawback of onion services is that their non-mnemonic names
+are hard to discover and remember. Some sites try to overcome this by setting
+their onion addresses in \emph{onion location} HTTP headers or HTML
+attributes~\cite{onion-location}.
+
+Many users connect to the Tor network with Tor Browser. In addition to
+routing traffic as described above, Tor Browser ships with privacy-preserving
+features like first-party isolation to not share any state across
+different origins, settings that frustrate browser fingerprinting, and
+\emph{disk-avoidance} to not store browsing-related history as well as other
+identifying information to disk~\cite{tb}. Tor Browser is a fork of Mozilla's
+Firefox. Unfortunately, neither Firefox nor Tor Browser supports any form of
+Certificate Transparency. Conducting undetected machine-in-the-middle attacks
+against Tor users is thus relatively straightforward: compromise or coerce the
+weakest-link certificate authority, then volunteer to operate an exit relay and
+intercept network traffic. Such interception has previously been found with
+self-signed certificates~\cite{spoiled-onions}.
+
+While global attackers are not within Tor's threat model, it is in scope to
+guard against various local attacks~\cite{tor}. For example, the intended
+attacker may passively observe a small fraction of the network and actively
+inject their own packets. Figure~\ref{fig:wf} shows the typical attacker
+setting of \emph{website fingerprinting}, where the attacker observes a user's
+entry traffic with the goal of inferring which website was visited solely based
+on analyzing encrypted
+traffic~\cite{cheng98,herrmann09,hintz02,liberatore06,panchenko11,sun02}.
+Website fingerprinting attacks are evaluated in the \emph{open-world} or
+\emph{closed-world} settings. In the closed-world setting, the attacker
+monitors (not to be confused with Certificate Transparency monitoring) a fixed
+list of websites. A user visits one of the monitored sites, and the attacker
+needs to determine which one. The open-world setting is the same as the
+closed-world setting, except that the user may also visit unmonitored sites.
+The practicality of website fingerprinting attacks is up for debate, e.g.,
+ranging from challenges handling false positives to machine-learning dataset
+drift~\cite{onlinewf,juarez14,perryCrit,realistic}.
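
Much of the false positive challenge is a base-rate problem: when monitored websites are rarely visited, even a low false positive rate swamps the true positives. A short back-of-the-envelope computation (illustrative rates, not measured values) shows the effect:

```python
def precision(tpr: float, fpr: float, base_rate: float) -> float:
    # Probability that a positive classification is correct (Bayes' rule):
    # P(monitored visit | alarm) = TP / (TP + FP).
    tp = tpr * base_rate
    fp = fpr * (1 - base_rate)
    return tp / (tp + fp)

# A classifier with a 95% true positive rate and a 1% false positive
# rate looks strong when monitored pages make up half of all visits ...
print(round(precision(0.95, 0.01, base_rate=0.5), 3))    # 0.99

# ... but when they account for only 0.5% of visits, most alarms are
# false, which is why ruling out non-visits up front is so valuable.
print(round(precision(0.95, 0.01, base_rate=0.005), 3))  # 0.323
```

This base-rate effect is one reason an attacker benefits from any side information, such as the website oracles studied in Paper~\ref{paper:cat}, that eliminates candidate visits before classification.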
+
+\begin{figure}[!t]
+ \centering\includegraphics[width=0.85\textwidth]{src/cat/img/setting}
+ \caption{
+ The setting of a website fingerprinting attack. A local passive attacker
+ analyzes a user's encrypted network traffic as it enters the network. The
+ goal is to infer which website is visited. (Figure reprinted from
+ Paper~\ref{paper:cat}.)
+ }
+ \label{fig:wf}
+\end{figure}
+
+In summary, Tor is a low-latency anonymity network often accessed with Tor
+Browser. Among the threats that Tor aims to protect against are local attackers
+that see traffic as it enters or leaves the network (but not both at the same
+time all the time). A website fingerprinting attack is an example of a passive
+attack that operates on entry traffic. A machine-in-the-middle attack is an
+example of an active attack that typically operates on exit traffic. Discussion
+of related work in the area of website fingerprinting is deferred until
+Section~\ref{sec:related}.
+
+\section{Research Questions} \label{sec:rqs}
+
+The overall research objective spans two different areas: transparency logs and
+low-latency anonymity networks. We aim to reduce trust assumptions in
+transparency log solutions and to apply such solutions in anonymous settings for
+improved security and privacy. We defined the following research questions to
+make this concrete in Certificate Transparency and Tor, the two ecosystems with
+the most history and dominant positions in their respective areas.
+
+\begin{researchquestions}
+ \item[Can trust requirements in Certificate Transparency be reduced in
+ practice?]
+
+ Transparency logs have a cryptographic foundation that supports efficient
+ verification of inclusion and consistency proofs. Such proofs are useful to
+ reduce the amount of trust that is placed in the logs. The roll-out of
+ Certificate Transparency has yet to start using these proofs or to employ a
+ gossip protocol that ensures everyone observes the same append-only logs.
+ Part of the challenge relates to the privacy concerns that arise as parties
+ interact with each other, as well as to deploying gradually without breakage.
+
+ We seek practical solutions that reduce the trust requirements currently
+ placed in the logs and third-party monitors while preserving user privacy.
+
+ \item[How can authentication of websites be improved in the context of Tor?]
+
+ Tor Browser has yet to support Certificate Transparency to facilitate
+ detection of hijacked websites. This includes HTTPS sites but also onion
+ services that may be easier to discover reliably with more transparency.
+
+ We seek incremental uses of Certificate Transparency in Tor that preserve user
+ privacy while engaging in new verification protocols to reduce trust.
+
+ \item[How do the protocols used during website visits affect
+ unlinkability between Tor users and their destination websites?]
+
+ Several third-parties become aware of a user's browsing activities while a
+ website is visited. For example, DNS resolvers and certificate status
+ responders may be consulted for domain name resolution and for verifying
+ whether a certificate has been revoked. Fetching an inclusion proof from a
+ Certificate Transparency log would reveal the same type of information.
+
+ We seek to explore how unlinkability between Tor users and their exit
+ destinations is affected by the multitude of protocols used during website
+ visits. The considered setting is the same as in website fingerprinting,
+ except that the attacker may take additional passive and active measures.
+ For example, the attacker may volunteer to run a Certificate Transparency log
+ (passive) or inject carefully-crafted packets into Tor~(active).
+\end{researchquestions}
+
+\section{Research Methods} \label{sec:methods}
+
+We tackle privacy and security problems in the field of computer
+science~\cite{icss,smics}. Our work is applied, following the scientific method
+for security and experimental networking research. \emph{Exactly} what it means
+to use the scientific method in these areas is up for debate~\cite{rfenr,sse}.
+However, at a glance, it is about forming precise and consistent theories with
+falsifiable predictions \emph{as in other sciences} except that the objects of
+study are \emph{information systems in the real world}.
+
+A prerequisite to formulating precise, consistent, and falsifiable theories is
+that there are few implicit assumptions. Therefore, scientific security
+research should be accompanied by definitions of security goals and attacker
+capabilities: what does it mean that the system is secure, and what is the
+attacker (not) allowed to do while attacking it~\cite{secdefs}? Being explicit about the
+overall \emph{setting} and \emph{threat model} is prevalent in formal security
+work like cryptography, where an abstract (mathematical) model is used to show
+that security can be \emph{proved} by reducing to a computationally hard problem
+(like integer factorization) or a property of some primitive (like the collision
+resistance of a hash function)~\cite{provsec}. It is nevertheless just as crucial in less
+formal work that deals with security of systems in the real (natural)
+world---the exclusive setting of the scientific method---which usually lends
+itself towards break-and-fix cycles in light of new observations. Where to draw
+the line between \emph{security work} and \emph{security research} is not
+trivial. However, a few common \emph{failures} of past ``security research''
+include not bringing observations in contact with theory, not making claims and
+assumptions explicit, or simply relying on unfalsifiable claims~\cite{sse}.
+
+While deductive approaches (like formal reduction proofs) are instrumental in
+managing complexity and gaining confidence in different models, these approaches
+alone are not sufficient, as a model's \emph{instantiation} must also be secure~\cite{secdefs}.
+It is common to complement abstract modeling with \emph{real-world measurements}
+as well as \emph{systems prototyping and evaluations}~\cite{rfenr}. Real-world
+measurements measure properties of deployed systems like the Internet, the web,
+and the Tor network. For example, a hypothesis in a real-world measurement
+could be that (non-)Tor users browse according to the same website popularity
+distribution. Sometimes these measurements involve the use of research
+prototypes, or the research prototypes themselves become the objects of study to
+investigate properties of selected system parts (say, whether a packet processor
+with new features is indistinguishable from some baseline as active network
+attackers adaptively inject packets of their choosing). If it is infeasible,
+expensive, or unsafe (see below) to study a real-world system, a simulation may
+be studied instead. The downside of simulation is that the model used may not
+be a good approximation of the natural world, similar to formal
+cryptographic~modeling.
+
+The appended papers use all of the above approaches to make claims about
+security, privacy, and performance in different systems, sometimes with regard
+to an abstract model that can be used as a foundation in the natural world to
+manage complexity. Paper~\ref{paper:lwm} contains a reduction proof sketch to
+show reliance on standard cryptographic assumptions. Paper~\ref{paper:cat}
+extends past simulation setups to show the impact of an added attacker
+capability. Meanwhile, Paper~\ref{paper:ctor} models part of the Tor network
+with mathematical formulas to estimate performance overhead. All but
+Paper~\ref{paper:sauteed} contain real-world measurements relating to Internet
+infrastructure, websites, certificates, Tor, or practical deployability of our
+proposals. All but Paper~\ref{paper:ctor} contain research prototypes with
+associated evaluations, e.g., performance profiling, as well as corroborating or
+refuting our security definitions in experimental settings. All papers include
+discussions of security and privacy properties as well as their limitations and
+strengths in the chosen settings (where assumptions are explicit and threat
+models motivated).
+
+Throughout our experiments, we strove to follow best practices like documenting
+the used setups, making datasets and associated tooling available, reducing
+potential biases by performing repeated measurements from multiple different
+vantage points, and discussing potential biases (or lack thereof)~\cite{rfenr}.
+We also interacted with Tor's research safety board~\cite{trsb} to discuss the
+ethics and safety of our measurements in Paper~\ref{paper:tlwo}, and refrained
+from measuring real (i.e., non-synthetic) usage of Tor whenever possible
+(Papers~\ref{paper:ctor} and~\ref{paper:cat}). Finally, the uncovered bugs and
+vulnerabilities in Papers~\ref{paper:cat}--\ref{paper:tlwo} were responsibly
+disclosed to the Tor project. This included suggestions on how to move forward.
+
+\section{Contributions} \label{sec:contribs}
+
+The main contributions of this thesis are listed below. An overview of how they
+relate to our research questions and appended papers is shown in
+Figure~\ref{fig:contrib}.
+
+\begin{figure}[!t]
+ \centering
+ \includegraphics[width=0.83\textwidth]{src/introduction/img/contribs}
+ \caption{%
+ Overview of appended papers, contributions, and research questions.
+ }
+ \label{fig:contrib}
+\end{figure}
+
+\vspace{1cm}
+\begin{contributions}
+ \item[Reduced trust in third-party monitoring with a signed tree head
+ extension that shifts trust from non-cryptographic certificate notifications
+ to a log's gossip-audit model (or if such a model does not exist yet, the
+ logs themselves).]
+
+ Paper~\ref{paper:lwm} applies existing cryptographic techniques for
+ constructing static and lexicographically ordered Merkle trees so that
+ certificates can be wild-card filtered on subject alternative names with
+ (non-)membership proofs. This building block is evaluated in the context of
+ Certificate Transparency, including a security sketch and performance
+ benchmarks.
+
+ \item[Increased probability of split-view detection by proposing gossip
+ protocols that disseminate signed tree heads without bidirectional
+ communication.]
+
+ Paper~\ref{paper:ctga} explores aggregation of signed tree heads at line
+ speed in programmable packet processors, facilitating consistency proof
+ verification on the level of an entire autonomous system. Because such
+ verification is indistinguishable from that of an autonomous system without
+ any split-view detection, it achieves herd immunity, i.e., protection even
+ without aggregation. Aggregation at 32 autonomous systems can protect
+ 30--50\% of the IPv4 space.
+ Paper~\ref{paper:ctor} explores signed tree heads in Tor's consensus. To
+ reliably perform an undetected split-view against log clients that have Tor
+ in their trust root, a log must collude with a majority of directory
+ authorities.
+
+ \item[Improved detectability of website hijacks targeting Tor Browser by
+ proposing privacy-preserving and gradual roll-outs of Certificate
+ Transparency in Tor.]
+
+ Paper~\ref{paper:ctor} explores adoption of Certificate Transparency in Tor
+ Browser with signed certificate timestamps as a starting point, then
+ leveraging the decentralized network of relays to cross-log certificates
+ before ultimately verifying inclusion proofs against a single view in Tor's
+ consensus. The design is probabilistically secure with tunable parameters
+ that result in modest overheads. Paper~\ref{paper:sauteed} shows that
+ Certificate Transparency logging of domain names with associated onion
+ addresses helps provide forward censorship-resistance and detection of
+ unwanted onion associations.
+
+ \item[An extension of the attacker model for website fingerprinting that
+ provides attackers with the capability of querying a website oracle.]
+
+ A website oracle reveals whether a monitored website was (not) visited by
+ any network user during a specific time frame. Paper~\ref{paper:cat}
+ defines and simulates website fingerprinting attacks with website oracles,
+ showing that most false positives can be eliminated for all but the most
+ frequently visited websites. A dozen sources of real-world website oracles
+ follow from the protocols used during website visits. We enumerate and
+ classify those sources based on ease of accessibility, reliability, and
+ coverage. The overall analysis includes several Internet measurements.
+
+ \item[Remotely exploitable probing attacks on Tor's DNS cache that instantiate
+ a real-world website oracle without any special attacker capabilities or reach.]
+
+ Paper~\ref{paper:cat} shows that timing differences in end-to-end response
+ times can be measured to determine whether a domain name is (not) cached by
+ a Tor relay. An estimated true positive rate of 17.3\% can be achieved
+ while trying to minimize false positives. Paper~\ref{paper:tlwo} improves
+ the attack by exploiting timeless timing differences that depend on
+ concurrent processing. The improved attack has no false positives or false
+ negatives. Our proposed bug fixes and mitigations have been merged in Tor.
+
+ \item[A complete redesign of Tor's DNS cache that defends against all (timeless)
+ timing attacks while retaining or improving performance compared~to~today.]
+
  Paper~\ref{paper:tlwo} suggests that Tor's DNS cache should share only
  preloaded domain names across different circuits, removing the remotely
  probeable state that reveals information about past exit traffic.  A
+ network measurement with real-world Tor relays shows which popularity lists
+ are good approximations of Tor usage and, thus, appropriate to preload.
+ Cache-hit ratios can be retained or improved compared to today's Tor.
+\end{contributions}
+
+\section{Summary of Appended Papers} \label{sec:appended}
+
+The appended papers and their contexts are summarized below. Notably, all
+papers are in publication-date order except that Paper~\ref{paper:cat} predates
+Papers~\ref{paper:ctor}--\ref{paper:sauteed}.
+
+{
+ \hypersetup{linkcolor=black}
+ \listofsummaries
+}
+
+\section{Related Work} \label{sec:related}
+
This section positions the appended papers in relation to related work.  For
Certificate Transparency, this includes approaches to signed certificate
timestamp verification, gossip, and the problem of monitoring the logs.  For
Tor, the related work focuses on the practicality of website fingerprinting
attacks and prior use of side-channels (such as timing attacks).
+
+\subsection{Certificate Transparency Verification}
+
+Approaches that fetch inclusion proofs have in common that they should preserve
+privacy by not revealing the link between users and visited websites.
+Eskandarian~\emph{et~al.}\ mention that Tor could be used to overcome privacy
+concerns; however, it comes at the cost of added infrastructure
+requirements~\cite{eskandarian}. Lueks and Goldberg~\cite{lueks} and
+Kales~\emph{et~al.}~\cite{kales} suggest that logs could provide inclusion
+proofs using multi-server private information retrieval. This requires a
+non-collusion assumption while also adding significant overhead.
Laurie suggests that users can fetch inclusion proofs via DNS as their resolvers
have already learned the destination sites~\cite{ct-over-dns}.  While surveying
+signed certificate timestamp auditing,
+Meiklejohn~\emph{et~al.}\ point out that Certificate Transparency over DNS may
have privacy limitations~\cite{sok-sct-auditing}.  For example, the times of
domain lookups and inclusion proof queries are detached.
+Paper~\ref{paper:ctga} uses Laurie's approach as a premise while
+proposing a gossip protocol.
+Paper~\ref{paper:ctor} applies Certificate Transparency in a context
+where Tor is not additional infrastructure~(Tor~Browser).
+
+Several proposals try to avoid inclusion proof fetching altogether.
+Dirksen~\emph{et~al.}\ suggest that all logs could be involved in the issuance of
+a signed certificate timestamp~\cite{dirksen}. This makes it harder to violate
+maximum merge delays without detection but involves relatively large changes to
log deployments.  Nordberg~\emph{et~al.}\ suggest that signed certificate
timestamps can be handed back to the origin web servers on subsequent
revisits~\cite{nordberg}, which has the downside that detection only works if
machine-in-the-middle attacks eventually cease.
+Nordberg~\emph{et~al.}~\cite{nordberg} as well as Chase and
+Meiklejohn~\cite{chase} suggest that some clients/users could collaborate with a
+trusted auditor. Stark and Thompson describe how users can opt-in to use Google
+as a trusted auditor~\cite{opt-in-sct-auditing}. This approach was recently
+replaced by opt-out auditing that cross-checks a fraction of signed
+certificate timestamps with Google using
k-anonymity~\cite{opt-out-sct-auditing}.  Henzinger~\emph{et~al.}\ show how such
k-anonymity can be replaced with a single-server private information retrieval
setup that approaches the performance of prior multi-server
solutions~\cite{henzinger}.  Neither of the latter two proposals provides a
solution for privately reporting that a log may have violated its maximum merge
delay because the trusted auditor is assumed to know about all signed
certificate timestamps.  Eskandarian~\emph{et~al.}\ show how to prove that a log omitted a
+certificate privately~\cite{eskandarian}. However, they use an invalid
+assumption about today's logs being in strict timestamp
+order~\cite{sok-sct-auditing}. Paper~\ref{paper:ctor} suggests that Tor Browser
+could submit a fraction of signed certificate timestamps to randomly selected
+Tor relays. These relays perform further auditing on Tor Browser's behalf: much
+like a trusted auditor, except that no single entity is running it.
+
+Merkle trees fix log content---not promises of logging. Therefore,
+inclusion proof fetching by users or their trusted parties must be accompanied
+by consistency verification and gossip to get a complete gossip-audit
+model~\cite{rfc6962}. Chuat~\emph{et~al.}\ suggest that users and web servers
+can pool signed tree heads, gossiping about them as they interact~\cite{chuat}.
+Nordberg~\emph{et~al.}\ similarly suggest that users can pollinate signed tree
+heads as they visit different web servers~\cite{nordberg}. Hof and Carle
+suggest that signed tree heads could be cross-logged to make all logs
intertwined~\cite{hof}.  Gunn~\emph{et~al.}\ suggest multi-path fetching of
signed tree heads~\cite{gunn}, which may make persistent split-views hard
depending on the paths used.  Syta~\emph{et~al.}\ suggest that independent
+witnesses could cosign the logs using threshold signatures~\cite{syta}.
+Smaller-scale versions of witness cosigning received attention in
+industry~\cite{sigsum-witness,trustfabric-arxiv}, and generally in other types
+of transparency logs as well~\cite{parakeet}. Larger browser vendors could
+decide to push the same signed tree heads to their users, as proposed by Sleevi
+and Messeri~\cite{sth-push}. Paper~\ref{paper:ctga} uses the operators of
+network vantage points for aggregating and verifying signed tree heads to
provide their users with gossip-as-a-service, albeit assuming plaintext
+DNS traffic and a sound signed tree head frequency as defined by
+Nordberg~\emph{et~al.}~\cite{nordberg}. We used the multi-path assumptions of
+Gunn~\emph{et~al.}\ to break out of local vantage points. In contrast,
+Paper~\ref{paper:ctor} ensures that the same logs are observed in the Tor
+network by incorporating signed tree heads into Tor's consensus (thus making
+directory authorities into witnesses).
+
+Li~\emph{et~al.}\ argue that it would be too costly for most domains to run a
+monitor~\cite{li}.\footnote{%
+ Whether the third-party monitors in this study misbehaved or not can be
+ questioned~\cite{ayer-on-li}.
} Similar arguments have been raised before, and led to alternative data
structures that could make monitoring more efficient than it is
today~\cite{vds,coniks,tomescu}.  Paper~\ref{paper:lwm} falls into this
+category, as the root of an additional static lexicographically-ordered Merkle
+tree is added to a log's signed tree heads to encode batches of included
+certificates. The downside is that a non-deployed signed tree head extension is
assumed~\cite{rfc9162}, as well as a tree head frequency similar to those
described by Nordberg
\emph{et~al.}~\cite{nordberg}~to~get~efficiency~in~practice.
+
+Paper~\ref{paper:sauteed} uses a Mozilla Firefox web extension to verify
+embedded signed certificate timestamps in Tor Browser. Such verification is
+similar to the gradual deployments of Certificate Transparency in other
+browsers~\cite{ct-history,does-ct-break-the-web}, and the starting point to
+improve upon in Papers~\ref{paper:ctga}--\ref{paper:ctor}. Moreover, the use of
+Certificate Transparency to associate human-meaningful domain names with
+non-mnemonic onion addresses (as in Paper~\ref{paper:sauteed}) is one of many
+proposals for alternative naming systems and onion search
+solutions~\cite{kadianakis,muffet-onions,nurmi,onion-location,h-e-securedrop,onio-ns}.
+
+\subsection{Website Fingerprinting and Side-Channels}
+
+Several researchers outline how past website fingerprinting attacks have been
+evaluated in unrealistic conditions~\cite{juarez14,perryCrit,realistic}.
This includes not accounting for the size of the open-world setting, failing to
keep false positive rates low enough to be useful, assuming that homepages are
browsed one at a time, neglecting dataset drift, and training classifiers on
synthetic network traces.  While some of these challenges were
+addressed~\cite{onlinewf,realistic}, the question of how to deal with false
+positives remains open. Papers~\ref{paper:cat}--\ref{paper:tlwo} make a
+significant dent in this problem by providing evidence that the website
+fingerprinting attacker model could be made \emph{stronger} to capture
+\emph{realistic real-world capabilities} that eliminate most false positives
+around Alexa top-10k and the long tail~of~unpopular~sites.
+
+Others have evaluated traffic analysis attacks against Tor beyond the website
+fingerprinting setting. On one side of the spectrum are end-to-end
+correlation/confirmation attacks that typically consider a global passive
+attacker that observes all network
+traffic~\cite{johnson13,nasr18,oh22,rimmer22}. Such strong attackers are not
+within the scope of Tor~\cite{tor}. On the other side of the spectrum are local
+attackers that see a small fraction of the network, typically in a position to
+observe a user's encrypted entry traffic (Figure~\ref{fig:wf}). Many have
+studied those \emph{weak attacks} in lab settings where, e.g., advances in deep
+learning improved the accuracy significantly~\cite{wfdef,tiktok,df}. Others
+have focused on improved attacks that are \emph{active} in the Tor network from
+their own local vantage points~\cite{chakravarty10,mittal11,murdoch05}, which is
+similar to the techniques in Papers~\ref{paper:cat}--\ref{paper:tlwo}.
Greschbach~\emph{et~al.}\ show that an attacker who gains access to (or traffic
to~\cite{siby20}) commonly used DNS resolvers like Google's \texttt{8.8.8.8} gets
valuable information to improve both end-to-end correlation and website
fingerprinting attacks~\cite{greschbach}.  Paper~\ref{paper:cat} generalizes the
+attacker capability they uncovered by allowing the attacker to query Tor's
+receiver anonymity set with a website oracle of time-frame~$t$. It is further
shown that it is possible to instantiate such an abstraction in the real world
while \emph{staying within Tor's threat model}.  In other words, the attacker is
+still local but may employ passive and active measures to narrow down the
+receiver anonymity set. Paper~\ref{paper:ctor} proposes Certificate
+Transparency verification that gives log operators website oracle access.
+Tor's directory authorities tune $t$.
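The intuition for why a website oracle suppresses false positives can be shown
with a back-of-the-envelope Bayesian calculation.  The sketch below is
illustrative only: the classifier rates and base rates are hypothetical numbers,
not measurements from Paper~\ref{paper:cat}.

```python
# Illustrative base-rate calculation: how a website oracle suppresses
# false positives in website fingerprinting.  All numbers are hypothetical.

def precision(tpr, fpr, base_rate):
    """P(visit | classifier flags) via Bayes' theorem."""
    tp = tpr * base_rate
    fp = fpr * (1.0 - base_rate)
    return tp / (tp + fp)

tpr, fpr = 0.95, 0.005          # a strong classifier in lab settings
base_rate = 1e-4                # unpopular site: rarely visited

p_no_oracle = precision(tpr, fpr, base_rate)

# An oracle answering "was this site visited by anyone during time frame t?"
# lets the attacker discard flags the oracle rules out, which effectively
# raises the base rate among the remaining candidates.
base_rate_with_oracle = 0.05    # hypothetical post-oracle base rate
p_oracle = precision(tpr, fpr, base_rate_with_oracle)

print(f"precision without oracle: {p_no_oracle:.3f}")
print(f"precision with oracle:    {p_oracle:.3f}")
```

Even a classifier with a 0.5\% false positive rate is almost useless against
rarely visited sites without the oracle, while the oracle-assisted precision
approaches usable levels.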
+
+Website oracles exist because Tor is designed for anonymity---not unobservable
communication~\cite{anonterm}.  The instantiation of a real-world website oracle
is either a direct result of observing network flows from the protocols
used during website visits, or of the state of these network flows being stored
and inferable.  Inferring secret system state is widely studied in applied
+cryptography and hardware
+architecture~\cite{lucky13,ge18,kocher96,mart21,tsunoo03,heist}, where the goal
+is usually to determine a key, decrypt a ciphertext, forge a message, or similar
+using side-channels. A side-channel can be local or remote and ranges from
+analysis of power consumption to cache states and timing differences. There is
+a long history of remote timing attacks that are
+practical~\cite{bbrumley11,dbrumley03,crosby09,wang22}. A recent improvement in
+this area that is relevant for Tor is timeless timing attacks, which exploit
+concurrency and message reordering to eliminate network jitter~\cite{timeless}.
+Paper~\ref{paper:cat} demonstrates a remote timing attack against Tor's DNS
+cache that achieves up to 17.3\% true positive rates while minimizing false
+positives. Paper~\ref{paper:tlwo} instead uses a remote timeless timing attack
+with no false positives, no false negatives, and a small time-frame $t$. This
approaches an ideal website oracle without special attacker capabilities or
reach into third parties.
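The principle behind these cache-probing attacks can be sketched in a few lines:
a cache hit skips upstream resolution, so the lookup returns measurably faster.
The simulation below is a minimal local model with hypothetical names, delay,
and threshold; the real attacks measure end-to-end response times over Tor
circuits, and Paper~\ref{paper:tlwo} removes the timing threshold altogether by
exploiting concurrency.

```python
# Minimal sketch of the cache-probing principle behind the timing attack:
# a cache hit skips resolution, so the response returns measurably faster.
# All names, the delay, and the threshold are hypothetical; this is a local
# simulation, not a probe of an actual Tor exit relay.
import time

RESOLVE_DELAY = 0.2             # simulated upstream DNS resolution time
THRESHOLD = 0.1                 # hit/miss classification cutoff (hypothetical)
cache = {}

def resolve(domain):
    """Simulated exit-relay DNS cache: a miss pays the resolution delay."""
    if domain not in cache:
        time.sleep(RESOLVE_DELAY)
        cache[domain] = "198.51.100.7"   # placeholder address
    return cache[domain]

def probe(domain):
    """Time one lookup and classify: True means 'was already cached'."""
    start = time.monotonic()
    resolve(domain)
    return time.monotonic() - start < THRESHOLD

first = probe("example.com")    # miss: the probe itself populates the cache
second = probe("example.com")   # hit: reveals that a lookup happened before
print(first, second)            # False True
```

Note that the first probe is destructive: it caches the domain, so the attacker
learns whether \emph{someone else} looked it up only on the initial probe.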
+
+\section{Conclusions and Future Work} \label{sec:concl}
+
+Throughout the thesis, we contributed to the understanding of how trust
+requirements in Certificate Transparency can be reduced. Efficient and reliable
+monitoring of the logs is easily overlooked. If the broader ecosystem achieves
monitoring through third parties, they should be subject to the same scrutiny as
+logs. We proposed a solution that makes it hard for third-party monitors to
+provide subscribers with selective certificate notifications. We also proposed
+a gossip-audit model that plugs into interacting with the logs over DNS by
+having programmable packet processors verify that the same append-only logs are
+observed. Avoiding the addition of complicated verification logic into end-user
+software is likely a long-term win because it reduces the number of moving
+parts. In other words, simple gossip-audit models will be much easier to deploy
+in the wide variety of end-user software that embeds TLS clients.
+
+We also contributed to the understanding of how Certificate Transparency can be
+applied in the context of Tor Browser. Compared to a regular browser, this
+results in a different setting with its own challenges and opportunities. On
+the one hand, Tor Browser benefits from the ability to preserve privacy due to
+using the anonymity network Tor. On the other hand, data relating to
+website visits cannot be persisted to disk (such as signed certificate
+timestamps blocked by maximum merge delays). Our incrementally-deployable
+proposal keeps the logic in Tor Browser simple by offloading all Certificate
+Transparency verification to randomly selected Tor relays. The design is
+complete because mis-issued certificates can eventually reach a trusted auditor
+who acts on incidents. In addition to proposing Certificate Transparency in Tor
+Browser, we also explored how certificates with onion addresses may improve the
+association of domain names with onion addresses. Such certificates ensure
domain owners know which onion addresses can be discovered for their sites,
much like Certificate Transparency does for public TLS keys.  This
+also adds censorship resistance to the discovery as logs are append-only.
+
+As part of exploring Certificate Transparency in Tor Browser, we further
+contributed to the understanding of how the protocols used during website visits
+affect unlinkability between Tor users and their destination websites. For
+example, fetching an inclusion proof from a Certificate Transparency log is one
+such protocol. We extended the attacker model of website fingerprinting attacks
+with website oracles that reveal whether any network user visited a website
+during a specific time frame. Our results show that website oracles eliminate
+most false positives for all but the most frequently visited websites. In
+addition to the theoretical evaluation of the extended attacker model, we could
+exploit (timeless) timing attacks in Tor's DNS cache to instantiate real-world
website oracles without any special capabilities or reach into third parties.
+This led us to contribute to the understanding of how Tor's DNS cache performs
+today, including a proposal for a performant alternative that preloads the same
+popular domains on all Tor relays to withstand all (timeless) timing attacks.
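The defense intuition can be sketched as follows: only an allowlist of
preloaded domains lives in a cache shared across circuits, and since those
entries exist regardless of user traffic, probing them reveals nothing; all
other lookups are cached per circuit.  The class below is an illustrative
model with hypothetical names and structure, not Tor's implementation.

```python
# Sketch of the preloaded-DNS-cache idea: an allowlist of popular domains
# is shared across circuits, while every other lookup is cached per
# circuit, so probing from one circuit reveals nothing about others.
# Structure and names are illustrative, not Tor's actual code.

class PreloadedDnsCache:
    def __init__(self, preload_domains):
        # Shared entries exist from startup for every preloaded domain,
        # so their cache state is independent of any user's traffic.
        self.shared = {d: None for d in preload_domains}
        self.per_circuit = {}   # circuit id -> private cache

    def lookup(self, circuit, domain, resolver):
        if domain in self.shared:
            if self.shared[domain] is None:
                self.shared[domain] = resolver(domain)
            return self.shared[domain]
        private = self.per_circuit.setdefault(circuit, {})
        if domain not in private:
            private[domain] = resolver(domain)   # never shared
        return private[domain]

cache = PreloadedDnsCache({"popular.example"})
resolver = lambda domain: "192.0.2.1"            # stub resolver
cache.lookup(circuit=1, domain="secret.example", resolver=resolver)
# A probe from another circuit cannot observe circuit 1's lookup:
assert "secret.example" not in cache.per_circuit.get(2, {})
assert "secret.example" not in cache.shared
```

The trade-off is that cache hits across circuits only occur for the preloaded
allowlist, which is why choosing a popularity list that approximates Tor usage
matters for retaining today's cache-hit ratios.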
+
+As an outlook, our angle on Certificate Transparency verification has mostly
+been \emph{reactive} for end-users. In other words, some or all certificate
+verification occurs asynchronously after a website visit. An alternative to
+this would be upfront delivery of inclusion proofs that reconstruct tree heads
+which witnesses cosigned; a form of \emph{proactive} gossip as proposed by
+Syta~\emph{et al.}~\cite{syta}. The significant upside is that the browser's
+verification could become non-interactive, eliminating privacy concerns and
+ensuring end-users only see certificates merged into the append-only logs.
+Investigating what the blockers for such a proposal are in
+practice---today---would be valuable as log verification quickly becomes
+complicated with signed certificate timestamps and reactive gossip-audit models.
+Are these blockers significant? Are they significant over time as other
+\emph{eventual} changes will be needed, like post-quantum secure certificates?
+New transparency log applications are unlikely to need the complexity of
+Certificate Transparency, and should likely not copy something that was designed
+to fit into an existing system with a large amount of legacy (such as
+certificate authorities, their established processes for certificate issuance,
+and the many client-server implementations already deployed on the Internet).
+
+Orthogonal to the verification performed by end-users, contributing to the
+understanding of how domains (fail to) use Certificate Transparency for
+detecting mis-issued certificates is largely unexplored. For example,
+subscribing to email notifications of newly issued certificates becomes less
+useful in an era where certificates are renewed frequently and automatically.
+Instead, domain owners need easy-to-use solutions that raise alarms only if
+there is a problem.
+
+Finally, the mitigation deployed to counter our (timeless) timing attacks in
+Tor's DNS cache is just that: a mitigation, not a defense, that applies to
+modestly popular websites but not the long tail where the base rate is low.
+This is because the attacker's needed website oracle time frame is so large that
+a fuzzy time-to-live value does nothing. Practical aspects of a preloaded DNS
+cache need to be explored further before deployment, such as the assumption of a
third party that visits popular domains to assemble an allowlist.  We may also
have \emph{underestimated} the utility of the existing Umbrella list, which in
and of itself does not require any new third party.  Does the use of Umbrella
impact page-load latency?  Latency is the most crucial parameter to minimize.
The question is whether skipping the website-visit step, as for the
non-extended Alexa and Tranco lists, misses frequently looked-up domains.
+
+More broadly, the question of how to strike a balance between \emph{efficiency}
+and \emph{effectiveness} of website fingerprinting defenses is open. How much
+overhead in terms of added latency and/or bandwidth is needed? How much of that
+overhead is sustainable, both from a user perspective (where, e.g., latency is
+crucial for web browsing and other interactive activities) and a network health
+perspective (such as the amount of volunteered relay bandwidth that is wasted)? It is
+paramount to neither overestimate nor underestimate attacker capabilities, which
+goes back to the still-debated threat model of website fingerprinting attacks.
Regardless of whether Tor's DNS cache becomes preloaded or not, it will be difficult
+to circumvent DNS lookups from happening. Someone---be it a weak attacker like
+ourselves or a recursive DNS resolver at an Internet service provider---is in a
+position to narrow down the destination anonymity set. This is especially true
+when also considering other protocols that reveal information about the
+destination anonymity set during website visits. Accepting that sources of
+real-world website oracles are prevalent implies that \emph{the world can be
+closed}. Therefore, a closed world is more realistic than an open world.
+
+\subsection*{Acknowledgments}
+I received valuable feedback while writing the introductory summary from
  Simone Fischer-H\"{u}bner,
+ Johan Garcia,
+ Stefan Lindskog, and
+ Tobias Pulls.
+The final draft was further improved with helpful nudges from Grammarly.
+
+\bibliographystyle{plain}
+\bibliography{src/introduction/refs}