Computer Science

Permanent URI for this collection

This is the collection for the University of Waterloo's Cheriton School of Computer Science.

Research outputs are organized by type (eg. Master Thesis, Article, Conference Paper).

Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.

Browse

Recent Submissions

Now showing 1 - 20 of 1591
  • Item
    Intermittent Interaction with Fabrication Devices
    (University of Waterloo, 2024-09-12) Wall, Ludwig Wilhelm
    Personal fabrication devices such as fused deposition modeling 3D printers are fairly slow, often produce lots of waste, and a fully automated fabrication process can take away the cognitive value of crafting something by hand. Interactive fabrication is instead often limited in scope or fidelity of the fabricated items as it requires constant user guidance. However, intermittent user interactions during the fabrication process are sufficient to alleviate downsides of the fabrication process, and direct user involvement can provide a similar personal connection as crafting does. This thesis explores the space of intermittent user interactions in between automated and interactive fabrication. An initial investigation enables physical user interactions during ongoing fabrication processes to reduce internal support material waste. A study with fabrication experts reveals sources of waste and a desire to intervene to take sustainable action. To address external support material as an additional source of waste we investigate more complex or more frequent user interactions. A technical evaluation quantifies the trade-off between increased user involvement and material savings. We investigate frequent interaction during fabrication by enabling sewing on unmodified 3D printers. Our technical evaluation situates the design space of fabricated seams next to traditional stitch patterns. We then examine the spectrum of user involvement during fabrication as a whole through parametric partial assembly tasks. The qualitative analysis of our survey reveal the perceived value of intermittent interaction, scheduling preferences, and design guidelines how systems can offer effective manual interaction options.
  • Item
    Towards Standardized Evaluation in Differentially Private Image Classification: A Critical Approach
    (University of Waterloo, 2024-09-10) Kodeiri, Sara
    This thesis critically examines differentially private machine learning (DPML) in image classification, addressing recent critiques about the effectiveness of current techniques and the validity of existing benchmarks. We focus on three key questions: the accuracy of current benchmarks in measuring DPML progress, the impact of public pre-training datasets on DPML model performance, and strategies for unifying future research efforts. Our study introduces standardized benchmark datasets and evaluation settings using two medical image datasets as private data sources. We assess various DPML methods across different scenarios, including those with no public data, training from scratch, and fine-tuning approaches. The main contributions include: proposing standardized benchmark datasets and evaluation settings, conducting a validation study of previous DPML techniques, and introducing a moderated public leaderboard to track progress in DPML. This research aims to provide a comprehensive assessment of DPML in image classification, offering insights into existing methods and suggesting future research directions in machine learning and privacy.
  • Item
    Asymmetric Clustering in Federated Continual Learning
    (University of Waterloo, 2024-09-09) Zhang, Zehao
    Asymmetric clustering represents a critical yet under-explored challenge in Clustered Federated Learning (CFL). Existing methods often compromise data utilization or model accuracy by either separating devices with different data quality into distinct clusters or merging all devices into a single cluster. The need for asymmetric clustering arises in practical scenarios where not all devices contribute equally due to varying data quality or quantity. For example, in healthcare, devices at a research hospital might generate high-quality medical imaging data compared to a small clinic. Asymmetric clustering allows high-quality data sources to enhance the learning of models on devices with lower-quality data without the need for reciprocity, which is crucial in such imbalanced environments. We introduce a novel federated learning technique that enables selective contributions from some devices to others' model training without requiring equal give-and-take. Crucially, our approach excels in the Federated Continual Learning (FCL) setting by addressing temporal heterogeneity and concept drift through its ensemble features. Through detailed empirical evaluations, we validate that our approach not only efficiently generates high-quality asymmetric clustering but also significantly enhances performance in continual learning settings. This adaptability makes it highly suitable for real-world applications where data distributions are not static but evolve over time.
  • Item
    Policy Learning under Uncertainty and Risk
    (University of Waterloo, 2024-08-30) Luo, Yudong
    Recent years have seen a rapid growth of reinforcement learning (RL) research. In year 2015, deep RL achieved superhuman performance in Atari video games. In year 2016, the Alpha Go developed by Google DeepMind beat Lee Sedol, one of the top Go players in South Korea. In year 2022, OpenAI released ChatGPT 3.5, a powerful large language model, which is fine-tuned by RL algorithms. Traditional RL considers the problem that an agent interacts with an environment to acquire a good policy. The performance of the policy is usually evaluated by the expected value of total discounted rewards (or called return) collected in the environment. However, the mostly studied domains (including the three mentioned above) are usually deterministic or contain less randomness. In many real world applications, the domains are highly stochastic, thus agents need to perform decision making under uncertainty. Due to the randomness of the environment, another natural consideration is to minimize the risk, since only maximizing the expected return may not be sufficient. For instance, we want to avoid huge financial loss in portfolio management, which motivates the mean variance trade off. In this thesis, we focus on the problem of policy learning under uncertainty and risk. This requires the agent to quantify the intrinsic uncertainty of the environment and be risk-averse in specific cases, instead of only caring for the mean of the return. To quantify the intrinsic uncertainty, in this thesis, we stick to the distributional RL method. Due to the stochasticity of the environment dynamic and also stochastic polices, the future return that an agent can get at a state is naturally a random variable. Distributional RL aims to learn the full value distribution of this random variable. Usually, the value distribution is represented by its quantile function. However, the quantile functions learned by existing algorithms suffer from limited representation ability or quantile crossing issue, which is shown to hinder policy learning and exploration. We propose a new learning algorithm to directly learn a monotonic, smooth, and continuous quantile representation, which provides much flexibility for value distribution learning in distributional RL. For risk-averse policy learning, we study two common types of risk measure, i.e., measure of variability, e.g., variance, and tail risk measure, e.g., conditional value at risk (CVaR). 1) Mean variance trade off is a classic yet popular problem in RL. Traditional methods directly restrict the total return variance. Recent methods restrict the per-step reward variance as a proxy. We thoroughly examine the limitations of these variance-based methods in the policy gradient approach, and propose to use an alternative measure of variability, Gini deviation, as a substitute. We study various properties of this new risk measure and derive a policy gradient algorithm to minimize it. 2) CVaR is another popular risk measure for risk-averse RL. However, RL algorithms utilizing policy gradients to optimize CVaR face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we start from an insight that in many scenarios, the risk-averse behavior is only required in a subset of states, and propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus the sample efficiency is significantly improved.
  • Item
    SCALING PRE-TRAINING DATA & LANGUAGE MODELS FOR AFRICAN LANGUAGES
    (University of Waterloo, 2024-08-23) Oladipo, Akintunde
    Recent advancements in language models, particularly for high-resource languages, have not been paralleled in low-resource languages spoken across Africa. This thesis addresses this gap by scaling pre-training data and developing improved language models for African languages. We introduce Wura, a high-quality, document-level pre-training dataset encompassing 16 African languages along with four high-resource languages commonly spoken on the continent: Arabic, English, French, and Portuguese. Leveraging Wura, we pre-train new versions of the AfriBERTa (encoder-only) and AfriTeVa (encoder-decoder) model families. These new models demonstrate superior performance across a variety of natural language understanding and generation tasks compared to existing baselines. Notably, AfriTeVa V2 Large (1B) stands as the largest sequence-to-sequence model pre-trained for African languages to date. Our methodology includes a meticulous three-stage curation process for Wura --- auditing and filtering existing web crawls, initiating new web crawls, and integrating existing language resources. The experimental setup and evaluation encompass tasks like text classification, information retrieval, translation, summarization, and cross-lingual question answering. Our new models outperform their predecessors and other established models, even those with significantly more parameters, highlighting the efficacy of high-quality pre-training data. Furthermore, we study the generalization of our models to languages not deliberately included in their pre-training data.
  • Item
    Efficient Machine Learning Methods for Solving Hamilton-Jacobi-Bellman Equations in Finance
    (University of Waterloo, 2024-08-23) Na, Andrew
    Recent developments in machine learning have allowed for the solution of high-dimensional partial differential equations overcoming the curse of dimensionality. In our work, we utilize deep neural networks to our advantage to solve Hamilton-Jacobi-Bellman (HJB) equations that arise in financial applications efficiently. Two prevalent methods are commonly used to solve high-dimensional HJB equations. One method uses physics-based neural networks to approximate the unknown solution while regularizing the solution with boundary conditions. Another method, and the focus of this work, represents the HJB equation as a backward-stochastic differential equation (BSDE). This allows us to solve the HJB equation using a reinforcement learning framework. A major bottleneck is present in both methods. Both methods require a neural network to be trained at each timestep. This means that a new network with large number of parameters need to be stored and trained at each timestep which requires a lot of memory, time and energy. In this work we propose different methods to solve the HJB equation in specific financial applications. To do this, we design specific neural network architectures that can fit the problem and derive loss functions to allow us to train efficient networks. In this work we explore four different problems. Firstly, we solve the HJB equation that arises in high-dimensional American options pricing. We present a method that uses two separate neural network that learns to approximate the price and delta. We improve efficiency by reducing the number of neural networks that we need to train. Secondly, we extend and solve the optimal trade execution problem for multiple agents and assets under the Markowitz criteria in a time-consistent way. Lastly, we solve the inverse problem with minimal arbitrage violations. The computation of implied volatilities from data is a challenging task that is necessary to accurately compute the price and hedging positions of over-the-counter (OTC) products.
  • Item
    An Efficient Non-Interactive OT-MP-PSI Protocol for Multi Institution Attack Detection
    (University of Waterloo, 2024-08-22) Arpaci, Onur Eren
    The Over Threshold Multiparty Private Set Intersection (OT-MP-PSI) problem is a cryptographic problem where N participants aim to identify elements that appear in at least t sets without revealing information about elements that do not meet this threshold. This protocol is particularly relevant for detecting multi-institution attacks, which require private and efficient information sharing. In order to detect the attackers, the institutions want to identify the common external IPs among other institutions without leaking the rest of the connection data. Existing solutions either require numerous communication rounds or have impractical computational complexities for large datasets. We introduce a non-interactive OT-MP-PSI protocol optimized by a novel hashing scheme with a computational complexity of O(t^2 M (N choose t)) as compared to the O(M(N log(M)/t)^{2t}) computational complexity of previous existing solutions. Experimental benchmarks using real-world network data demonstrate the protocol's practicality, showing significantly faster reconstruction times compared to existing methods and effective scalability with the number of participants and set sizes. Our protocol also efficiently solves the two-party private set intersection (N=t=2) and the multi-party private set intersection (N>2, t=N), offering a versatile solution for various applications.
  • Item
    Pretrained Transformers for Efficient and Robust Information Retrieval
    (University of Waterloo, 2024-08-22) Li, Minghan
    Pretrained transformers have significantly advanced the field of information retrieval (IR) since the introduction of BERT. Through unsupervised pretraining followed by task-aware finetuning, the encoder-only BERT effectively captures the complex semantics of texts and overcomes the limitations of traditional bag-of-words systems like BM25, addressing the lexical mismatch between queries and documents. The successor models based on decoder-only pretrained transformers such as GPT and LLaMA have further improved the ranking effectiveness by scaling up the computation. Undoubtedly, these pretrained transformers have made a profound impact on search and ranking, completely changing the landscape of modern IR systems. Consequently, pretrained transformers are now applied in various stages of a modern IR system, typically comprising a first-stage retriever, a second-stage reranker, and an optional large language model (LLM). For instance, the first-stage retriever can be a bi-encoder that separately encodes the queries and documents, and the encoded collection of documents is indexed into a vector database for nearest neighbor search. Given a query, the first-stage retriever selects the top-k similar documents from the vector database for the second-stage reranker to review. The following second-stage reranker is usually a cross-encoder which evaluates the query and documents jointly, further refining the search results. The final step is to use an LLM to summarize the retrieved information before returning it to the users, which is optional but has become more and more popular in modern search engines. However, pretrained transformers are not without flaws. Compared to traditional IR methods, the challenges of efficiency and robustness are frequently encountered in practice. For example, first-stage retrievers and their vector indexes often suffer from redundancy, leading to increased search latency and excessive storage requirements. The second-stage rerankers, based on cross-encoder pretrained transformers, also incur high computational costs when reranking a large set of retrieved documents. In addition, the representations of pretrained transformers are usually lossy compression of the inputs and tied to certain data distributions, making them susceptible to out-of-domain queries and subtle details such as dates and locations. These issues can further impair the downstream LLM, hindering the efficient extraction of useful information from the retrieved documents and promoting hallucinations when processing biased content retrieved by adversarial queries. Given the breadth of the challenges outlined, this thesis will delineate a specific scope of study for efficiency and robustness. Regarding efficiency, our focus will be on minimizing search latency and storage costs during the first-stage retrieval and second-stage reranking processes, along with reducing the generation latency of the downstream LLM. Regarding robustness, we will concentrate on the reliability of both retrievers and rerankers in handling out-of-domain queries, as well as ensuring the factuality and attribution accuracy in the output of the retrieval-augmented LLM. Under this context, we outline our efforts to address the efficiency and robustness challenges in major components of a modern IR system based on pretrained transformers, ranging from first-stage retrieval—including dense, sparse, hybrid, and multi-vector retrievers—to second-stage cross-encoder reranking, as well as tackling hallucination and attribution issues in downstream LLMs from a retrieval standpoint. For first-stage dense retrieval models, we explore index compression to reduce search latency and storage space through dimensionality reduction and product quantization. We also detail our work on improving the robustness of these models through techniques like model ensembling and integration with sparse retrieval methods. In extending these ideas to multi-vector retrieval systems, we employ dynamic lexical routing and token pruning to jointly optimize for efficiency and robustness during finetuning on ranking tasks. Additionally, we examine the integration of sparse retrieval in a multi-vector system and propose sparsified late interactions for efficient retrieval, ensuring compatibility with traditional inverted indexes and improving latency. In the later sections, we continue to address the efficiency and robustness challenges in second-stage rerankers based on cross-encoders. We introduce a straightforward strategy for cross-encoder rerankers by adding late interaction at the last layer to handle out-of-domain, long-sequence data with minimum latency overhead. We also assess the impact of using LLMs in query expansion for high-precision cross-encoder rerankers in conventional ranking tasks, finding that traditional query expansion methods can undermine the robustness of effective rerankers. After integrating first-stage retrieval with second-stage reranking, we propose a candidate-set pruning method with high-confidence error control to accelerate the reranking process reliably, allowing users to specify their precision goals while maintaining efficiency with theoretical guarantees. Finally, we use these optimized two-stage retrieval systems to address hallucination and attribution problems in downstream retrieval-augmented LLMs with improved efficiency for generation. We introduce a hybrid generation model that incorporates segments retrieved from the corpus directly into LLM outputs to reduce hallucinations and improve attribution in knowledge-intensive tasks. Our final system leverages our efficient and robust solutions in neural retrieval, delivering precise responses swiftly and accelerating content generation by processing multiple tokens per time step, with a post-hoc revision mechanism to balance robustness and efficiency. Overall, this thesis aims to improve the efficiency and robustness of key components in a contemporary IR system and their integration with downstream LLMs.
  • Item
    ClavaDDPM: Multi-relational Data Synthesis with Cluster-guided Diffusion Models
    (University of Waterloo, 2024-08-22) Pang, Wei
    Recent research in tabular data synthesis has focused on single tables, whereas real-world applications often involve complex data with tens or hundreds of interconnected tables. Previous approaches to synthesizing multi-relational (multi-table) data fall short in two key aspects: scalability for larger datasets and capturing long-range dependencies, such as correlations between attributes spread across different tables. Inspired by the success of diffusion models in tabular data modeling, we introduce Cluster Latent Variable guided Denoising Diffusion Probabilistic Models (ClavaDDPM). This novel approach leverages clustering labels as intermediaries to model relationships between tables, specifically focusing on foreign key constraints. ClavaDDPM leverages the robust generation capabilities of diffusion models while incorporating efficient algorithms to propagate the learned latent variables across tables. This enables ClavaDDPM to capture long-range dependencies effectively. Extensive evaluations on multi-table datasets of varying sizes show that ClavaDDPM significantly outperforms existing methods for these long-range dependencies while remaining competitive on utility metrics for single-table data.
  • Item
    Evaluating Privacy Metrics for Synthetic Tabular Data
    (University of Waterloo, 2024-08-22) Mushi , Wang
    This paper addresses the challenge of evaluating privacy risks in synthetic tabular data by examining black-box privacy metrics that do not require detailed knowledge of the data generation process. We focus on two sorts of attacks, black-box and white-box attacks. Utilizing six datasets from the UCI Machine Learning Repository, we evaluate the effectiveness of these metrics across various synthetic data generation models, including diffusion models like TabDDPM and traditional models like PrivBayes. Our findings reveal that while DOMIAS exhibits limited sensitivity across different datasets and configurations, DCR proves to be an effective measure of similarity between synthetic and real data, offering significant insights into privacy preservation. We also introduce the Step-wise Error Comparing Membership Inference (SECMI) attack, which assesses prediction errors at each generation step to infer membership status. The study concludes that diffusion models, such as TabDDPM, generally achieve a superior balance of utility and privacy compared to traditional models. These results highlight the need for robust, adaptable privacy metrics to reliably assess privacy risks in synthetic data, thereby ensuring its safe application across various domains.
  • Item
    Uncovering the Reliability and Consistency of AI Language Models: A Systematic Study
    (University of Waterloo, 2024-08-22) Khatun, Aisha
    Large Language Models (LLMs) have rapidly advanced, becoming general-purpose assistants and creative partners. Despite their widespread use, LLMs exhibit significant vulnerabilities to prompt variations and struggle with task understanding, leading to inconsistencies and factual inaccuracies in their responses. Traditional Natural Language Processing (NLP) benchmarks often overlook nuances in LLM behavior and reliability. This thesis addresses this gap by curating a dataset across six categories: Fact, Conspiracy, Controversy, Misconception, Stereotype, and Fiction. We rigorously define LLMs' factual accuracy, consistency, and robustness to prompt variations using diverse response formats and question variations, and evaluate these on 37 models. Our findings reveal LLMs' volatility and unreliability, particularly in the Controversy and Misconception categories, where conflicting training data impedes performance. Additionally, we explore LLMs' ability to generate coherent fictional narratives, probing their ability to retain and effectively utilize factual information, a critical requirement for creative tasks like story generation. While LLMs offer versatile applications, their reliability hinges on addressing challenges in prompt understanding and response consistency, emphasizing the need for ongoing research to enhance their performance across diverse tasks and applications.
  • Item
    Privacy-Preserving Communications from Privacy-Preserving Computations
    (University of Waterloo, 2024-08-21) Sasy, Sajin
    Our society is now more than ever heavily dependent on digital data and electronic communications. Every day more of our daily activities move to the digital realm, further entrenching us in this dependency. Unfortunately, along with this transition, we are witnessing a steady increase in cyber attacks and data leakages that undermine the security and privacy of our digital data. While the majority of online communications today use end-to-end encryption to protect data in transit, the metadata of who is conversing with whom, when, how much, and how often remains unprotected. Such metadata by itself can often reveal a lot about an individual to malicious network adversaries. In the context of digital services, while our personal data is protected in transit via end-to-end encryption, they are eventually stored in the clear in large data centers for processing and extracting utility. Such data centers have hence turned into coveted attack targets for malicious actors. Consequently there is a surge in research towards privacy-preserving computations over data, wherein the underlying data remains protected from malicious actors that could gain unauthorized access to a data center. One approach towards such privacy-preserving computations that has gained practical traction is Trusted Execution Environments (or TEEs). However, TEEs introduce their own set of challenges. Research has demonstrated TEEs’ susceptibility to side-channel attacks that undermine their purported security and privacy guarantees. In this dissertation, we start by establishing the notion of ‘fully oblivious’ algorithms. By definition such algorithms do not leak any information about the underlying private data they operate over via such side channels. We then provide several fully oblivious algorithms for the fundamental and recurring data operations of compacting, shuffling, and sorting of data. Oblivious compaction, shuffling, and sorting are core building blocks underlying several privacy-preserving solutions today, and are often the expensive bottleneck of such solutions. Our new algorithms improve over the state-of-the-art approaches for these primitives in some combination of their asymptotics, underlying constants, and parallelizability, and with demonstrable improvements in concrete performance always, enabling efficient and secure deployments of privacy-preserving computations. Finally, we propose a novel design for a metadata-protecting communication system (MPCS) TEEMS designed to facilitate metadata-protected messaging. While there has been tremendous research towards novel MPCS designs, there has been a clear dearth of MPCS constructions that can simultaneously support (i) low latency, (ii) high throughput, (iii) horizontal scalability, and (iv) asynchronicity. TEEMS presents an MPCS that can simultaneously support all four desirable properties for messaging systems by leveraging our new efficient fully oblivious algorithms as building blocks.
  • Item
    Mitigating Signalling Storms in 5G
    (University of Waterloo, 2024-08-16) Zhang, Bohan
    Over the past 45 years, cellular networks have evolved from a luxury service for the few to an indispensable necessity for everyone, giving rise to countless new applications. This remarkable technological advancement has fundamentally transformed everyone’s life. The latest fifth generation (5G) promises unprecedented performance in terms of latency, throughput, and subscriber density, as well as groundbreaking new services like autonomous driving, remote surgery, smart cities, and ubiquitous coverage. However, numerous challenges exist in achieving these promises, with one of the most significant being the vulnerability to signalling storm. A signalling storm occurs when the volume of control signals exceeds the network’s processing capacity, leading to service disruptions. During these disruptions, retries from the users and network entities amplify the problem, creating a snowball effect. This persistent threat, present since the era of the third generation (3G), remains a major concern in 5G networks. Over the past decade, signalling storms have caused severe outages worldwide, affecting millions of lives, resulting in financial losses amounting to millions of dollars, and millions of subscriber hours lost. With the transition from the fourth generation (4G) to 5G, network elements have been virtualized and assigned more granular tasks, significantly increasing their numbers and the number of signals required to complete system procedures. From the subscriber’s perspective, the enriched diversity of application scenarios and the increase in subscriber numbers have exponentially raised the complexity of network resource management and orchestration. Notably, the massive number of globally roaming cellular IoT devices with limited security features are susceptible to being compromised to execute low-cost Distributed Denial-of-Service (DDoS) attacks on operators worldwide. Due to the significant losses in the past, governments, operators, and academia consider signalling storms a major threat to 5G. This thesis investigates past cases and numerous industry white papers, research articles, and 3GPP standards, summarizing the causes, scenarios, and mitigation solutions for signalling storms in 5G. Following these investigations, we propose a blockchain-assisted 5G Authentication and Key Agreement protocol to mitigate registration signalling storms. We also introduce a secure and efficient group handover protocol to address handover signalling storms in 5G Low Earth Orbit satellite non-terrestrial networks (NTN). Furthermore, we design a signalling load-aware conditional handover to reduce signalling peak in 5G NTN. Network capacity has always been a driving force behind the evolution of cellular networks. We hope this work will shed light on the future design of 5G and beyond cellular networks.
  • Item
    Raising the Bar on Lowering Barriers: Improving Ease of Research and Development Contributions to Privacy Enhancing Technologies
    (University of Waterloo, 2024-08-14) Tracey, Justin
    As daily life becomes increasingly subject to surveillance economies and surveillance states, so do privacy enhancing technologies (PETs) become increasingly important to those who wish to live unmediated by these mass data collection practices. While improvements to the design and implementation of PETs have allowed more people than ever to take advantage of these tools, the research and development practices surrounding PETs require the work of domain experts—with their own biases, values, motivations, and awareness—doing their best to accommodate the needs and desires of users. As a result, an inevitable gap forms between what is built by those who have the skills and resources to develop these technologies, and what is needed by those who can only use what is available. In this thesis, we examine techniques for lowering barriers to entry for research and development contributions to PETs, so that users who wish to make such contributions are more readily able to do so. Towards this end, we use as a concrete example the Tor project, and how it interfaces with current research and development practices to demonstrate three methods of lowering such barriers: (i) a set of tools and techniques for conducting statistically sound Tor experiments, (ii) an analysis of the viability of Tor as a means of providing simple metadata protection in messaging apps, and (iii) an investigation into the effect on new contributors when porting a codebase written in a programming language without memory safety to the memory-safe Rust language, as Tor is doing. We find that, using these methods, the barriers to entry can be reduced, but that considerable future work still remains in this vein.
  • Item
    Neural Architecture Search for Website Fingerprinting
    (University of Waterloo, 2024-08-07) Arun Naik, Shreya
    To protect their online privacy, users often employ encrypted tunnels from networks like Tor, which create multi-hop pathways to conceal communications. Tor’s design, while effective, preserves packet timing and volume, making it vulnerable to website fingerprinting (WF) attacks. In these attacks, eavesdroppers use machine learning to match traffic patterns to specific websites. Recent WF research has shifted from traditional machine learning to deep neural networks (DNNs) for more precise attacks. This shift has led to a reliance on trial-and-error adaptations from other domains, potentially overlooking innovative architectural choices. To address this challenge, this thesis introduces FAWNS, a NAS framework for website fingerprinting and traffic analysis, integrated with Microsoft’s NNI AutoML toolkit. FAWNS automates DNN design using a predefined search space. Consequently, our evaluation shows that FAWNS-generated DNNs achieve accuracy comparable to state-of-the-art attacks on undefended Tor traffic and significantly higher accuracy on defended traffic, with a 26% increase on FRONT-protected traces and a 12% increase on WTF-PAD protected traces. These models also generalize well to unseen data, highlighting NAS's potential to enhance WF effectiveness.
  • Item
    A Comparison of Unsupervised Topic Modelling Techniques for Qualitative Data Analysis of Online Communities
    (University of Waterloo, 2024-07-25) Kaur, Amandeep
    Social media constitutes a rich and influential source of information for qualitative researchers. However, its vast volume and diversity present significant challenges, which can be assisted by computational techniques like topic modelling. But qualitative researchers often struggle to use computational techniques due to a lack of programming expertise and concerns about maintaining the nuanced aspects of their research, such as contextual understanding, subjective interpretations, and ethical considerations of their data. To address this issue, this thesis explores the integration of BERTopic, an advanced Large Language Model (LLM)-based method, into the Computational Thematic Analysis (CTA) Toolkit to support qualitative data analysis of social media. We conducted interviews and hands-on evaluations in which qualitative researchers compared topics from three modeling techniques --- LDA, NMF and BERTopic. Participants prioritized topic relevance, logical organization, and the capacity to reveal unexpected relationships within the data, valuing detailed, coherent clusters for deeper understanding and actionable insights. BERTopic was favored by 8/12 participants for its ability to uncover hidden connections. These findings underscore the transformative potential of LLM-based tools in providing deeper, more nuanced insights for qualitative analysis of social media data.
  • Item
    Efficient Memory Allocator for Restricting Use-After-Free Exploitations
    (University of Waterloo, 2024-07-17) Wang, Ruizhe
    Attacks on heap memory, encompassing memory overflow, double and invalid free, use-after-free (UAF), and various heap-spraying techniques are ever-increasing. Existing secure memory allocators can be generally classified as complete UAF-mitigating allocators that focus on detecting and stopping UAF attacks, type-based allocators that limit type confusion, and entropy-based allocators that provide statistical defenses against virtually all of these attack vectors. In this thesis, I introduce two novel approaches, SEMalloc and S2Malloc, for type- and entropy-based allocation, respectively. Both allocators are designed to restrict, but not to fully eliminate, the attacker's ability, using allocation strategies. They can significantly increase the security level without introducing excessive overheads. SEMalloc proposes a new notion of thread-, context-, and flow-sensitive 'type', SemaType, to capture the semantics and prototype a SemaType-based allocator that aims for the best trade-off amongst the impossible trinity. In SEMalloc, only heap objects allocated from the same call site and via the same function call stack can possibly share a virtual memory address, which effectively stops type-confusion attacks and make UAF vulnerabilities harder to exploit. S2Malloc aims to enhance UAF-attempt detection without compromising other security guarantees or introducing significant overhead. We use three innovative constructs in secure allocator design: free block canaries (FBC) to detect UAF attempts, random in-block offset (RIO) to stop the attacker from accurately overwriting the victim object, and random bag layout (RBL) to impede attackers from estimating the block size based on its address. This thesis demonstrates the importance of memory security and highlights the potential of more secure and efficient memory allocation by constraining attacker actions.
  • Item
    Technology Design Recommendations Informed by Observations of Videos of Popular Musicians Teaching and Learning Songs by Ear
    (University of Waterloo, 2024-07-11) Liscio, Christopher
    Instrumentalists who play popular music often learn songs by ear, using recordings in lieu of sheet music or tablature. This practice was made possible by technology that allows musicians to control playback events. Until now, researchers have not studied the human-recording interactions of musicians attempting to learn pop songs by ear. Through a pair of studies analyzing the content of online videos from YouTube, we generate hypotheses and seek a better understanding of by-ear learning from a recording. Combined with results from neuroscience studies of tonal working memory and aural imagery, our findings reveal a model of by-ear learning that highlights note-finding as a core activity. Using what we learned, we discuss opportunities for designers to create a set of novel human-recording interactions, and to provide assistive technology for those who lack the baseline skills to engage in the foundational note-finding activity.
  • Item
    Quantum Query Complexity of Hypergraph Search Problems
    (University of Waterloo, 2024-07-09) Yu, Zhiying
    In the study of quantum query complexity, it is natural to study the problems of finding triangles and spanning trees in a simple graph. Over the past decades, many techniques are developed for finding the upper and lower quantum query bounds of these graph problems. We can generalize these problems to detecting certain properties of higher rank hypergraphs and ask whether these techniques are still available. In this thesis, we will see that when the rank increase, complexity bounds still holds for some problems, although less effectively. For some other problems, their nontrivial complexity bounds vanish. Moreover, we will focused on using the generalized adversary and learning graph techniques for finding nontrivial quantum query bounds for different hypergraph search problems. The following results are presented. • Discover a general quantum query lower bound for subhypergraph-closed properties and monotone properties over r-partite r-uniform hypergraphs. • Provide tight quantum query bounds for the connectivity and acyclicity problems over r-uniform hypergraphs. • Present a nontrivial learning graph algorithm for the 3-simplex finding problem. • Formulate nested quantum walk in the adaptive learning context and use it to present a nontrivial quantum query algorithm for the 4-simplex finding problem. • Present a natural relationship of lower bounds for simplex finding of different ranks. • Use the learning graph formalization of tetrahedron certificate structure to find a nontrivial quantum query lower bound of the 3-simplex sum problem.
  • Item
    Triangle count estimation and label prediction over uncertain streaming graphs
    (University of Waterloo, 2024-07-09) Mohanty, Ipsita
    This thesis aims to integrate the notions of uncertainty with graph stream process- ing, presenting probabilistic models to enhance real-time analytical capabilities in graph database systems. These systems are crucial for managing interconnected data in various domains, such as social networks, traffic networks, and genomic databases, where data often contains incomplete or probabilistic connections that complicate processing and analysis. We develop and validate two main methodologies: a martingale-based approach for approximating triangle counts in edge uncertain streaming graphs and a Graph Neural Network (GNN)-based method for dynamic label prediction in attribute uncertain stream- ing graphs. Both methods demonstrate robust performance in handling dynamic and uncertain data, thus opening new avenues for future research in expanding the scope of graph-based analytics. This work lays the groundwork for future developments in uncer- tain graph processing, suggesting pathways to refine these approaches and explore new applications in dynamic environments.