Systems Design Engineering

Permanent URI for this collection: https://uwspace.uwaterloo.ca/handle/10012/9914

This is the collection for the University of Waterloo's Department of Systems Design Engineering.

Research outputs are organized by type (e.g., Master's Thesis, Article, Conference Paper).

Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.


Recent Submissions

Now showing 1 - 20 of 816
  • Designing for Trust: A Multi-Factor Investigation of Optometrists’ Perspectives on AI-Based Glaucoma Screening Systems
    (University of Waterloo, 2025-11-07) Karim, Ali
    Although glaucoma screening AI models show strong performance, their integration into clinical practice remains limited. Clinicians often face barriers rooted in technological acceptance, with trust emerging as a key determinant of adoption. Prior research has emphasized explainability, but a broader exploration of factors affecting trust is needed. This study investigates multiple factors shaping trust in AI and translates them into design requirements for next-generation glaucoma screening clinical decision support systems (CDSS). In a previous study, two real-world glaucoma patient cases, each comprising three visits at different times, were presented under both unimodal conditions (fundus images only) and multimodal conditions (fundus images, optical coherence tomography, visual fields, and medical history) through a mock interface simulating an AI-based glaucoma screening support system. During these simulated visits, nineteen licensed optometrists interacted with the system and participated in follow-up interviews, where they were asked whether they trusted the system and to explain their reasoning. The objective of this thesis is to identify the factors influencing optometrists’ trust in an AI-powered glaucoma screening tool and to propose design recommendations that can enhance trust in future iterations. The interview data were analyzed using Braun and Clarke’s thematic analysis approach. The emerging themes indicate that trust in the AI system is shaped by multiple factors: (1) alignment with clinicians’ expectations of AI’s role: flagging tool vs. 
consultant; (2) completeness of information; (3) communication of performance metrics: accuracy, sensitivity, confidence scores, perceived consistency, and perceived quality of training data; (4) clinical relevance of outputs (trends, actionable recommendations, differential diagnosis); (5) transparency in risk factor weighting, exclusions, and considered variables; (6) decision alignment between optometrists and the AI, assessed across decision inputs, identified risk factors, their relative importance, recommended actions, and the gradient of concordance in final decisions; (7) optimization of the AI for cautious screening so that all potential cases are captured; (8) interface usability supporting timely decisions; (9) users’ self-perceived expertise, occasionally leading to overreliance; (10) onboarding and training that highlighted the system’s features and limitations; and (11) increasing familiarity over time, which helped calibrate trust. Based on these findings, 17 design principles were proposed to guide the development of the next iteration of a trust-supportive interface for glaucoma screening decision support systems.
  • Manifold-Aware Regularization for Self-Supervised Representation Learning
    (University of Waterloo, 2025-11-04) Sepanj, Mohammad Hadi
    Self-supervised learning (SSL) has emerged as a dominant paradigm for representation learning, yet much of its recent progress has been guided by empirical heuristics rather than unifying theoretical principles. This thesis advances the understanding of SSL by framing representation learning as a problem of geometry preservation on the data manifold, where the objective is to shape embedding spaces that respect intrinsic structure while remaining discriminative for downstream tasks. We develop a suite of methods—ranging from optimal transport–regularized contrastive learning (SinSim) to kernelized variance–invariance–covariance regularization (Kernel VICReg)—that systematically move beyond the Euclidean metric paradigm toward geometry-adaptive distances and statistical dependency measures, such as maximum mean discrepancy (MMD) and Hilbert–Schmidt independence criterion (HSIC). Our contributions span both theory and practice. Theoretically, we unify contrastive and non-contrastive SSL objectives under a manifold-aware regularization framework, revealing deep connections between dependency reduction, spectral geometry, and invariance principles. We also challenge the pervasive assumption that Euclidean distance is the canonical measure for alignment, showing that embedding metrics are themselves learnable design choices whose compatibility with the manifold geometry critically affects representation quality. Practically, we validate our framework across diverse domains—including natural images and structured scientific data—demonstrating improvements in downstream generalization, robustness to distribution shift, and stability under limited augmentations. By integrating geometric priors, kernel methods, and distributional alignment into SSL, this work reframes representation learning as a principled interaction between statistical dependence control and manifold geometry. 
The thesis concludes by identifying open theoretical questions at the intersection of Riemannian geometry, kernel theory, and self-supervised objectives, outlining a research agenda for the next generation of geometry-aware foundation models.
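The variance-invariance-covariance objective that Kernel VICReg builds on can be sketched in a few lines. The following is a generic NumPy rendition of the standard (non-kernelized) VICReg loss with illustrative weight and epsilon values, not the thesis's kernelized formulation:

```python
import numpy as np

def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """Variance-Invariance-Covariance regularization sketch.

    za, zb: (n, d) embeddings of two augmented views of the same batch.
    Weights and eps are illustrative defaults, not the thesis's values.
    """
    n, d = za.shape
    # Invariance: mean squared distance between paired embeddings.
    inv = np.mean((za - zb) ** 2)
    # Variance: hinge pushing each embedding dimension's std above 1,
    # which guards against representational collapse.
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))
    var = var_term(za) + var_term(zb)
    # Covariance: penalize off-diagonal covariance entries to decorrelate
    # embedding dimensions.
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d
    cov = cov_term(za) + cov_term(zb)
    return sim_w * inv + var_w * var + cov_w * cov
```

A well-spread, aligned batch scores near zero, while a collapsed batch (all embeddings identical) is penalized heavily through the variance hinge; the kernelized variant in the thesis replaces these Euclidean statistics with kernel-space analogues.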
  • Towards a Novel Optical Spectroscopy Technique Using Photon Absorption Remote Sensing
    (University of Waterloo, 2025-11-04) Dhillon, Jodh
    Optical spectroscopy has shown great promise in the field of biomedical research. For example, works employing traditional spectroscopy approaches have demonstrated that analyzing a sample’s optical response to incoming light can effectively differentiate between healthy and diseased tissue. However, these techniques suffer from limitations due to the fact that they typically capture signals from only a single light-matter interaction type, such as absorption, scattering or fluorescence. Therefore, many traditional methods are constrained in terms of the types of samples they can feasibly analyze, as well as, potentially, the depth of their sample characterization, as they do not focus on capturing relevant information from other interaction modalities. This work employs photon absorption remote sensing (PARS) to overcome these limitations. PARS is a novel all-optical imaging technique capable of capturing radiative and non-radiative relaxation processes following electronic photon absorption. This thesis explores the initial development of the first PARS system specifically designed and optimized for optical spectroscopy applications, aimed at studying wavelength-dependent relaxation processes to characterize a wide range of liquid samples. The first step of this work was to build a non-radiative PARS spectroscopy system capable of accurately capturing the thermal and acoustic relaxation processes that arise from different ultra-violet (UV) excitation wavelengths. These signals were processed and used to construct a non-radiative PARS absorption spectrum for each sample of interest. These spectra were benchmarked against the absorption data collected from a NanoDrop spectrophotometer, which served as the ground truth in this work. 
This study revealed that for certain samples, such as eumelanin, which is highly absorbent to UV light and relaxes almost all absorbed energy non-radiatively, the non-radiative PARS spectroscopy system is capable of generating highly accurate absorption spectra. However, for samples with weaker UV absorption and less predominantly non-radiative relaxation, the spectra this system generated matched the ground truth less closely. The second step of this work was to integrate a radiative relaxation arm into the developed non-radiative PARS spectroscopy system. This pathway was configured to collect fluorescence emission spectra, which represent radiative sample relaxation, simultaneously with the collected non-radiative data. Radiative PARS absorption spectra were generated for each sample. In this way, the developed PARS system combines absorption (monitoring both relaxation pathways) and fluorescence emission spectroscopy onto a single bench-top system. The radiative PARS absorption spectra were compared to the ground truth, which revealed that molecules that are highly fluorescent in nature are more appropriately studied through the radiative relaxation arm than the non-radiative pathway. Total absorption spectra, which combine the non-radiative and radiative absorption data, were also generated, and it was determined that the absorption profiles of certain samples, such as NADH, are best studied using this approach. The final step of this work was to use the collected total absorption and fluorescence emission data from the PARS spectroscopy system to identify the composition of different mixtures of craft red and blue ink samples. Traditional linear and generalized bilinear models were employed to perform this unmixing and the results from this study indicate that the combination of the absorption and fluorescence data collected on this system allows for a more accurate identification of a mixture’s components than either data source individually.
This suggests that the PARS spectroscopy system provides an increased level of detail in sample characterization compared to single-modality spectroscopy systems. Ultimately, this research lays the groundwork for the development of a PARS spectroscopy system capable of being deployed in clinical settings to study samples and help inform diagnoses. This work demonstrates the feasibility of leveraging PARS for optical spectroscopy and presents a system design and framework that can be further iterated upon to enhance performance and enable a robust characterization of relevant and complex biological samples.
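The linear unmixing step described above can be illustrated with a minimal least-squares sketch. The spectra below are synthetic four-band stand-ins for the red and blue ink endmembers, not PARS measurements, and the crude nonnegativity clip substitutes for a proper constrained solver:

```python
import numpy as np

def unmix(mixture, endmembers):
    """Least-squares abundance estimate; endmembers is (d, k)."""
    coeffs, *_ = np.linalg.lstsq(endmembers, mixture, rcond=None)
    coeffs = np.clip(coeffs, 0.0, None)   # enforce nonnegativity crudely
    return coeffs / coeffs.sum()          # normalize to mixture fractions

# Synthetic endmember spectra standing in for red and blue ink.
red = np.array([0.9, 0.6, 0.2, 0.1])
blue = np.array([0.1, 0.2, 0.7, 0.9])
measured = 0.3 * red + 0.7 * blue         # a 30/70 mixture
fractions = unmix(measured, np.column_stack([red, blue]))
```

On this noiseless example the recovered fractions are exactly 0.3 and 0.7; with real absorption and fluorescence data, stacking both modalities into the endmember matrix is what gives the combined approach its edge over either source alone.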
  • Analysis of Limitations of AI Tools for Pediatric Speech Language Pathology Documentation and Mitigation Strategies
    (University of Waterloo, 2025-10-17) Tuinstra, Tia
    Speech Language Pathology (SLP) is a therapy discipline offered by KidsAbility, a pediatric rehabilitation clinic in Southern Ontario. Documentation is a key part of SLP and other therapy practice guidelines and can take up significant portions of a therapist’s time. AI-based clinical documentation aids have been developed to help reduce this burden, and one such tool - MutuoHealth’s AutoScribe - has been piloted by KidsAbility. Though this AI tool has been beneficial to some therapy disciplines, the SLP clinicians face unique challenges when using these tools. The model seemed unable to recognize speech therapy strategies or to parse the play-based script of pediatric appointments. This thesis seeks to explore the issues SLPs encounter with AI documentation tools and propose potential approaches to mitigate these issues. The AI documentation process was divided into the transcription pipeline, where an audio file input produced a corresponding transcript output, and the generation pipeline, where an input transcript produced a draft SOAP note. The SLPs who had participated in the AutoScribe pilot test were interviewed about their experiences with the tool and its integration into their workflows. The issues reported by the therapists were sorted into those more closely related to the transcript and those more closely related to the drafted SOAP note. A set of sample SLP appointments from KidsAbility were gathered from an extended AutoScribe pilot, with 10 selected as examples of appointment data (audio, transcripts, drafted and final SOAP notes) to test the transcription and generation pipelines. An augmented automatic speech recognition (ASR) pipeline based on a Whisper model was used to test improvements to the transcript. However, the generated transcripts were not significantly improved from the pilot test. Instead, ground truth transcriptions were manually created from the audio files to use for testing the generation pipeline. 
For SOAP note generation, the addition of discipline-specific context tailored to appointment type was tested. This context was curated in collaboration with SLPs from KidsAbility to include SOAP templates, definitions of key concepts, and information about speech data. A Llama 3.3 70B model was used for SOAP note generation with ground truth transcriptions and SLP-specific RAG-adjacent information as context. The input context was optimized over several iterations based on clinicians’ evaluations of generated SOAP note quality. KidsAbility’s SLPs had flagged sessions targeting speech practice as having particular difficulties with AutoScribe. The model seemed unable to make inferences about the child’s speech quality from the transcript alone. Methods of quantitatively assessing speech based on session audio were explored as ways to provide additional context on speech quality to the SOAP generation model. A sample appointment was selected for testing, and child speech samples of the targeted sound were sliced from the audio and assigned quality categories. These samples were then compared against correct productions using the cosine distance between their mel-spectrograms. The samples were also passed through a phoneme-based ASR model to get the layer activations. The cosine distances and layer outputs were then tested as predictive measures of articulation accuracy, with layer outputs yielding the best results. The resulting speech accuracy scores were then passed into the generation model as additional context, with the output containing correct statements about the nature of the child’s articulations. Though clinicians’ availability limited the extensiveness of generated SOAP note evaluations, the SOAP notes generated with SLP-specific context showed improvement compared to the basic model generation. The model also tended to repeat information from previous SOAP notes if examples were provided.
It was found that quantitative speech analysis does seem possible using phoneme model layer activations and cosine distances between the mel-spectrograms of correct articulations. Based on these findings, further optimizations to the generation pipeline and work on making effective AI tools for KidsAbility’s EY SLPs will continue.
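The mel-spectrogram comparison described above reduces to a cosine distance between two spectrogram arrays. In this sketch the arrays stand in for mel-spectrograms of a child's production and a reference correct production; the actual mel extraction (e.g., via a library such as librosa) is assumed to happen upstream:

```python
import numpy as np

def cosine_distance(spec_a, spec_b):
    """Cosine distance between two (mel-bins x frames) spectrogram arrays.

    0 means identical spectral energy patterns; values near 1 mean the
    patterns are orthogonal. Both arrays must share the same shape, so
    productions would be time-aligned or padded upstream.
    """
    a, b = spec_a.ravel(), spec_b.ravel()
    sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - sim

# Tiny 2x2 stand-in for a reference mel-spectrogram.
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
```

A production identical to the reference scores 0, while one with a completely disjoint energy pattern scores 1, which is why the distance can serve as a rough articulation-accuracy signal before the stronger layer-activation features take over.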
  • Using eye tracking to study the takeover process in conditionally automated driving and piloting systems
    (University of Waterloo, 2025-10-08) Ding, Wen
In a conditionally automated environment, human operators are often required to resume manual control when the autonomous system reaches its operational limits, a process referred to as takeover. This takeover process can be challenging for human operators, as they must quickly perceive and comprehend critical system information and successfully resume manual control within a limited amount of time. Following a period of autonomous control, human operators’ Situation Awareness (SA) may be compromised, potentially impairing their takeover performance. Consequently, investigating approaches to enhance the safety and efficiency of the takeover process is essential. Human eyes are vital to an individual’s information gathering, and eye tracking techniques have been extensively applied in previous takeover studies. The current study aims to enhance the takeover process by utilizing operators’ eye tracking data. The data analysis methods include machine learning techniques and statistical approaches, applied to the driving and piloting domains, respectively. Simulation experiments were conducted in two domains: a level-3 semi-autonomous vehicle in the driving domain and an autopilot-assisted aircraft landing scenario in the piloting domain. In both domains, operators’ eye tracking data and simulator-derived operational data were recorded during the experiments. The eye tracking data went through two categories of feature extraction: eye movement features linked predominantly to fixations and saccades, and Area-of-Interest (AOI) features indicating in which AOI the gaze was located. Eye tracking features were analyzed using both traditional statistical techniques and machine learning models. Key eye tracking features included fixation-based metrics and AOI features, such as dwell time, entry count, and gaze entropy.
Operators’ SA and takeover performance were measured by a series of domain-specific metrics, including Situation Awareness Global Assessment Technique (SAGAT) score, Hazard Perception Time (HPT), Takeover Time (TOT), and resulting acceleration. Three research topics were discussed in this thesis, each including one driving study and one piloting study. In Topic 1, significant differences in eye movement patterns were found between operators with higher versus lower SA, as well as between those with better and worse takeover performance. Beyond the notable differences across various Areas of Interest (AOIs) over three pre-defined Time Windows (TWs), in the driving domain, drivers with better SA and better takeover performance showed inconsistent eye movement patterns after the Takeover Request (TOR) and before they perceived hazards. In the piloting domain, pilots with shorter TOT showed more distributed and complex eye movement patterns before the malfunction alert and after resuming control. During the intervening period, their eye movements were more focused and predictable, indicating fast identification of necessary controls with minimal visual search. In Topic 2, significant differences in eye movement patterns were observed between younger and older drivers, as well as between learner and expert pilots. In the driving domain, older drivers exhibited more extensive visual scanning, indicating difficulty in effectively prioritizing information sources under time pressure. In the piloting domain, expert pilots not only allocated more attention to critical instrument areas but also dynamically adjusted their scanning behavior based on the current tasks. In Topic 3, machine learning models trained on eye tracking features successfully performed binary classification for both SA-related and takeover-performance-related metrics.
Model performance was evaluated using standard classification metrics, including accuracy, precision, recall, F1-score, and Area Under the ROC Curve (AUC). Finally, comparisons were made across Topics 1 and 2, as well as between the driving and piloting domains. The results suggest that better operators can flexibly adapt their gaze strategies to meet task demands, shifting between broad visual scanning and focused searching when appropriate. This shift in patterns underscores the importance of accounting for the specific Time window (TW) when interpreting operators’ eye movements. Overall, this thesis advances the understanding of different eye movement patterns during the takeover process by exploring a range of eye tracking features. The findings support the development of operator training programs and the design of customized interfaces to enhance the safety and efficiency of takeover performance.
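Among the AOI features named above, gaze entropy has a particularly compact definition: the Shannon entropy of the gaze-time distribution over AOIs. A minimal sketch, assuming dwell times per AOI have already been extracted from the eye tracker:

```python
import numpy as np

def gaze_entropy(aoi_dwell):
    """Shannon entropy (bits) of the gaze distribution over AOIs.

    aoi_dwell: dwell time per Area-of-Interest. Higher entropy means
    more distributed scanning; lower entropy means focused gaze.
    """
    p = np.asarray(aoi_dwell, dtype=float)
    p = p / p.sum()          # normalize dwell times to a distribution
    p = p[p > 0]             # ignore unvisited AOIs (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())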
  • Long-distance Travel in Canada: Multimodal Modeling with a National Network
    (University of Waterloo, 2025-09-24) Hajimoradi, Moloud
Long‑distance (LD) travel comprises a disproportionately large share of total passenger-kilometers despite representing a small fraction of trip counts. Yet LD travel remains underexamined in Canada’s vast geographic context. This thesis develops and applies a comprehensive modeling framework to analyze LD trip generation and mode choice for Canadian residents, leveraging data from Statistics Canada’s National Travel Survey (NTS) (January 2018–February 2020) and a new national multimodal transportation network constructed for this thesis. The network integrates geospatial centroids for Census Subdivisions with travel-time estimates for automobile, air, intercity rail, and bus modes. Trip generation was examined through both disaggregate (person‑level hurdle and zero‑inflated count models) and aggregate (origin‑destination zone‑pair hurdle models) approaches, incorporating socioeconomic variables (age, income, gender), trip attributes (distance, season), and accessibility measures. Results indicate that accessibility, rather than traditional demographics, may be an important variable in predicting whether an LD trip occurs, with lower local accessibility and greater distance to airports increasing the likelihood of at least one trip in the given month. However, once the trip “hurdle” is crossed, trip counts are less sensitive to accessibility, underscoring a behavioral distinction between deciding to travel and deciding how often. Even with the very large dataset, the models remain weak, suggesting that travel surveys alone are a limited method for understanding LD travel. Mode choice was analyzed using a Multinomial Logit (MNL) model alongside Machine Learning (ML) classifiers (Decision Trees, Random Forests, Support Vector Machines, Neural Networks). While MNL yields interpretable elasticities, with intercepts confirming preference for the driving mode and positive income effects for air travel, ML methods achieve superior predictive power.
Feature importance from Random Forests highlights travel time (especially driving) as the dominant determinant, followed by accessibility, with sociodemographic and seasonal factors playing secondary roles. Mode choice models with alternative-specific travel times are viable with publicly available data, and these results support seriously considering the use of ML in LD mode choice, even though understanding the influence of individual behavioral factors becomes more limited. Long-distance passenger travel demand models are not typically available in Canada despite their utility for infrastructure, service, and environmental planning. This thesis research demonstrates that such models are viable with existing publicly available data.
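The hurdle structure used for trip generation, a participation model for whether any trip occurs plus a zero-truncated count model for how many, can be written down as a probability mass function. This is a textbook logit-Poisson hurdle sketch, not the thesis's fitted specification:

```python
import math

def hurdle_pmf(k, p_any, lam):
    """P(K = k) under a logit-Poisson hurdle model of monthly LD trip counts.

    p_any: probability of crossing the hurdle (taking >= 1 trip), which a
    fitted model would express as a logit of accessibility and distance
    covariates; lam: rate of the zero-truncated Poisson for positive counts.
    """
    if k == 0:
        return 1.0 - p_any
    # Zero-truncated Poisson: the ordinary Poisson pmf renormalized over
    # k >= 1, so the positive-count process is modeled separately from
    # the participation decision.
    pois = math.exp(-lam) * lam ** k / math.factorial(k)
    return p_any * pois / (1.0 - math.exp(-lam))
```

Separating the two parts is exactly what lets accessibility matter strongly for the hurdle (`p_any`) while leaving the positive counts (`lam`) comparatively insensitive to it, as the results above report.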
  • Teach a robot to assemble a bolt to a nut with a handful of demonstrations
    (University of Waterloo, 2025-09-23) Yao, Xueyang
This thesis investigates data-efficient methods for learning and executing complex, multistep robotic manipulation tasks in unstructured environments. A two-level hierarchical framework is first proposed, in which high-level symbolic action planning is performed using Vector Symbolic Architectures (VSA), and low-level 6D gripper trajectories are modeled using Task-parameterized Probabilistic Movement Primitives (TP-ProMPs). This approach enables both interpretable planning and motion generalization from limited human demonstrations. Building on this foundation, the thesis introduces the Task-parameterized Transformer (TP-TF), a unified model that jointly predicts gripper pose trajectories, gripper states, and subtask labels conditioned on object-centric task parameters. Inspired by the parameterization strategy of Task-parameterized Gaussian Mixture Models (TP-GMMs), the TP-TF retains the data efficiency of classical Programming by Demonstration (PbD) methods while leveraging the expressiveness and flexibility of transformer-based architectures. The model is evaluated on a real-world bolt–nut assembly task and achieves a 70% success rate with only 20 demonstrations when combined with visual servoing for precision-critical phases. The results highlight the potential of combining structured representations with deep sequence modeling to bridge symbolic reasoning and continuous control. This work contributes a step toward scalable, more interpretable, and data-efficient learning frameworks for autonomous robotic manipulation.
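The movement-primitive representation underlying TP-ProMPs models a trajectory as a weighted sum of normalized basis functions over a phase variable. Below is a one-degree-of-freedom sketch with hand-picked weights; the actual framework learns weight distributions from demonstrations and conditions them on task parameters, which is omitted here:

```python
import numpy as np

def rbf_trajectory(weights, n_steps=100, width=0.02):
    """One-DoF trajectory as a weighted sum of Gaussian basis functions,
    the core representation behind ProMP-style movement primitives.

    weights: (n_basis,) basis weights; in ProMPs these would be sampled
    from a distribution fitted to demonstrations.
    """
    t = np.linspace(0.0, 1.0, n_steps)            # normalized phase in [0, 1]
    centers = np.linspace(0.0, 1.0, len(weights))
    phi = np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * width))
    phi = phi / phi.sum(axis=1, keepdims=True)    # normalize basis activations
    return phi @ weights                          # (n_steps,) trajectory

# A reach-and-return profile: rise to a peak mid-motion, then return.
traj = rbf_trajectory(np.array([0.0, 0.5, 1.0, 0.5, 0.0]))
```

Because the trajectory lives entirely in the low-dimensional weight vector, a handful of demonstrations suffices to fit it, which is the data-efficiency property the thesis carries over into the TP-TF.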
  • Experimental Study on the Vibration Response of a Jackleg Hammer Drill
    (University of Waterloo, 2025-09-22) Kuppa, Srividya
This thesis presents an experimental investigation into the response of mechanical vibration in jackleg hammer drills during underground rock drilling operations. While previous studies have primarily focused on vibration exposure at the handle or operator interface, this work analyzes vibration transmission through the full structure of the drill to better understand internal component behavior under realistic working conditions. Vibration data were collected using uniaxial accelerometers mounted on four key components (the fronthead, main cylinder, backhead, and handle), with measurements recorded along three spatial axes. Testing was conducted in operational environments, capturing variations across distinct drilling phases, including collaring, sustained drilling, and retraction. The acquired data were processed using time and frequency domain methods, including Fast Fourier Transform (FFT) and Root Mean Square (RMS) analysis. Results revealed significant directional dependence of vibration, with the axial (X-axis) component exhibiting the highest amplitudes during drilling. During collaring, when the drill bit lacks a guiding groove, vibration increased across all axes. A resonance condition was observed at approximately 142 Hz in the handle assembly, suggesting localized amplification potentially due to dynamic interaction between structural components. By characterizing dominant frequencies, directional behavior, and phase-specific amplification trends, this study provides a system-level understanding of vibration response in jackleg drills. The findings establish a foundation for future research aimed at developing targeted design improvements and vibration mitigation strategies to enhance operator safety and tool performance.
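The FFT and RMS processing described above can be sketched on a synthetic accelerometer trace. The 142 Hz tone below is chosen only to mirror the handle resonance reported in the abstract, and the sampling rate is an assumption, not the study's instrumentation:

```python
import numpy as np

fs = 2000.0                                    # sampling rate (Hz), assumed
t = np.arange(0, 1.0, 1.0 / fs)                # 1 s of samples
signal = 3.0 * np.sin(2 * np.pi * 142.0 * t)   # synthetic 142 Hz component

# RMS level: for a pure sine of amplitude A, RMS = A / sqrt(2).
rms = np.sqrt(np.mean(signal ** 2))

# FFT magnitude spectrum and the dominant frequency bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
dominant = freqs[np.argmax(spectrum)]
```

On real drill data the trace contains broadband impact energy rather than one tone, but the same two quantities, the RMS per axis and the peak of the magnitude spectrum, are what reveal the directional dependence and the ~142 Hz handle resonance.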
  • Active upper-limb prosthesis for swimming
    (University of Waterloo, 2025-09-22) Korovkina, Vasylysa
Swimming presents unique challenges for individuals with upper-limb absence, as it requires complex coordination, symmetric propulsion, and control of body balance. Despite these demands, swimming also offers significant rehabilitative benefits, promoting muscle engagement, joint mobility, and cardiovascular health. However, current upper-limb prostheses rarely address the specific requirements of swimming, and most available devices remain passive or non-functional in aquatic environments. This thesis proposes a novel design for a swimming-specific forearm prosthesis that actively adjusts wrist orientation in response to the swimmer's forearm motion. By segmenting the freestyle stroke into distinct phases—entry, catch, pull, push, and recovery—the control system simplifies the dynamic task into manageable sub-problems. Each phase is associated with a predefined wrist rotation, which simplifies control by reducing the continuous motion into a set of discrete, easily manageable movements. The phase recognition is performed using an inertial measurement unit (IMU), which monitors arm motion and triggers appropriate commands to the servo motor embedded in the prosthesis. A prototype was developed to demonstrate this concept. The device includes a custom 3D-printed forearm structure that houses all electronic components—IMU, servo motor, controller, and battery—and uses a belt-pulley mechanism to transmit motion. Although the prototype is not yet suitable for use in water, it effectively demonstrates the feasibility of phase-based control for dynamic tasks such as swimming. The design emphasizes modularity, simplicity, and the potential for future development into a fully water-resistant system suitable for rehabilitation, recreational use, or as part of training for competitive para-swimmers.
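The phase-to-rotation mapping described above amounts to a small lookup from recognized stroke phase to servo command. The angle values in this sketch are illustrative placeholders, not the thesis's calibrated settings, and the phase label is assumed to come from the IMU-based classifier:

```python
# Predefined wrist angles per freestyle stroke phase (placeholder values).
WRIST_ANGLE_DEG = {
    "entry": 0,
    "catch": 20,
    "pull": 45,
    "push": 30,
    "recovery": 0,
}

def wrist_command(phase):
    """Return the servo angle for a detected stroke phase.

    In the prototype the phase would be produced by IMU-based phase
    recognition; here it is simply passed in as a label.
    """
    if phase not in WRIST_ANGLE_DEG:
        raise ValueError(f"unknown stroke phase: {phase}")
    return WRIST_ANGLE_DEG[phase]
```

Collapsing the continuous stroke into five discrete commands is exactly the simplification the design relies on: the controller never has to track wrist orientation continuously, only to fire one of five setpoints as phases change.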
  • Development and Optimization of Terahertz Detection Systems
    (University of Waterloo, 2025-09-22) Zhou, Rui
    Terahertz (THz) technology, occupying the spectral range between microwaves and infrared radiation (0.1–10 THz), has rapidly emerged as a compelling focus of scientific and engineering research. Characterized by its non-ionizing nature, strong penetration through non-metallic materials, and high sensitivity to molecular composition and water content, THz radiation offers significant potential across a broad spectrum of applications. THz detection systems form the foundational core of THz technology platforms, serving the critical function of converting incident terahertz radiation into quantifiable electrical signals. The effectiveness and applicability of THz technology depend heavily on the performance of these detection systems. However, current THz detection technologies remain limited in several critical aspects, including sensitivity, accuracy, temporal resolution, and scalability. Specifically, weak interaction between THz waves and detectors often restricts signal detection, while environmental noise and material inconsistencies reduce measurement fidelity. Furthermore, the relatively slow response times hinder the real-time capture of dynamic biological and communicational processes. These challenges pose substantial barriers to the development of high-performance THz systems, constraining the practical implementation of THz technology's immense potential across various fields. In response to these challenges, this research presents a comprehensive exploration of advanced THz detection strategies, introducing multiple technological innovations aimed at achieving accurate, sensitive, fast, compact, and cost-effective THz detection solutions. Central to this work is the design of a novel graphene-integrated microbolometer, forming the core of a THz Microbolometer Array Imaging System (MAIS). This microbolometer features an optimized structural design and tailored material composition, significantly improving responsivity, detectivity, and response time. 
This innovative microbolometer design not only significantly enhances THz detection performance but also establishes a solid foundation for advancing THz imaging applications. Complementing detection methodology development, a Micro Circular Log-Periodic Antenna (MCLPA) was designed and optimized using a custom-developed Evolutionary Neural Network (ENN). This algorithm-driven approach enables efficient optimization of the sophisticated antenna design, resulting in a compact structure with broad bandwidth, high gain, and optimal impedance matching. This ENN-driven MCLPA represents a significant breakthrough in THz antenna engineering, introducing a transformative design paradigm that synergistically integrates algorithmic intelligence with structural innovation. In conclusion, this work significantly contributes to overcoming key limitations in THz detection by integrating advancements in novel device architecture, advanced material engineering, and innovative algorithm-driven design methodology. These innovations collectively enhance sensitivity, accuracy, response speed, system compactness and cost-effectiveness, representing a considerable step forward in the performance of THz detection systems. Beyond technical improvements, the results provide a solid foundation for practical implementation in biomedical and other high-impact applications. Overall, the contributions made herein substantially advance the development of THz technology and offer promising pathways for its transformative application across scientific, industrial and clinical fields.
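The evolutionary half of an ENN-style design loop can be sketched as a simple (1+1) evolution strategy: mutate a parameter vector standing in for antenna geometry, and keep the mutant only when a surrogate fitness improves. The quadratic fitness below is a toy placeholder for the neural-network surrogate of antenna performance, not the thesis's algorithm:

```python
import random

def evolve(fitness, x0, sigma=0.1, iters=500, seed=0):
    """Minimal (1+1)-style evolutionary loop minimizing `fitness`."""
    rng = random.Random(seed)
    best, best_f = list(x0), fitness(x0)
    for _ in range(iters):
        # Mutate every parameter with Gaussian noise.
        cand = [v + rng.gauss(0.0, sigma) for v in best]
        f = fitness(cand)
        if f < best_f:                 # keep only improving mutants
            best, best_f = cand, f
    return best, best_f

# Toy surrogate: distance to a hypothetical optimal geometry vector.
target = [0.3, 0.7]
fit = lambda x: sum((a - b) ** 2 for a, b in zip(x, target))
sol, loss = evolve(fit, [0.0, 0.0])
```

The appeal of this pattern for antenna design is that each fitness call queries a cheap learned surrogate instead of a full electromagnetic simulation, letting the search explore many candidate geometries quickly.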
  • Switching User Perspectives: Using Virtual Reality to Understand Users for User Experience Research and Design
    (University of Waterloo, 2025-09-22) Lee, Jieun
Virtual reality (VR) technology can be used as a tool in user experience (UX) research and design to understand and empathize with users. Several works show that the perspective-taking ability of VR in a simulated, immersive environment is helpful in fostering empathy and understanding, which is a crucial first stage within the UX design thinking process. However, it remains unclear how VR could be used to understand multiple users, reflecting common UX research projects, and what kinds of perspective-taking interaction better facilitate empathy and understanding of the users. This thesis introduces switching perspectives interaction in VR to understand different user problems, reflecting the nature of UX research. We conducted a mixed-methods, between-participant study comparing the Switching Perspectives condition to two single user perspectives in order to investigate how different types of perspective-taking influence researchers’ and designers’ understanding and empathy for user problems and needs. The findings show that both affective and cognitive empathy are fostered across the different perspectives, shown through the avatar embodiment questionnaire and the interpersonal reactivity index questionnaire. The qualitative data indicate that Switching Perspectives influences participants to see the bigger picture of the user problem, prompting them to ideate solutions that impact both user groups compared to Single-Perspective. With these findings, this thesis aims to explain how and why these understandings and empathy are facilitated towards specific user groups. I conclude this thesis with suggestions on using Switching Perspectives and Single-Perspective in UX research, with recommendations on the general usage of perspective-taking in VR to empathize better and understand user problems.
  • Item type: Item ,
    Computer Vision in Ice Hockey: Realistic Stick Augmentation and Virtual Game Reconstruction
    (University of Waterloo, 2025-09-22) Chomko, Vasyl
Ice hockey presents a uniquely challenging environment for computer vision due to its fast pace, heavy occlusions, motion blur, and the visual similarity of objects like sticks and players. Compounding these issues is a lack of annotated data, which limits the effectiveness of modern deep learning models. This thesis addresses two key problems in this domain: robust detection of hockey sticks under visual noise, and the generation of realistic, annotated video data to support model development and analytics. First, we introduce Synthetic Local Data Augmentation (SLDA), an instance-level augmentation strategy tailored for hockey stick segmentation. SLDA injects real segmented stick masks into broadcast images using context-aware transformations such as motion blur, geometric scaling, and lighting adjustments. Applied to a custom dataset of over 4,000 stick annotations, SLDA significantly improves segmentation accuracy on occluded and fast-moving sticks, yielding gains of up to +5.8% mAP@50 and +4.0% F1-score over strong baselines. Second, we present a Unity-based 3D Hockey Simulation Tool that reconstructs entire hockey game sequences from tabular puck and player coordinates. The simulator animates realistic gameplay with fully configurable cameras, enabling multi-angle video synthesis and precise ground-truth annotations. This tool is designed to support model training, evaluation, and visual analytics by converting coordinate logs into synchronized multi-view videos paired with dense per-frame annotations—player bounding boxes, player rink coordinates, segmentation masks, and full camera parameters (intrinsics and extrinsics). A full quantitative assessment of these uses is left for future work. Together, these contributions demonstrate how targeted augmentation and synthetic simulation can overcome real-world data limitations in sports vision.
By combining SLDA’s image-level enhancements with full-scene reconstruction, this thesis lays a foundation for more robust, scalable, and intelligent systems in automated hockey analytics and beyond.
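The SLDA idea of pasting real segmented stick crops into broadcast frames with context-aware transforms can be illustrated with a minimal sketch. The function name, the nearest-neighbour scaling, and the shift-and-average motion blur below are simplifications for illustration, not the thesis's implementation:

```python
import numpy as np

def paste_stick(frame, stick_rgba, top_left, scale=1.0, blur_len=0):
    """Paste a segmented stick crop (RGBA) into a frame with simple
    context-aware transforms: geometric scaling and horizontal motion blur.
    Illustrative sketch only — names and transforms are assumptions."""
    h, w = stick_rgba.shape[:2]
    # Geometric scaling via nearest-neighbour index sampling.
    new_h, new_w = max(1, int(h * scale)), max(1, int(w * scale))
    ys = np.arange(new_h) * h // new_h
    xs = np.arange(new_w) * w // new_w
    crop = stick_rgba[ys][:, xs].astype(np.float32)
    # Horizontal motion blur: average shifted copies of the crop.
    if blur_len > 1:
        acc = np.zeros_like(crop)
        for k in range(blur_len):
            acc += np.roll(crop, k, axis=1)
        crop = acc / blur_len
    # Alpha-composite the (possibly blurred) crop onto the frame.
    y0, x0 = top_left
    region = frame[y0:y0 + new_h, x0:x0 + new_w].astype(np.float32)
    alpha = crop[..., 3:4] / 255.0
    region = alpha * crop[..., :3] + (1 - alpha) * region
    frame[y0:y0 + new_h, x0:x0 + new_w] = region.astype(frame.dtype)
    return frame
```

A real pipeline would also match lighting statistics between crop and background, which is omitted here for brevity.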
  • Item type: Item ,
    Investigating the gap between research and practice in Robot-Based therapy for Autism Spectrum Disorder
    (University of Waterloo, 2025-09-22) Saiko, Sabrina
    The use of social robots as therapeutic tools for children with Autism Spectrum Disorder (ASD) has emerged as a promising area of research in recent years. Numerous studies have demonstrated the potential of these robots to support the development of social skills, including improvements in eye contact, verbal and physical communication, joint attention, and other interpersonal behaviors. Researchers such as Dautenhahn, Dickstein-Fisher, and initiatives like AskNAO have documented positive short- and medium-term outcomes across diverse populations and robotic systems. Despite this growing body of evidence, the real-world adoption of social robots in clinical or home-based ASD therapy remains limited. This study investigates the barriers that hinder the practical implementation of social robots in ASD care by analyzing current research trends and comparing them with the perspectives of key stakeholders — namely, clinicians and parents of children with ASD. We employed a mixed-methods approach to gain a multidimensional understanding of these challenges. First, a systematic literature review was conducted using major academic databases, including PubMed, IEEE Xplore, and Scopus. Studies published between January 2016 and December 2024 were examined, focusing specifically on robot-child interaction and therapeutic application in the context of ASD. Our review revealed that although all studies reported measurable benefits for children, none of the robot systems had been systematically integrated into regular therapeutic practices. Several limitations were identified, including a general lack of user testing—most studies involved fewer than ten participants—variation in robot appearance and behavior due to the absence of a standardized design framework, and inconsistent metrics used to assess therapeutic effectiveness. 
To complement the literature review and gain insight into real-world expectations and constraints, we conducted semi-structured interviews with clinicians and distributed surveys to parents. These qualitative and quantitative methods allowed us to explore stakeholders’ attitudes toward social robots, identify their practical concerns, and gather suggestions for how these technologies could be improved for clinical use. Preliminary findings highlight concerns about inconsistent robotic design, including differences in personality, appearance, and level of customization. In addition, participants cited the lack of intuitive interfaces for therapists and caregivers as a significant obstacle to implementation. Many also pointed to the absence of evidence-based clinical guidelines that would support decision-making and evaluation in real-world therapy settings. Through this research, we aim to bridge the gap between research success and real-life clinical integration by identifying key areas where standardization and clearer communication with users could improve both the research process and product development. Specifically, future work can focus on establishing core therapeutic activities that robots can support, ensuring robots are adaptable and customizable to meet individual needs, defining cost-effectiveness benchmarks to guide adoption decisions, and promoting the development of ASD-specific robotic solutions. The findings from this study offer a roadmap for researchers, designers, and healthcare providers to collaboratively address the challenges preventing broader adoption of social robots in autism care. By aligning technological innovation with stakeholder needs, this work contributes to a more realistic and practical understanding of how to transition social robotics from research labs into everyday therapeutic contexts.
  • Item type: Item ,
    From Far-Field Dynamics to Close-Up Confidence: Action Recognition Across Varying Camera Distances
    (University of Waterloo, 2025-09-22) Buzko, Kseniia
Human action recognition (HAR) refers to the task of identifying and classifying human actions within videos or sequences of images. This field has gained significant importance due to its diverse applicability across domains such as sports analytics, human-computer interaction, surveillance, and interpersonal communication. Accurate action recognition becomes especially difficult when the camera distance changes, because the cues that matter shift with scale. For instance, a close-up hinges on facial emotion (such as smiles and eye gaze), whereas a medium shot relies on hand gestures or objects being manipulated. In the context of HAR, we distinguish two primary scenarios that illustrate this challenge. The first is the far-field setting, characterized by subjects positioned at a distance and often exhibiting rapid movement, which leads to frequent occlusions. This scenario is commonly observed in sports broadcasts, where capturing the game’s dynamics is essential. In contrast, the near-field setting involves subjects that are nearby and tend to remain relatively static. This setting enables the capture of subtle yet informative gestures, similar to those observed in presenter-focused videos. Although most studies treat these regimes separately, modern media (films, replays, vlogs) cut or zoom fluidly between them. An effective recognizer must therefore decide dynamically which cues to prioritize: facial emotion in tight close-ups, hand or torso motion in medium shots, and full-body dynamics in wide views. Despite substantial progress, current HAR pipelines rarely adapt across that zoom continuum. This thesis therefore asks: What scale-specific hurdles confront human action recognition in far-field, near-field, and zoom-mixed scenarios, and how can insights from separate case studies keep recognition robust when the camera sweeps from full-body scenes to tight close-ups and back again? To answer, we contribute three scale-aware systems:
1. Hockey Action Identification and Keypose Understanding (HAIKYU) (far-field). For hockey broadcasts, we introduce temporal bounding-box normalization, which removes camera-induced scale jitter, and a 15-keypoint skeleton that adds stick endpoints. Combined with normalization, this improves Top-1 accuracy from 31% to 64%, showing that stick cues are indispensable for ice-hockey actions.
2. Confidence Fostering Identity-preserving Dynamic Transformer (CONFIDANT) (near-field). We curate a 38-class micro-gesture dataset and train an upper-body action recognizer that flags unconfident cues, such as folding arms, crossing fingers, and clasping hands. A diffusion-based video editor then rewrites these segments into confident counterparts, serving as a downstream demonstration of fine-grained recognition.
3. Scale-aware routing framework for mixed-zoom action recognition (Zoom-Gate) (zoom-mixed). A lightweight zoom score derived from the bounding-box area and the density of detected keypoints routes each tracklet to the specialist model best suited to that scale. Experiments confirm that this scale-aware routing, combined with context-specific skeletons, delivers robust performance across mixed-zoom datasets.
Collectively, these contributions demonstrate that coupling scale-aware preprocessing with context-specific skeletons can maintain pose-centric HAR reliability across the zoom spectrum. The resulting frameworks open avenues for real-time segmentation, multi-view fusion, and ultimately a unified, scale-invariant action understanding pipeline.
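The Zoom-Gate routing idea — a cheap zoom score from bounding-box area and keypoint density deciding which specialist model sees each tracklet — can be sketched as below. The score formula, weights, thresholds, and model names are illustrative assumptions, not the thesis's exact design:

```python
def zoom_score(bbox, frame_area, keypoints, conf_thresh=0.3):
    """Illustrative zoom score: fraction of the frame covered by the
    person box, weighted by how many keypoints are confidently detected
    (tight close-ups crop out many joints). Assumed formula, for sketch only."""
    x0, y0, x1, y1 = bbox
    area_frac = max(0.0, x1 - x0) * max(0.0, y1 - y0) / frame_area
    visible = sum(1 for (_, _, c) in keypoints if c >= conf_thresh)
    density = visible / max(1, len(keypoints))
    return area_frac * (2.0 - density)  # big box + few visible joints => more "zoomed in"

def route(score, close_up=0.25, medium=0.05):
    """Send each tracklet to the specialist suited to its scale.
    Thresholds are placeholders, not tuned values from the thesis."""
    if score >= close_up:
        return "near-field model"   # e.g. upper-body micro-gesture recognizer
    if score >= medium:
        return "medium model"
    return "far-field model"        # e.g. full-body skeleton recognizer
```

In a real system the thresholds would be calibrated on validation data rather than hard-coded.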
  • Item type: Item ,
    Understanding AI’s Impact on Clinical Decision-Making: A Comparative Study of Simple and Complex Primary Care Scenarios
    (University of Waterloo, 2025-09-19) Mehri, Sormeh
    Clinical decision-making is a complex cognitive process shaped by multiple factors, including cognitive biases, clinical context, and the integration of healthcare technologies. This thesis investigates how the introduction of artificial intelligence (AI)-enabled decision support tools influences clinical reasoning processes in primary care settings. Using Cognitive Work Analysis (CWA), Decision Ladder (DL) frameworks, and content analysis methods, this study qualitatively examines clinician decision-making behaviors across traditional electronic medical record (EMR) environments and AI-supported scenarios. Fourteen clinicians from Ontario, Canada, participated in scenario-driven sessions involving routine (uncomplicated urinary tract infections) and complex (mental health distress) cases. Analysis revealed distinct cognitive shortcuts, shifts, and reliance patterns influenced by AI. Specifically, AI systems reinforced heuristic-driven decisions for routine cases but introduced additional cognitive demands in complex scenarios due to information integration requirements. Visual emphasis in the DLs highlighted AI-driven cognitive shortcuts and behavior modifications. Limitations include scenario-driven constraints and a small, region-specific sample with similar EMR and AI experiences. Future research should explore mid-complexity scenarios, incorporate diverse clinician populations, and evaluate long-term effects of AI integration on clinical reasoning. This work contributes to understanding the nuanced interplay between cognitive processes and AI technology, informing user-centered design strategies for healthcare decision support systems.
  • Item type: Item ,
    Multi-level Temporal Understanding in Video Analysis: From Action Recognition to Quality Assessment
    (University of Waterloo, 2025-09-17) Li, Yaoxin
    Artificial intelligence algorithms have permeated virtually every facet of contemporary life, from personalized shopping recommendations and targeted advertising to search engine optimization and multimedia content delivery. Among these applications, video content—as a predominant carrier of information in multimedia—occupies a position of exceptional significance. However, the application of AI algorithms to video analysis remains insufficiently mature, with comprehensive video content interpretation persisting as a critical challenge. This dissertation presents an integrated framework for multi-level temporal understanding in video analysis, advancing a coherent progression from fundamental action recognition to temporal localization and ultimately to qualitative assessment. The foundation of our integrated approach begins with addressing fundamental limitations in current action recognition paradigms. We identify a critical gap in existing methodologies where trimmed approaches restrict model training to curated action segments, while untrimmed methods rely solely on video-level supervision. Both approaches fail to exploit the inherent complementarity between action and non-action segments within complete temporal sequences. To address this limitation, we introduce a novel multi-stage contrastive learning architecture that hierarchically extracts motion-critical features through coarse-to-fine-to-finer temporal contrasting. This approach establishes a self-correcting learning regime where action discriminability emerges from explicit comparisons against non-action references, effectively suppressing static bias amplification while enhancing temporal sensitivity. Building directly upon this temporal understanding foundation, our second research direction extends these temporal insights to the challenge of precise action localization within continuous video streams. 
This integrated framework incorporates a multi-modal fusion classifier with adaptive modality weighting, a Class-Semantic Attention mechanism for precise proposal generation, and cross-domain prototype alignment enabling knowledge transfer between trimmed and untrimmed paradigms. This advancement represents a natural progression from recognizing what actions occur to precisely determining when they occur within the temporal dimension. This system was validated through participation in the CVPR ActivityNet Temporal Action Localization challenge, achieving second-place rankings in both 2021 and 2022 editions, with the action classification component attaining first place on the validation set—demonstrating the practical efficacy of our theoretical contributions. The third component completes our multi-level temporal understanding framework by addressing how well actions are performed—moving beyond recognition and localization to qualitative assessment. This progression reflects the natural evolution of video understanding from basic detection to increasingly nuanced temporal analysis. We develop a novel Mamba-based framework that leverages selective state space models to capture long-range dependencies in sequential data. This approach enables fine-grained analysis of temporal patterns and subtle motion nuances that determine action quality across various domains including sports performance, physical rehabilitation, and skill assessment. Experimental results demonstrate superior performance compared to transformer and CNN-based methods in both accuracy and computational efficiency. The collective contributions of this dissertation establish a comprehensive framework for multi-level temporal understanding in videos. Rather than representing isolated research directions, these three components form a coherent progression from fundamental action recognition (what), to precise temporal localization (when), and finally to qualitative assessment (how well). 
This integrated approach advances both the theoretical understanding and practical capabilities in AI-driven video analysis, offering a unified perspective that bridges the gap between academic research and real-world applications across the spectrum of video understanding tasks. By addressing these complementary facets within a unified framework, our work establishes new foundations for temporal video understanding that can be applied across diverse application domains.
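The action-versus-non-action contrast at the heart of the first contribution can be sketched as a generic InfoNCE-style loss, where an action-clip embedding is pulled toward another view of the same action and pushed away from non-action segment embeddings. This is a simplified single-stage version for illustration, not the multi-stage coarse-to-fine-to-finer formulation itself:

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: `anchor` and `positive` are embeddings of the
    same action; `negatives` are embeddings of non-action segments.
    Generic sketch under assumed names, not the thesis's exact loss."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Similarity of anchor to positive (index 0) and to each negative.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # low when the positive wins
```

The multi-stage variant described above would apply this contrast repeatedly at progressively finer temporal granularities.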
  • Item type: Item ,
    Studying the Biomechanics of a Wheelchair Basketball Free Throw using Pose Estimation
    (University of Waterloo, 2025-09-16) Mohammad, Hisham
    Wheelchair basketball is a popular Paralympic sport where athletes with varying disabilities compete under a point-based classification system. Lower-class athletes (1.0–2.5), with higher levels of disability, often struggle to engage their trunk and core muscles, while higher-class athletes (3.0–4.5) have greater functional ability and utilize their trunk extensively. Coaches must consider these functional disparities when formulating strategies and designing individualized training regimens. Consistent free-throw shooting is critical in wheelchair basketball, as it offers an uncontested scoring opportunity. Higher-class athletes, who incorporate trunk motion, rely less on their arms for force generation, resulting in distinct shooting mechanics. Given the biomechanical variability arising from these physical differences, understanding individual shooting techniques is vital for optimizing performance. Motion capture technologies are widely employed to analyze and improve athletic movements. However, traditional systems, such as wearable sensors and marker-based motion tracking, are often costly, time-intensive, and restrictive to mobility. Markerless motion capture systems address these limitations using computer vision techniques like pose estimation. Convolutional neural networks (CNNs) trained on large human image datasets can accurately detect joints and limbs, enabling real-time analysis. Commercial systems typically require multiple cameras, but deploying pose estimation CNNs on mobile devices allows motion analysis using only a built-in camera, enhancing portability and accessibility for sports training and biomechanical research. This research focuses on designing and deploying pose estimation models within a mobile application to analyze the shooting arm's motion during a basketball free throw, with specific considerations for wheelchair basketball players. 
The pose estimation models, trained on the COCO-WholeBody dataset to detect fingertip positions, were deployed on an iPhone and tested for accuracy and computational performance, particularly for real-time motion analysis. The derived joint positions are used to calculate kinematic and dynamic metrics, including joint angles and torques. The system's joint angle calculations were compared against the Vicon motion capture system. While upper arm and elbow angle errors had a root mean squared error (RMSE) within an acceptable range (less than 20°), wrist angle errors exceeded 65° due to limitations in pose estimation accuracy and the iPhone camera's frame rate. To demonstrate the system's utility, two shooting studies were conducted: (1) a comparison of biomechanics between one-motion and two-motion shooting techniques, and (2) a biomechanical analysis of the shooting arm contrasting a national-level class 1 wheelchair basketball athlete with class 4.5 able-bodied participants shooting from a basketball wheelchair.
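The joint-angle and RMSE computations underlying the Vicon comparison reduce to simple planar geometry over detected keypoints; a minimal sketch (generic geometry, not the thesis's pipeline):

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c, e.g. the elbow
    angle from shoulder, elbow, and wrist keypoints in image coordinates."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    # Clamp guards against floating-point values just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def rmse(estimates, reference):
    """Root mean squared error between estimated and reference angle
    series, as one would use to score a phone pipeline against Vicon."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimates, reference))
                     / len(estimates))
```

For example, shoulder (0, 0), elbow (1, 0), and wrist (1, 1) give a 90° elbow angle.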
  • Item type: Item ,
    SMaT-HSI: Structure-aware Mamba-Transformer Hybrid Model for Hyperspectral Image Classification
    (University of Waterloo, 2025-09-11) Liu, Yaxuan
    Hyperspectral image (HSI) classification is a crucial task in remote sensing, playing a fundamental role in environmental monitoring, precision agriculture, urban planning, and mineral exploration. By leveraging the rich spectral information across hundreds of contiguous bands, HSI classification enables precise identification of materials and land cover types, facilitating accurate mapping of vegetation, soil, water bodies, and built environments. Traditional convolutional neural network (CNN)-based methods effectively extract local spatial features, while transformer-based models excel in capturing global contextual dependencies. However, both approaches face challenges in fully leveraging the spectral and spatial dependencies inherent in hyperspectral data. Recently, Mamba, a state-space model (SSM)-based architecture, has shown promise in sequence modeling by efficiently capturing long-range dependencies with linear computational complexity. A comprehensive comparison of CNN-based, transformer-based, and Mamba-based models for HSI classification reveals that Mamba-based models achieve performance comparable to transformer-based models, highlighting their potential in this domain. Current Mamba-based methods often convert images into one-dimensional sequences and use scanning strategies to capture local spatial and spectral dependencies. However, these approaches struggle to fully represent the intricate spectral-spatial structures in HSIs and introduce computational redundancy. To address this, a structure-aware state fusion mechanism is proposed to explicitly model the spatial and spectral relationships of neighboring features in the latent state space, enabling more efficient and accurate representation learning. To further improve the capture of global context and long-range spatial dependencies, a hybrid Mamba-transformer architecture is explored. 
Different integration strategies are investigated, including inserting transformer blocks in the early, middle, and final layers, as well as at regular intervals. Analysis indicates that incorporating a self-attention block in the final layer achieves the highest average overall accuracy of 97.58% across the five datasets. The proposed approach is evaluated on five publicly available benchmark datasets (Indian Pines, Pavia University, Houston 2013, WHU-Hi-HanChuan, and WHU-Hi-HongHu), demonstrating an average overall accuracy improvement of 0.87% compared to the baseline model and competitive results against existing transformer-based and Mamba-based models. These findings underscore the potential of combining Mamba and transformer architectures for efficient and accurate hyperspectral image classification, offering new insights into advanced sequence modeling for remote sensing applications.
  • Item type: Item ,
    Investigating AI-Enabled Space Debris Characterization and Adversarial Resilience
    (University of Waterloo, 2025-09-08) Adriano, Anne
The accumulation of space debris in Earth’s orbit has emerged as a major concern for the safety and sustainability of space operations. As more satellites are launched and breakup events occur, the density of debris continues to grow, increasing the likelihood of collisions with active spacecraft. To maintain reliable space operations, there is a pressing need for methods that can accurately identify and characterize debris, enabling improved tracking, collision avoidance, and long-term management of the orbital environment. This thesis investigates the application of machine learning and deep learning models to classify and characterize space debris based on unique synthetic light curve data. Beyond the generation of the light curve dataset, three characterization experiments are presented in this work: (1) attitude classification using Extreme Gradient Boosting (XGBoost) and Wavelet Scattering Transform (WST) features, (2) shape classification using a Long Short-Term Memory (LSTM) network, Fully Convolutional Network (FCN), and a hybrid Long Short-Term Memory – Fully Convolutional Network (LSTM-FCN) model, and (3) multi-task learning for simultaneous shape and material classification using the hybrid model. A fourth component of the thesis evaluates the LSTM-FCN model’s robustness against adversarial attacks generated using gradient-based methods. This adversarial study leveraged real publicly available debris and satellite light curve data from the Mini-MegaTORTORA (MMT) database. Results showed that the WST data augmentation method significantly improved classification performance for XGBoost by capturing multiscale frequency features. The LSTM-FCN model outperformed both standalone LSTM and FCN models in shape classification tasks, while the multi-task architecture further enhanced performance by leveraging inter-task dependencies. The adversarial study revealed that FCN-based surrogate models can produce highly effective attacks against the LSTM-FCN.
It was also shown that when combating FCN-based attacks, common filtering-based defenses such as moving-average and wavelet filters are generally insufficient. This work concludes that integrating Artificial Intelligence (AI) into Space Domain Awareness (SDA) is crucial for managing the growing challenge of space debris, but also emphasizes the need to defend such systems from tampering and perturbation. Model reliability is foundational to the future of autonomous space operations and the protection of critical space-based services that support life on Earth.
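A gradient-sign attack on a light curve, and the moving-average filtering that the study finds generally insufficient as a defense, can be sketched as follows. The surrogate gradient is passed in as a plain array here, so this shows the mechanics rather than the thesis's actual surrogate-model setup:

```python
import numpy as np

def fgsm_attack(light_curve, grad, epsilon=0.05):
    """FGSM-style perturbation: step each sample of the light curve in the
    sign of the loss gradient. In practice `grad` would come from a
    surrogate (e.g. FCN) model; here it is supplied directly (illustration)."""
    return light_curve + epsilon * np.sign(grad)

def moving_average_defense(light_curve, window=5):
    """Filtering defense: smooth the curve with a moving average. Such
    filters remove high-frequency noise but leave a systematic +/-epsilon
    bias largely intact, which is one reason they can fail."""
    kernel = np.ones(window) / window
    return np.convolve(light_curve, kernel, mode="same")
```

Note that a constant-sign perturbation survives the averaging almost unchanged, illustrating why simple filters underperform against structured attacks.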
  • Item type: Item ,
    Advancing Disaggregate Modeling of Electric Vehicle Charging Behaviour
    (University of Waterloo, 2025-09-02) Shakhova, Diana
The growing adoption of electric vehicles (EVs) poses both opportunities and challenges for electricity grid management, where management strategies vary by region based on demand, climate, and portfolio of generation types. In Ontario, Canada, where the energy mix is dominated by baseload nuclear and hydroelectric generation and where residential EV charging is common but not universal, understanding the timing and location of EV charging is critical for infrastructure planning and system reliability. This thesis focuses on non-overnight EV charging behaviour, that is, charging events that occur during the day, mostly away from home, and explores how infrastructure access and electricity pricing may influence the 24-hour distribution of EV electricity demand. The research addresses the question: do infrastructure location and pricing conditions influence the temporal distribution of non-overnight electric vehicle charging demand in Ontario? To answer this, a novel simulation model was developed that integrates real-world travel data from the 2016 Transportation Tomorrow Survey (TTS) with charging decisions predicted using a discrete choice model estimated from a custom stated-preference survey of current EV users. This survey, still ongoing at the time of writing, has collected over 5,900 responses across 300+ participants, capturing variation in price elasticity, stop duration, charger type, and other contextual factors that influence away-from-home charging. The simulation assumes a 10% EV adoption scenario across the Greater Toronto and Hamilton Area (GTHA), generating 24-hour electricity demand profiles under six distinct combinations of infrastructure access and pricing. Results suggest that infrastructure availability may be the primary determinant of when charging occurs throughout the day, while pricing has a stronger influence on how much charging takes place.
Scenarios with free and widespread public access produce higher daytime demand, while constrained infrastructure and high pricing result in lower, more diffuse load patterns. However, charging patterns, such as the concentration of demand early in the day, appear sensitive to model assumptions, particularly morning state-of-charge (SoC) initialization and simplified home charging representation. These findings underscore the importance of model calibration and choice behaviour realism in demand modeling efforts. The implications of this work, while preliminary, point to the need for coordinated planning of charging infrastructure and pricing policies that consider charging behaviour as well as actual trip patterns and regional energy system characteristics. The research contributes both a flexible simulation framework and a charging choice model estimated from stated charging behaviour that can be expanded for future planning scenarios. Next steps include further model refinement, validation using real-world charging session data, and explicit inclusion of populations without home charging access. As EV adoption continues to grow, the tools developed in this thesis provide a foundation for anticipating and managing its impact on electricity demand.
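The charging decisions in the simulator come from a discrete choice model; a multinomial logit sketch with illustrative (assumed) coefficients and alternatives shows the mechanics of turning price and stop duration into a charge-or-skip probability:

```python
import math

def charge_probability(utilities):
    """Multinomial logit choice probabilities over alternatives, e.g.
    {"charge": u1, "skip": u2}. Standard logit form; the alternatives
    here are illustrative, not the survey's actual choice set."""
    m = max(utilities.values())                      # numerical stability
    exps = {alt: math.exp(u - m) for alt, u in utilities.items()}
    total = sum(exps.values())
    return {alt: e / total for alt, e in exps.items()}

def charge_utility(price_per_kwh, stop_minutes, beta_price=-2.0, beta_dur=0.01):
    """Illustrative utility: price deters charging, longer stops encourage
    it. The beta coefficients are placeholders, not estimated values."""
    return beta_price * price_per_kwh + beta_dur * stop_minutes
```

With free charging during a long (120-minute) stop the model favours charging; with expensive charging during a short stop it favours skipping, mirroring the price-versus-availability pattern reported above.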