
Browsing by Author "Elayat, Omar"

    Acceleration of Integer Transformer Models Via Structured Resource Management Using FPGAs
    (University of Waterloo, 2025-08-19) Elayat, Omar
The widespread adoption of Large Language Models (LLMs) in various applications has pushed the demand for efficient hardware acceleration beyond the capabilities of traditional platforms. Owing to their highly parallel architecture and ease of deployment, Field Programmable Gate Arrays (FPGAs) are widely used to accelerate LLMs. However, FPGAs' on-chip memory resources are too limited to accommodate the trained models. While existing FPGA-based solutions have demonstrated promising throughput and energy efficiency, they often rely on abundant fabric resources, assume high-bandwidth devices unsuitable for deployment at the edge, or employ highly customized acceleration architectures that do not scale with advances in LLM architectures. This thesis addresses these challenges by proposing a novel on-chip resource manager architecture for integer encoder-based transformer inference, with a focus on Bidirectional Encoder Representations from Transformers (BERT) models. We target resource-constrained FPGAs with limited memory bandwidth and show that structured operation scheduling and resource sharing yield significant performance improvements. The proposed resource-shared infrastructure is also modular: newly introduced computation blocks can be integrated into the accelerator without major modifications or additional off-chip data movement. Demonstrated on a fully quantized integer-only variant of the BERT model as a representative workload, the proposed system achieves a 2.32x latency improvement over the baseline custom accelerator, 1.17x over the Jetson Orin Nano GPU, and at least 23.63x over a CPU. The design is validated on two FPGAs: the PYNQ-Z1 as a low-end proof of concept and the KV260 as a mid-range deployment target.
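The "fully quantized integer-only" inference the abstract refers to can be illustrated with a toy sketch. This is not the thesis's design; the symmetric per-tensor quantization scheme, the int8/int32 widths, and all values below are illustrative assumptions, chosen because they are common in integer-only BERT variants. The idea: weights and activations are mapped to int8, the matrix-vector product accumulates in integer arithmetic only, and a single dequantization step recovers floats at the output.

```python
# Illustrative sketch only (not the thesis design): integer-only
# matrix-vector inference with int8 operands and int32 accumulation,
# so the hot datapath needs no floating-point units.

def quantize(x, scale):
    """Map a float to int8 using a per-tensor scale (assumed
    symmetric quantization, clamped to the int8 range)."""
    q = round(x / scale)
    return max(-128, min(127, q))

def int_matvec(W_q, x_q):
    """Integer matrix-vector product; Python ints stand in for
    the int32 accumulators a hardware MAC array would use."""
    return [sum(w * x for w, x in zip(row, x_q)) for row in W_q]

# Toy example: 2x3 weight matrix times a length-3 activation vector.
w_scale, x_scale = 0.05, 0.1
W = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
x = [1.0, -0.5, 0.3]

W_q = [[quantize(w, w_scale) for w in row] for row in W]
x_q = [quantize(v, x_scale) for v in x]

acc = int_matvec(W_q, x_q)                 # integer accumulators
y = [a * w_scale * x_scale for a in acc]   # dequantize once, at the output
```

Keeping the datapath integer-only is what allows an FPGA accelerator to map multiplications onto DSP slices and avoid floating-point logic, which is one reason integer BERT variants suit resource-constrained devices.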
