Analysis of a Pipelined Architecture for Sparse DNNs on Embedded Systems
A. Alcolea, J. Olivito, J. Resano, H. Mecha.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020
Deep neural networks (DNNs) are increasingly present in a wide range of applications, and their computationally intensive and memory-demanding nature poses challenges, especially for embedded systems. Pruning techniques make DNN models sparse by setting most weights to zero, offering optimization opportunities when specific support is included. We propose a novel pipelined architecture for DNNs that avoids all useless operations during the inference process. It has been implemented on a field-programmable gate array (FPGA), and its performance, energy efficiency, and area have been characterized. Exploiting sparsity yields remarkable speedups but also produces area overheads. We have evaluated this tradeoff to identify the scenarios in which it is better to use that area to exploit sparsity and those in which it is better to include more computational resources in a conventional DNN architecture. We have also explored different arithmetic bitwidths. Our sparse architecture is clearly superior with 32-bit arithmetic or highly sparse networks. However, with 8-bit arithmetic or networks with low sparsity, it is more profitable to deploy a dense architecture with more arithmetic resources than to include support for sparsity. We consider FPGAs the natural target for sparse DNN accelerators, since they can be loaded at run time with the best-fitting accelerator.
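The architecture itself is hardware, but the operations it removes are easy to see in software. Below is a minimal pure-Python sketch (toy matrix and illustrative names, not the paper's design) of how a CSR-encoded sparse layer performs only the non-zero multiplications that a dense layer wastes on pruned weights:

```python
def dense_layer(W, x):
    """Dense matrix-vector product: multiplies every weight,
    including the zeros introduced by pruning."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sparse_layer(values, col_idx, row_ptr, x):
    """CSR matrix-vector product: stores and multiplies only the
    non-zero weights, skipping all useless operations."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# A pruned 3x4 weight matrix: most entries are zero.
W = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0]]
# The same matrix in CSR form: 3 multiplications instead of 12.
values, col_idx, row_ptr = [2.0, 1.0, 3.0], [1, 0, 3], [0, 1, 3, 3]

x = [1.0, 1.0, 1.0, 1.0]
assert dense_layer(W, x) == sparse_layer(values, col_idx, row_ptr, x)
```

The speedup grows with sparsity: a pipelined hardware datapath iterates only over `values`, so cycles scale with the non-zero count rather than the full matrix size, at the cost of the index-handling logic that causes the area overhead discussed above.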
Analysis of the reconfiguration latency and energy overheads for a Xilinx Virtex-5 FPGA
J. Olivito, F. Serrano, J. A. Clemente, H. Mecha, J. Resano.
IET Computers & Digital Techniques, 2017
In this study, the authors have evaluated the overhead and the tradeoffs of a set of components usually included in a system with run-time partial reconfiguration implemented on a Xilinx Virtex-5. The authors’ analysis shows the benefits of including a scratchpad memory inside the reconfiguration controller in order to improve the efficiency of the reconfiguration process. They have designed a simple controller for this scratchpad that includes support for prefetching and caching in order to further reduce both the energy and latency overhead.
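As a software analogy of the mechanism described above, the toy model below shows why a scratchpad with caching and prefetching reduces reconfiguration stalls. Names, capacities, and latencies are illustrative, not the paper's measured figures:

```python
class ReconfigController:
    """Toy model of a reconfiguration controller whose scratchpad
    caches bitstreams and supports prefetching. Fetching from external
    memory is the slow, energy-costly path; a scratchpad hit (or a
    prefetched bitstream) avoids stalling the system."""

    def __init__(self, capacity, fetch_stall=100, hit_stall=1):
        self.scratchpad = {}
        self.capacity = capacity
        self.fetch_stall, self.hit_stall = fetch_stall, hit_stall
        self.stall_cycles = 0

    def _insert(self, bitstream):
        if len(self.scratchpad) >= self.capacity:
            self.scratchpad.pop(next(iter(self.scratchpad)))  # evict oldest (FIFO)
        self.scratchpad[bitstream] = True

    def prefetch(self, bitstream):
        """Load a bitstream ahead of time; the transfer overlaps with
        computation, so it adds no stall cycles."""
        if bitstream not in self.scratchpad:
            self._insert(bitstream)

    def reconfigure(self, bitstream):
        if bitstream in self.scratchpad:
            self.stall_cycles += self.hit_stall    # cached: near-zero latency
        else:
            self._insert(bitstream)
            self.stall_cycles += self.fetch_stall  # miss: fetch from external memory

ctrl = ReconfigController(capacity=2)
ctrl.reconfigure("accel_a")   # cold miss: full fetch latency
ctrl.prefetch("accel_b")      # loaded early, hidden behind computation
ctrl.reconfigure("accel_b")   # hit thanks to prefetching
ctrl.reconfigure("accel_a")   # hit thanks to caching
assert ctrl.stall_cycles == 100 + 1 + 1
```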
Accelerating Board Games Through Hardware/Software Codesign
J. Olivito, J. Resano, J.L. Briz.
IEEE Transactions on Computational Intelligence and AI in Games, 2016
Board-game applications usually offer a great user experience when running on desktop computers. Powerful high-performance processors working without energy restrictions successfully deal with the exploration of large game trees, delivering strong play that satisfies demanding users. However, more and more players nowadays run these games on smartphones and tablets, where the lower computational power and limited power budget yield much weaker play. Recent systems-on-chip include programmable logic tightly coupled with general-purpose processors, enabling the inclusion of custom accelerators for any application to improve both performance and energy efficiency.
In this paper, we analyze the benefits of partitioning the artificial intelligence of board games between software and hardware. We have chosen three popular and complex board games as case studies: Reversi, Blokus, and Connect6. The analyzed designs include hardware accelerators for board processing, which improve performance and energy efficiency by an order of magnitude, leading to much stronger and battery-aware applications. The results demonstrate that using hardware/software codesign to develop board games makes it possible to sustain or even improve the user experience across platforms while keeping power and energy consumption low.
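The partition described above can be sketched in software: the host keeps the classical alpha-beta tree search, while board processing (move generation and position evaluation) is the kernel mapped onto hardware. The sketch below uses a toy game tree and illustrative names, not the paper's actual case studies:

```python
import math

def alpha_beta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Classic alpha-beta search. In a hardware/software partition, the
    software keeps this traversal, while `children` (move generation)
    and `evaluate` (board scoring) are the board-processing kernels
    that can be offloaded to custom hardware."""
    succ = children(node)
    if depth == 0 or not succ:
        return evaluate(node)
    if maximizing:
        value = -math.inf
        for child in succ:
            value = max(value, alpha_beta(child, depth - 1, alpha, beta,
                                          False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cut-off: the rest of this subtree is useless
        return value
    else:
        value = math.inf
        for child in succ:
            value = min(value, alpha_beta(child, depth - 1, alpha, beta,
                                          True, children, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cut-off
        return value

# Toy game tree as nested lists; leaves are static board scores.
tree = [[3, 5], [2, 9], [0, 7]]
children = lambda n: n if isinstance(n, list) else []
evaluate = lambda n: n
assert alpha_beta(tree, 2, -math.inf, math.inf, True, children, evaluate) == 3
```

Because the leaf-level kernels dominate the run time, accelerating them by an order of magnitude translates almost directly into a deeper search within the same time budget.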
Performance and energy efficiency analysis of a Reversi player for FPGAs and general purpose processors
J. Olivito, R. Gran, J. Resano, C. González, E. Torres.
Microprocessors and Microsystems, 2015
Board-game applications are frequently found on mobile devices, where computing performance and the energy budget are constrained. Since the artificial intelligence techniques applied in these games are computationally intensive, the applications developed for mobile systems are frequently simplistic, far from the level of equivalent applications developed for desktop computers.
Currently, board games are software applications executed on general-purpose processors. However, they exhibit a moderate degree of parallelism, and a custom hardware accelerator implemented on an FPGA can take advantage of it.
We have selected the well-known Reversi game as a case study because it is a very popular board game with simple rules but huge computational demands. We developed and optimized software and hardware designs for this game that apply the same classical artificial intelligence techniques. The applications have been executed on several representative platforms, and the results demonstrate that the FPGA implementations provide better performance, lower power consumption and, therefore, impressive energy savings. These results demonstrate that FPGAs can deal efficiently with this kind of problem.
An improved FPGA-based specific processor for Blokus Duo
J. Olivito, A. Delmás, J. Resano.
International Conference on Field-Programmable Technology, 2014
This article presents a hardware design of a specific processor for the Blokus Duo game. This design is an evolution of our previous work presented at the ICFPT'13 Design Competition. To improve its performance, we have designed parallel hardware blocks that speed up the most time-consuming tasks and included additional techniques that reduce the search space. As a consequence, we can process a board six times faster than our previous version and prune the game tree much more efficiently.
An FPGA-based specific processor for Blokus Duo
J. Olivito, C. González, J. Resano.
International Conference on Field-Programmable Technology, 2013
In this article, we present the design of a specific processor for the Blokus Duo game. This design was submitted to the ICFPT'13 Design Competition and implemented on a low-cost Spartan-6 FPGA. Our player applies several techniques to identify which movements are potentially interesting and then performs a tree search to evaluate the consequences of each of those options. To achieve an efficient implementation, we have developed custom modules to manage the board and to identify whether a block can be placed at a given vertex. The results demonstrate that our design is competitive, even against advanced Blokus Duo players such as the Pentobi application, considered the best available software player.
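The vertex check mentioned above follows from the Blokus placement rules: every covered cell must be empty, a piece may never touch the player's own colour along an edge, and it must touch it at a corner (vertex). A minimal Python sketch of that legality test (illustrative names; board-boundary checks omitted for brevity):

```python
def can_place(board, cells, player):
    """Blokus placement legality: all covered cells empty, no edge
    contact with the player's own colour, and at least one diagonal
    (corner) contact with it. `board` maps (row, col) to an owner;
    `cells` is the set of cells the piece would cover."""
    edge = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    diag = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    corner_contact = False
    for (r, c) in cells:
        if board.get((r, c)) is not None:
            return False                      # cell already occupied
        if any(board.get((r + dr, c + dc)) == player for dr, dc in edge):
            return False                      # edge contact with own colour is illegal
        if any(board.get((r + dr, c + dc)) == player for dr, dc in diag):
            corner_contact = True             # touches own colour at a vertex
    return corner_contact

board = {(0, 0): "P1"}
assert can_place(board, {(1, 1), (1, 2)}, "P1")   # corner contact at (0, 0)
assert not can_place(board, {(0, 1)}, "P1")       # edge contact: illegal
```

In hardware, all cells of a candidate placement can be checked in parallel in a single cycle, which is what makes a dedicated board-management module attractive.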
FPGA implementation of a strong Reversi player
J. Olivito, C. González, J. Resano.
International Conference on Field-Programmable Technology, 2010
In this article, we present the design of a Reversi player submitted to the FPT'10 Design Competition and implemented on an XC2VP30 Virtex-II Pro FPGA. Our player applies several techniques to explore the solution space, attempting to look as many moves ahead as possible in the given time, and uses several metrics to evaluate the quality of a given board. The most important metric is mobility: our player attempts to maximise its own available moves while minimising the opponent's. With these techniques, our player easily defeats the competition's software opponent.
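The mobility metric can be made concrete: count each side's legal moves and take the difference. A plain-Python sketch (illustrative, not the FPGA implementation):

```python
def legal_moves(board, player):
    """Count the legal Reversi moves for `player` on an 8x8 board
    (0 = empty, 1 / -1 = the two players). A move is legal on an empty
    cell that flanks at least one run of opponent discs in some
    direction, ending in one of the player's own discs."""
    directions = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    count = 0
    for r in range(8):
        for c in range(8):
            if board[r][c] != 0:
                continue
            for dr, dc in directions:
                rr, cc, seen_opp = r + dr, c + dc, False
                while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
                    rr, cc, seen_opp = rr + dr, cc + dc, True
                if seen_opp and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
                    count += 1
                    break  # one flanking direction is enough
    return count

def mobility(board, player):
    """The mobility metric: our legal moves minus the opponent's."""
    return legal_moves(board, player) - legal_moves(board, -player)

# Standard opening position: each side has 4 legal moves, so mobility is 0.
board = [[0] * 8 for _ in range(8)]
board[3][3], board[4][4] = -1, -1
board[3][4], board[4][3] = 1, 1
assert mobility(board, 1) == 0
```

The 64 cells can be examined independently, which is exactly the kind of parallelism an FPGA board-processing module exploits to compute mobility in a handful of cycles.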
An initial specific processor for Sudoku solving
C. González, J. Olivito, J. Resano.
International Conference on Field-Programmable Technology, 2009
In this article, we present the design of a Sudoku solver submitted to the FPT Design Competition. Using only the on-chip resources of an XC2VP30 Virtex-II Pro FPGA, we have designed a specific processor that can solve Sudokus from order 3 to 11. This processor applies a branch-and-bound approach to explore the solution space. However, this solution space is too large, and the time needed to solve the Sudokus is frequently excessive. To improve the design, we implemented an equivalent software version and added several heuristics that reduce the solution space. The results have shown that these heuristics can drastically speed up the search. Hence, we have included a few of them in the processor: Singles, Hidden Singles, Hidden Pairs, Hidden Triplets, and Hidden Quartets, which are well known in the Sudoku literature. The design could still be improved by including other heuristics, such as Locked Candidates or Naked Candidates. Moreover, it is possible to extend the design to larger Sudokus, since all the modules are customizable; however, since the on-chip memory resources are very limited, an external DDR RAM would be needed.
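Hidden Singles, one of the listed heuristics, fits in a few lines: if a digit can go in only one cell of a unit (row, column, or box), it must be placed there. The sketch below uses a small made-up candidate set, not the processor's data layout:

```python
def hidden_singles(candidates, units):
    """Hidden Singles heuristic: for each unit, if a digit is possible
    in exactly one of its unsolved cells, that placement is forced.
    `candidates` maps each unsolved cell to its set of possible digits;
    `units` is a list of cell groups. Returns the forced placements."""
    placements = {}
    for unit in units:
        for digit in range(1, 10):
            cells = [c for c in unit
                     if c in candidates and digit in candidates[c]]
            if len(cells) == 1:
                placements[cells[0]] = digit
    return placements

# Toy example: in this unit, digit 7 fits only in cell 'A3'
# (digits 1-3 each fit more than one cell).
candidates = {'A1': {1, 2}, 'A2': {1, 2, 3}, 'A3': {3, 7}}
units = [['A1', 'A2', 'A3']]
assert hidden_singles(candidates, units) == {'A3': 7}
```

Each forced placement shrinks the candidate sets of its peers, which is why chaining such heuristics before branching prunes the branch-and-bound search so effectively.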