A Template-Based Design Methodology for Graph-Parallel Hardware Accelerators

Andrey Ayupov, Serif Yesil, Muhammet Mustafa Ozdal, Taemin Kim, Steven Burns, and Ozcan Ozturk

Abstract—Graph applications have been gaining importance in the last decade due to emerging big data analytics problems such as web graphs, social networks, and biological networks. For these applications, traditional CPU and GPU architectures suffer in terms of performance and power consumption due to irregular communications, random memory accesses, and load balancing problems. It has been shown that specialized hardware accelerators can achieve much better power and energy efficiency compared to the general purpose CPUs and GPUs. In this work, we present a template-based methodology specifically targeted for hardware accelerator design of big-data graph applications. Important architectural features that are key for energy efficient execution are implemented in a common template. The proposed template-based methodology is used to design hardware accelerators for different graph applications with little effort. Compared to an application-specific high-level synthesis (HLS) methodology, we show that the proposed methodology can generate hardware accelerators with up to 18x better energy efficiency and requires less design effort.

I. INTRODUCTION

Customizing hardware for a target domain of applications can lead to significant power and performance improvements. Especially in the dark silicon era, hardware accelerators play an important role to significantly improve the energy efficiency of compute systems. The basic idea is to integrate specialized hardware accelerators targeted for frequently executed workloads so that these workloads can be executed orders of magnitude more efficiently compared to general purpose CPUs. IBM’s Wire-Speed Processor [11] is an example, where a number of fixed-function hardware accelerators are integrated with processing cores.

The importance of graph applications have been increasing in the last decade especially due to emerging big data problems such as knowledge discovery for web data, social network analysis, natural language processing, and bioinformatics. The typical objective is to extract information from interactions (graph edges) between different entities (graph vertices). Graph analysis is known to be different from traditional grid-based high performance computing because of irregular communication, little data locality, low computation to communication ratio, frequent synchronization requirements, and hard-to-predict work assignment [13].

The performance bottleneck of big data graph applications is typically the DRAM access latency due to the low compute-to-memory ratios and random memory access patterns [7]. Most previous works on hardware accelerators assume that data resides at a local memory with fixed latency. This is typically ensured by processing one partition at a time and overlapping computation of the current partition with the communication of the next partition. However, the large graphs of big data applications and the poor temporal/spatial locality of these real-world graphs make the aforementioned techniques impractical.

Instead, an accelerator is expected to generate many concurrent DRAM requests to be able to hide long (typically hundreds of cycles) latencies and fully utilize the available DRAM bandwidth. It has been shown that hardware accelerators can operate at power levels that are much lower than the state-of-the-art multi-core CPUs [26].

Let us consider a design company that needs to choose which algorithms to accelerate in hardware, e.g. for custom server chips. Let us also assume that this decision will be made based on analyzing the performance benefits and the power/area costs for each application. Following the traditional RTL-based design methodology can lead to weeks or months of development time just to obtain accurate power, performance, and area estimation for one application. This development time can be acceptable for final hardware implementation. However, if there are many potential applications that need to be evaluated this way, this methodology becomes impractical. Instead, one can use High Level Synthesis (HLS) tools to model each application at a higher level to reduce the development and debug time. Although HLS tools are very effective for applications with regular compute patterns [15, 23, 27, 30], they still require low level modeling - hence high development costs - for irregular applications such as graph algorithms, as will be discussed in Section III.

In this paper, we propose a template-based automation methodology for the design of hardware accelerators for graph applications. For this, we rely on the software abstraction models proposed for distributed graph processing frameworks [20].

The contributions of this paper can be summarized as follows:

• We show the limitations of HLS-based accelerator design methodologies for irregular graph algorithms.

• We propose a template-based design methodology for iterative graph algorithms. Since the template is created only once and utilized across many applications, the design effort is amortized. This allows easily incorporating into the generated accelerators important architectural features that are expensive to develop.

• We propose a design-space exploration algorithm that is specifically targeted for graph accelerator architectures. This helps designers choose the desired tradeoff between per-
formance and area/power without the need for low-level parameter tuning.

- We provide a detailed power, performance, and area comparison of graph accelerators generated using 22nm industrial libraries for three different applications. The comparison is done between hardware generated by 1) direct HLS methodology using traditional bulk-synchronous models, 2) the proposed template-based methodology, and 3) a state-of-the-art CPU system. We show that the direct HLS methodology requires more design effort to implement a simpler accelerator architecture, while the proposed methodology can hide the architectural complexities within the common template and achieve up to 18x better energy efficiency.

The rest of the paper is organized as follows. In Section II, we provide background on irregular graph applications and summarize the previous studies in this area. In Section III, we outline the challenges of using an HLS methodology to design accelerators for graph applications. We also describe a generic pipeline for a baseline HLS architecture in that section. In Section IV, we briefly describe the proposed architecture for irregular graph applications. Our proposed template-based design methodology is presented in Section V. Finally, our experimental results are provided in Section VI.

II. BACKGROUND AND RELATED WORK

Efficient execution of a graph algorithm requires both high throughput computation and work efficiency. Throughput of computation can be defined as the number of vertices or edges processed per unit time, and existing studies typically focus on this metric. On the other hand, work efficiency is defined based on the number of vertices or edges processed to complete a given task. This metric is especially important for iterative graph algorithms where a certain convergence criteria needs to be satisfied for completion. There are two factors that directly affect work efficiency of iterative graph algorithms: asynchronous execution and active vertex set.

In a bulk synchronous implementation of a graph algorithm, there are global barriers between iterations, and only the data from the previous iteration can be used. In contrast, an asynchronous implementation allows vertices to access the latest data from neighbors, allowing them to see the updates done in the same iteration. It was shown that asynchronous execution can lead to about 2x faster convergence for some graph algorithms [20, 25].

Another implementation aspect that affects work efficiency of iterative graph algorithms is whether all vertices are processed in every iteration or not. It was shown in the aforementioned works that vertices converge at different speeds and significant work efficiency can be achieved by not processing the vertices that converge earlier than others. For this purpose, the set of active (not-yet-converged) vertices can be maintained. The architectural requirements for such graph applications were expressed in [25] and it was shown that the features that lead to better work efficiency may lead to lower throughput on multicore CPU systems.

Other related work can be summarized as follows. Graph processing is a widely studied problem at both software and hardware levels. A wide range of distributed software frameworks are available, such as Google’s Pregel [21] and Graphlab [20]. Software-level optimizations for existing massively parallel architectures have been proposed such as [19] and [34]. There have also been several proposals for accelerators for specific graph applications. Both PageRank [22] and variations of breadth-first-search and single-source shortest path algorithms [6, 8, 17, 31, 33] have been implemented for FPGAs and ASICs. In addition, there are other studies that attempt to provide abstract templates for regular graph processing [12, 24]. However, these models focus on a bulk synchronous execution model and do not consider the work efficiency aspects described above. Furthermore, they are not well-suited for irregular graph applications with poor-locality of memory accesses.

III. HLS-BASED ACCELERATOR DESIGN

Although HLS tools can be used effectively to design accelerators for compute-oriented applications with regular memory access patterns, it is hard to use them directly on the software implementations of irregular graph applications. First of all, it is not viable to store very large graphs in local memories of an accelerators. Furthermore, as mentioned in Section I, processing one partition at a time is not practical due to the irregular memory access patterns. Hence, the designed accelerator needs to be able to make requests to system memory and be able to hide access latency by scheduling multiple memory requests concurrently. In this section, we describe how to model graph accelerators using HLS tools directly as an alternative to the proposed template based methodology.

In this paper, we assume that the input graphs are stored in the well-known Compressed Sparse Row (CSR) format, where two arrays are used to store the graph topology, denoted in this paper as VertexInfo and EdgeInfo. Entry VertexInfo[i] stores the topology information for vertex i: the corresponding edge offset in EdgeInfo array and the number of incoming/outgoing edges (for directed graphs). Similarly, EdgeInfo[i] stores the neighboring vertex index for edge i. In addition, the application data associated with each vertex and/or edge is stored in arrays VertexData and EdgeData.

For HLS implementations, we have implemented a simple memory subsystem to access DRAM. This model supports multiple outstanding requests to enable memory-level parallelism so that long DRAM latencies can be hidden. For processing vertices, we have adopted the bulk synchronous processing (BSP) model of execution due to its simplicity. In this model, two copies are stored for each vertex/edge data object, corresponding to the previous and current iterations. When a vertex needs to access the data of a neighbor vertex/edge, it reads the data from the previous iteration. Since iterations are separated by barriers, race conditions are avoided.

The microarchitecture we use for the HLS-based design of an application-specific graph accelerator can be represented as the pipeline shown in Figure 1. In this pipeline, vertices are processed in order, however there can be up to 128 vertices being processed at different stages of the pipeline at the same time. Processing many vertices in parallel results in many memory requests originating from different stages of the pipeline and helps utilize the available DRAM bandwidth. We were able to achieve close to peak memory bandwidth utilization with this architecture for the different accelerators we implemented.

Figure 1 shows a generic pipeline without implementation details. This pipeline needs to be implemented for each graph
For each vertex, request vertex info
For each vertex info, request incoming edge info
For each edge info, request neighbor vertex info
Iterate over neighbor vertex data and update local vertex data

Memory subsystem
DRAM

Figure 1: Architecture for application-specific HLS designs

application separately by designers. Here, the VertexInfo and EdgeInfo requests can be made streaming, because the algorithms implemented process all vertices and edges in order. Neighbor vertex data is the only data type that is accessed from memory in an irregular fashion, but it is the performance bottleneck. According to our experiments, a generic cache of size up to 64KB does not provide any significant advantage due to the majority of cache accesses resulting in cache misses. Thus, we do not include a cache in the architecture of the application-specific accelerators we implemented.

Note that we chose to implement a BSP execution model instead of implementing the work-efficient asynchronous model described in Section II. Asynchronous execution requires synchronization between concurrently executing vertices and edges to ensure sequential consistency. This requires a locking mechanism for vertices to avoid Read-After-Write (RAW) and Write-After-Read (WAR) hazards. Vertex locking in turn requires out-of-order execution support for vertex processing that also makes the architecture more complicated.

Similarly, we chose not to implement the active vertex set support described in Section II. Implementing active vertex set support requires a special data structure that has at least one bit (active vs. converged) for all vertices in the graph, which needs to be stored in system memory. Given frequent accesses to the active list, some special caching capability is required with coherence support to guarantee data consistency for potentially multiple vertices requesting and modifying the vertex state from active to converged and vice versa.

Implementing asynchronous execution and active vertex set support efficiently would require cycle-level micro-architectural modeling of complex synchronization mechanisms and special data structures. Using a high-level model (e.g. SystemC) is not much different from RTL for such complex modules, and hence the benefit of HLS tools would be limited. Hence, we chose not to implement them because they are not representative of a typical HLS-based design flow.

Note that designing each accelerator using low-level RTL models requires too much design effort and is not in line with the objectives of this paper. Because of this reason, the application-specific accelerators designed using the HLS methodology of this section will be used as a baseline in our experiments (Section VI).

IV. PROPOSED ARCHITECTURE

The preliminary version of this paper has proposed several microarchitectural features to achieve both high throughput and high work efficiency for asynchronous execution of graph applications [26]. The basic idea is to allow processing tens/hundreds of vertices/edges to be able to hide long access latencies to main memory. The potential hazards and race conditions between simultaneously executed vertices/edges are handled through special mechanisms in the architecture.

Figure 2 shows the internal architecture of a single accelerator unit proposed. This is a loosely-coupled accelerator connected to the system DRAM directly. As shown in the figure, the accelerator architecture consists of several components, which will be explained here briefly. Active List Manager (ALM) is responsible for keeping the set of active vertices that need to be processed before convergence. Runtime Unit (RT) receives vertices from ALM and schedules them for execution based on the resource availability in the system. RT sends the next vertex to Sync Unit (SYU) to start its execution. SYU is responsible for making sure that all edges and vertices are processed such that sequential consistency is guaranteed. SYU checks and avoids the potential read-after-write (RAW) and write-after-read (WAR) hazards. Then, SYU sends the vertices to the Gather Unit (GU), which is the starting point of vertex program execution. It executes the gather operation as will be discussed in Section V-B. An important feature of GU is that it can process tens of vertices and hundreds of edges to hide long access latencies to the system memory. It can also switch between many small-degree vertices and few large-degree vertices for dynamic load balancing. After GU is done with the gather operation of a vertex, its state data is sent to the Apply Unit (APU), which performs the main computation for the vertex. After APU is done, vertex data is passed to the Scatter Unit where the scatter operation (Section V-B) will be done. In this stage, the neighboring vertices can be activated (i.e. inserted into the active list) based on application-specific conditions. Similar to GU, SCU also processes tens/hundreds of vertices/edges concurrently. In addition to the computational modules, there is a special memory subsystem, consisting of caches for different data types, specifically optimized for graph applications.

Readers can refer to the preliminary version [26] for low-level details of the proposed architecture. In this extended version, the main focus is on the design automation aspects that could not be included in the preliminary version.
V. Template-Based Methodology

In this section, we describe our template-based design methodology to implement accelerators for irregular graph applications. This template is created only once and is utilized across many applications. Hence, it is affordable to incorporate important architectural features into this template such as asynchronous execution support, lightweight synchronization mechanisms, and active vertex set, as described in Section IV. While these features significantly improve energy efficiency of the synthesized accelerators, the template-based design methodology reduces the design effort substantially.

A. Design flow

The proposed design flow is shown in Figure 3. Observe that there is a clear separation between the template implementation and the application specific models in this methodology. In Figure 3, the dark gray color corresponds to the user code, and the light gray corresponds to the template code and the automatically generated models in the design flow. The template code comes with functional and performance SystemC models. The functional model does not include microarchitectural details of the final implementation and thus is not representative of the performance of an accelerator. It is primarily used to validate functionality of the user application. The performance model is a cycle-accurate SystemC model that has all the microarchitectural details of the template modules, except the user-defined functions. Section V-D provides more details about these two models.

In the proposed design flow, the user specializes the template by providing application specific data types and implementation of some predefined functions in C language (see Section V-B for details). As an example, the user-specified code is less than 50 lines for the PageRank application. In addition, the user is responsible for providing HLS directives for the application specific functions to specify operations such as loop unrolling and pipelining. These directives can be specified as pragmas in the user source code. It is important to point out that the user has no access to the template code during this process, hence (s)he only needs to deal with the short snippets of the application specific code.

Once the template is specialized, an HLS flow is used to generate RTL for the user-defined functions for timing characterization. After the latency and throughput values for each user function are computed, they are back-annotated into the performance models to produce a system-level performance model for this accelerator. This model is then used for design space exploration, details of which will be given in Section V-E. After automatic tuning of the template parameters, the implementation-ready SystemC models can be synthesized to produce RTL using a standard HLS flow. During synthesis, the unused parts of our template are automatically discarded.

B. Programming Interface & Data Structures

The proposed template creates an abstract representation for the application program and graph data. For our programming interface, we have adopted the well known Gather-Apply-Scatter (GAS) model from [20] which is originally designed for large-scale distributed graph processing. In this model, the user needs to divide the vertex program into three logical building blocks as follows:

- **Gather**: The neighboring edges and/or vertices of the current vertex are processed to compute an accumulated value in this stage. The user needs to provide an accumulation function corresponding to one neighbor only.
- **Apply**: Main computation is done for the current vertex in this stage by using the accumulated value from the Gather stage. The user provides the compute function for the vertex.
- **Scatter**: The neighbor vertices are activated if necessary and the data computed in **Apply** stage is distributed to the neighboring vertices and/or edges. The user needs to provide the scatter function for a single neighbor only.

In addition to the three logical blocks, the programming interface includes helper functions. Both **Gather** and **Scatter** have two helper functions each: **GatherInit** and **ScatterInit** functions are used to initialize the corresponding states for the **Gather** and **Scatter** stages; **GatherFinish** and **ScatterFinish** functions update the associated data for the current vertex and finalize the states for the **Gather** and **Scatter** stages.

To store the graph data in memory and to facilitate communication between computational blocks, the template has eight different data types: **PrivateVertexData** stores data associated with a vertex that can be accessed by only the corresponding vertex; **SharedVertexData** stores data associated with a vertex that can be accessed by the neighboring vertices; **VertexData** is a combination of **PrivateVertexData** and **SharedVertexData**; **EdgeDataG** corresponds to the data associated with an incoming edge of a vertex and is used in the Gather stage; **EdgeDataS** corresponds to the data associated with an outgoing edge of a vertex and is used in the Scatter stage; **GatherState** stores the intermediate state used in the Gather iterations and it is also passed to the Apply stage; **ApplyState** is the state computed...
After initializing the page rank value for the current vertex (Line 1), the algorithm starts by accumulating the rank value (Line 2). This process involves gathering data from neighboring vertices (Line 3) and then applying the rank update formula (Line 4). The rank update formula is given by:

\[
\text{newRankScaled} = \frac{\text{doScatter} \times \text{newRankScaled} + \beta \times \text{gst.accumPR}}{1 + \beta}
\]

where \( \text{doScatter} \) is a Boolean variable indicating whether a scatter operation is needed for a vertex, \( \beta \) is the rank scaling coefficient, and \( \text{gst.accumPR} \) is the accumulated rank value from the gather stage.

If the new rank value exceeds the error threshold (Line 5), the scatter operation is triggered (Line 6). This process continues until all vertices have been processed.

The proposed template architecture is configurable through the SystemC template. This allows for the exploration of different parameter settings for the accelerator, such as the number of vertices/edges processed simultaneously and the cache sizes. These parameters can be varied to optimize performance for different applications.

The architectural template is validated through simulation and experimentation. The performance is estimated using SystemC simulation and cycle-accurate models. The cycle-accurate model is used to estimate actual performance of the accelerator. The SystemC simulation provides an efficient method to explore the design space, allowing for the identification of the optimal parameter settings for different applications.

The proposed template architecture is configurable through microarchitectural parameters. It is possible to vary these parameters for an application and observe the effects on performance by running SystemC simulation of the performance model. Since the exploration space is large, we also propose an automatic design space exploration framework that will be explained in the next subsection.

### E. Design Space Exploration

There are several microarchitectural parameters in the template proposed. The first set of parameters is the number of vertices/edges that can be processed concurrently in the gather/scatter stages. As mentioned earlier, processing multiple vertices/edges in an accelerator unit allows memory level parallelism to hide the DRAM access latencies. However, a temporary execution state needs to be stored locally in the accelerator unit corresponding to each vertex and edge processed. In other words, increasing these parameter values can improve performance in exchange for a larger local storage. The other set of parameters are the cache sizes corresponding to different graph data types. If a certain data type has high access locality for an application, it makes sense to increase the corresponding cache size to improve performance in exchange for extra area and power.
Design_Space_Exploration()

1. Metric $m_1 \leftarrow \text{maximize } T/A$
2. $(T_0, A_0) \leftarrow \text{OptimizeParams}(m_1)$
3. $i \leftarrow 1$
4. for each $\alpha_i$ in the input parameter set do
5. Define metric $m_2 \text{: maximize } (T/T_0 - \alpha_i A/A_0)$
6. $(T_i, A_i) \leftarrow \text{OptimizeParams}(m_2)$
7. Add $(T_i, A_i)$ to the Pareto curve
8. $i \leftarrow i + 1$
9. return Pareto curve

Figure 7: The high-level algorithm for design space exploration.

The tradeoff between performance and area/power is application specific and depends on different factors. For example, consider an application for which the main performance bottleneck is the gather operation (e.g., PageRank). It makes sense to increase the number of vertices/edges that can be processed concurrently in the gather stage for such an application. On the other hand, if the VertexData accessed during the gather operation has poor memory access locality, it makes sense to keep the corresponding cache size small to save area and power. The exact parameter values chosen should depend on both the application characteristics and the desired tradeoffs. In particular, the designers need to consider how much extra area/power they are willing to pay for a certain amount of increase in performance.

When all the microarchitecture parameters are considered, the exploration space is quite large, and tuning the parameters requires an inherent knowledge of both the application characteristics and the template microarchitecture. We propose an automatic design space exploration methodology to shield the designers from these details. For a given application, our methodology generates a Pareto curve between performance and area/power at different design points. After that, the designer can choose the desired tradeoff and the corresponding parameters for the accelerator.

In the rest of this section, we use throughput (i.e. number of vertices or edges processed per second) as proxy for performance and the hardware area as proxy for power. However, our methodology is applicable for different metrics as well. To estimate the throughput of an accelerator for a specific set of parameters, we run our performance simulator long enough such that the effect of the warm-up period on the measured performance is negligible. To estimate the area corresponding to a set of parameters, we first characterize the area impact of each parameter by running synthesis (for compute unit parameters) and using the available cache models (for cache size parameters).

For the purpose of generating a Pareto curve, it makes sense to maximize a linear function of throughput ($T$) and area ($A$) such as: $T - \alpha A$. By varying the $\alpha$ parameter, one can generate different design points with different tradeoffs. The advantage of such a formulation is that it has a well-defined mathematical property: The slope of the Pareto curve at a particular point must be equal to the $\alpha$ value used to generate that point. However, the disadvantage is that such a function is not intuitive because $T$ and $A$ have different units and it is not clear how to choose the range of $\alpha$ values.

For this reason, we propose an exploration methodology, as shown in Figure 7. In this methodology, we use an iterative optimization algorithm (denoted as OptimizeParams in the figure) to compute the parameters that maximize a given objective metric. This is a greedy algorithm that starts from an initial design point, evaluates an incremental change for each parameter and chooses the change that maximizes the given metric at every iteration. The iterations continue until the improvement obtained is below a certain threshold. Our experiments have demonstrated that such a greedy algorithm is sufficient in our exploration framework. The architectural parameters we considered in this framework are as follows: the number of concurrently processed vertices and edges in the gather and scatter stages (four different parameters), the sizes of the caches for VertexInfo, VertexData, EdgeInfo, and EdgeData (four different parameters).

In the first step of the proposed exploration methodology in Figure 7, the well-known power-delay-product metric\(^1\) is used to compute the reference values for throughput and area, denoted as $T_0$ and $A_0$. Then, we perform a number of iterations to generate a Pareto curve around the reference point. For this, we use a linear optimization metric where each term is normalized with respect to the reference values. The $\alpha$ value in this metric determines the tradeoff between the change in throughput and area. In our experiments, the $\alpha$ value is varied between 0 and 5. Here, $\alpha = 0$ is one extreme, where the objective is to maximize throughput without considering the impact on area, whereas $\alpha = 5$ is the other extreme, where every $x\%$ increase in area must be compensated by at least $5x\%$ improvement in throughput. By normalizing throughput and area with their reference values, we believe that the optimization metric and the meaning of the $\alpha$ parameter has become more intuitive for the users.

Note that each $\alpha$ value considered in Figure 7 generates a point on the Pareto curve. The Pareto curve generated for the PageRank accelerator is shown in Figure 12 as an example.

VI. EXPERIMENTAL STUDY

A. Applications

In our experiments, we have evaluated three graph applications: Single-Source Shortest Path (SSSP), Stochastic Gradient Descent (SGD), and PageRank (PR). Brief descriptions of SSSP and SGD are given below, while the detailed description of PR was given in Section V-C.

Single Source Shortest Path (SSSP): The distance values of all vertices are initialized to be infinity except for the source vertex, which has distance of zero by definition. A vertex updates its distance value based on the distance values of its neighbors using the formula in expression (1). In our template implementation, the distance value of a vertex is stored in the corresponding vertex data. When the distance value of a vertex is changed, the neighbors of the vertex are activated.

$$
\text{dist}^{i+1}(v) = \min_{(u,v) \in E} \text{dist}(u) + \text{weight}(u,v)
$$

\(^1\)As mentioned earlier, we use area as proxy for power and 1/throughput as proxy for delay in this section. Hence the actual maximization metric becomes $T/A$. 

\[0278-0070\) (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Table I: Datasets used in our experiments.

<table>
<thead>
<tr>
<th>Application</th>
<th>Dataset</th>
<th># Vert.</th>
<th># Edges</th>
</tr>
</thead>
<tbody>
<tr>
<td>PageRank</td>
<td>wg</td>
<td>916K</td>
<td>5.1M</td>
</tr>
<tr>
<td>SSSP (Directed)</td>
<td>pk</td>
<td>1.6M</td>
<td>30M</td>
</tr>
<tr>
<td></td>
<td>lj</td>
<td>4.8M</td>
<td>69M</td>
</tr>
<tr>
<td></td>
<td>g24</td>
<td>16.8M</td>
<td>268M</td>
</tr>
<tr>
<td>SGD (Undirected)</td>
<td>1M</td>
<td>9.7K</td>
<td>1M</td>
</tr>
<tr>
<td></td>
<td>10M</td>
<td>80K</td>
<td>10M</td>
</tr>
</tbody>
</table>

Stochastic Gradient Descent (SGD): SGD is a matrix factorization technique which is used in recommender systems. The aim of SGD is to compute a feature vector for each user and item based on the ratings provided as training data. Then, using these feature vectors, the missing ratings between other users and items can be estimated. This application takes a bipartite graph as input, where the vertices correspond to users and items, respectively, and the edges correspond to the given ratings in the training data. The algorithm starts with a random feature vector for each vertex, and it iteratively updates them using expression (2) and (3). Here, the estimation error for rating $r_{ui}$ is denoted as $e_{ui}$ in (2), and the feature vector for vertex $u$ ($v_{ui}$) is updated based on the gradient function in (3). These functions are implemented in the gather stage of our template, where the gradient value for the current vertex is accumulated over all neighbors. Gather also keeps track of the maximum error observed, and the neighbors are updated in the scatter stage if the error is above a certain threshold.

$$e_{ui} = r_{ui} - \langle v_{ui}, v_i \rangle$$  \hspace{1cm} (2)

$$v_{ui} = v_{ui} + \gamma (e_{ui}v_i - \lambda v_{ui})$$  \hspace{1cm} (3)

B. Experimental Setup

We have compared the performance of the generated accelerators with manually implemented HLS accelerators and a state of the art Ivy Bridge system. Details of the execution environments are as follows:

1) CPU: The CPU measurements are collected on a 2-socket 24-core Ivy Bridge server with the following cache sizes per socket: 768KB of private L1, 3MB of private L2, and 30MB of shared L3 cache. The total DRAM size is 132GB. For both PageRank and SSSP, we used the implementations from the Berkeley GAP benchmark [2]. Furthermore, we improved the PageRank implementation by adding a bit vector to keep track of the active vertices to improve convergence. For the SGD application, we used DSGD implementation from [14] since it is one of the best implementations available as stated in [29]. For all applications, we have used gcc 4.9.1 and enabled -O3 level optimizations. During execution, we set NUMA policy to distribute memory allocation for the graph to maximize the memory bandwidth utilization.

2) Template Acc: The generated accelerators are created by the proposed template-based methodology (Section V). The accelerators for all applications are composed of 4 Accelerator Units (AUs) and are customized per application.

3) HLS Acc: The accelerators are generated by the HLS flow from a manually-coded SystemC description (Section III). The SystemC code is implemented for each application independently and the memory subsystem code is reused. The memory subsystem consists of a simple load/store mechanism for each graph data type and supports the following features: unpacking/packing of streaming graph data types (e.g. EdgeInfo and VertexInfo) to/from cache lines; up to 128 outstanding cache line requests from/to the main memory, and the round-robin-based arbitration of multiple requests/responses from load/store units from/to the main memory.

Datasets: In our experiments, we have used datasets which are either obtained from well-known graph databases or generated by widely used tools. For example, PageRank and SSSP work on directed graphs. We have selected WebGoogle(wg), soc-Pokec(pk) and soc-LiveJournal(lj) datasets from SNAP [18] graph database. Additionally, we have generated a large graph (g24) by using the Graph500 [3] tool.

For SGD, two movie datasets are used from MovieLens [4] which have 1 million and 10 million movie ratings. Details of the selected graphs can be found in Table I.

C. Experimental Results

1) Estimation methodology

For the CPU experiments, we have run the applications on the native system and measured the runtime using OpenMP functions and the CPU power consumption using Running Average Power Limit (RAPL) [5] framework, which provides core and uncore power consumption values by reading the MSR registers. Since we are using a DDR4 memory model for our accelerators, we have provided DDR4 power consumption values for the baseline CPU system to make a fair comparison. For this purpose, we have generated DDR4 access traces that correspond to the same DDR3 bandwidth utilization of the base CPU system and fed them to the DRAMSim2 [28] tool.

For both the proposed template performance model and the application specific HLS baseline model, a commercial HLS tool was used to generate RTL from the SystemC models. We used an industrial 22nm technology library for standard cells and metal layers. The RTL is synthesized using a commercial physical-aware logic synthesis tool to produce the gate-level netlist as well as the timing reports that include the wire delay estimations. All accelerators operate at 1GHz frequency, and all designs can satisfy the timing constraints. For power estimations, the switching activity for all inputs and sequential elements are saved in SAIF format during RTL simulation. Then, a commercial power analysis tool is used to measure the static and dynamic power for the given switching activity files.

2) Power, Performance and Area Results

To estimate the power consumption of memory blocks in the template architecture, we have used well-known simulators CACTI [1] and DRAMSim2 [28]. We have used CACTI for estimating the power consumption of caches; however since our accelerator is synthesized for 22nm, we have scaled down the area and power values generated by CACTI from 32nm to 22nm. For scaling area, coefficient of 0.5 is used [9, 10], whereas for scaling power, coefficients of 0.569[16] (dynamic) and 0.8 [32] (leakage) are used. For DRAM power consumption of both template-based and the HLS accelerators, we have integrated the DRAMSim2 tool into our simulators and used the aforementioned DDR4 model for power and timing estimations.

Figure 8 reports the throughput of computation for the template and HLS accelerators, and the 24-core CPU executions.
in terms of the number of graph edges processed per second. In Figure 9, the corresponding total execution times are given. The values in these figures are reported for each application-dataset pair and are normalized with respect to the corresponding 1-core Xeon CPU execution.

The results in these figures confirm that the edge throughput metric does not determine the performance by itself. Although both the template and HLS accelerators have comparable levels of edge throughput, the template accelerators are faster by a factor of up to 19x in terms of the total execution time. As was discussed in Section II, the total runtime depends not only on the throughput of computation but also on the number of edges processed until convergence [20, 25]. Since the HLS accelerators do not support asynchronous execution and active vertex set, they end up processing many more edges until convergence as will be shown in Table II and explained below.

Power consumption comparison is shown in Figure 10. Power consumption for both the template and the HLS accelerators are dominated by the DRAM power. The computational units take less than 3% of the power values reported. Both accelerator implementations have 17-68x times less power consumption than the 24-core CPU runs. Although the power consumption of the template and HLS accelerators are similar, the template accelerators are much more energy efficient when the total execution times are taken into account. In particular, the template accelerators are up to 18x and 69x more energy efficient than the HLS accelerators and the 24-core CPU, respectively, as shown in Figure 11.

In Table II, we provide a summary of how the template accelerators compare to the HLS-based accelerators in terms of area, energy consumption, and total work metrics for all three applications. Area reported includes all the computational units, caches for the template accelerator and the memory subsystem for the HLS-based accelerator. Energy consumption and the total work metrics are measured for the largest graphs (g24 for PR, SSSP and 10M for SGD). The total work metric (the last row of Table II) is reported as the number of edges processed until convergence divided by the number of edges in the graph. It can be observed from the table that the template accelerators are more energy efficient than the HLS accelerators due to better work efficiency. Note that higher work efficiency is due to the architectural support for asynchronous execution and active vertex set. These extra features lead to 2-5 times larger areas for the template accelerators. Nevertheless, the accelerator areas are orders of magnitude smaller than the CPU area.

3) Design Space Exploration
To study the effectiveness of our design space exploration methodology, we have selected the PR application for case study. Pareto curve generated by the heuristic shown in Figure 7 is given in Figure 12. In this experiment, we swept the $\alpha$ values between 0 and 5.
When we consider $\alpha = 0$, area does not have any impact on the metric $T - \alpha \cdot A$, thus we expect that only throughput will be optimized. On the other hand, larger $\alpha$ values increase the significance of the area term. Smaller area means smaller buffers and caches, and this causes two problems: (1) smaller number of edge slots and vertex rows imply less parallelism in GU and SCU; and it is harder to hide memory access latency, (2) smaller caches increase the number of memory accesses and impose higher memory latency. Therefore, we observe smaller throughput for large $\alpha$ values.

While high throughput values are desirable, different projects may have different design constraints. As shown in Figure 12, our heuristic provides a range of design points that a user can select based on specific constraints. However, we can say that design points that reside in the lower right corner of this figure would represent most desirable points in practice.

4) Design Effort Discussion
In Table III, we report the number of lines used to implement an application as a proxy for the effort invested in the accelerator design. For the template methodology, it includes all application functions (gather, scatter, apply) and the application data types. For HLS, we include the application specific code excluding the memory subsystem. We also provide the size of the code that goes in the gather/apply/scatter stages and explore various parameters as shown in the paper. The rest of the architecture is captured in SystemC with low-level hardware specific optimizations, which does not have to be modified by an algorithm developer. For an ASIC implementation, hardware designers may still need to be involved in the later stages of the design implementation process such as logic synthesis and physical design. In addition to the ASIC-based design that the proposed methodology primarily targets, the architecture presented can also be used in reconfigurable hardware platforms such as FPGAs. In this case, the software engineer can manage the entire accelerator design process by specifying the gather/apply/scatter stages and explore various optimizations specific to irregular graph applications such as synchronization, communication, latency tolerance, worklist maintenance, etc. into the proposed template. Our results show that these optimizations can generate up to 18x better energy efficiency compared to the accelerators generated using direct HLS, while the design effort is significantly lower.

In this paper, we propose a template-based high-level design methodology specifically targeted at designing hardware accelerators for graph applications. The main idea is to provide a simple interface for design engineers, who can model a graph application by only defining the basic data structures and high-level functions. While the user-level models are defined per vertex and edge, the parallel hardware execution is handled by the architectural template that is common for many graph applications.

The proposed methodology allows a software designer to be directly involved in the accelerator design. Since the algorithm-specific functions gather/apply/scatter used in SystemC are captured in plain C++ language, a software designer can perform algorithm development and exploration by choosing what goes in the gather/apply/scatter stages and explore various parameters as shown in the paper. The rest of the architecture is captured in SystemC with low-level hardware specific optimizations, which does not have to be modified by an algorithm developer. For an ASIC implementation, hardware designers may still need to be involved in the later stages of the design implementation process such as logic synthesis and physical design. In addition to the ASIC-based design that the proposed methodology primarily targets, the architecture presented can also be used in reconfigurable hardware platforms such as FPGAs. In this case, the software engineer can manage the entire accelerator design process by specifying the gather/apply/scatter functions in C++ language and use FPGA-specific HLS tools to generate the FPGA bitstream.

In this paper, we show that our template-based methodology can reduce the design time significantly compared to conventional HLS. Furthermore, since the template is amortized over many applications, we can afford to incorporate complex optimizations specific to irregular graph applications such as synchronization, communication, latency tolerance, worklist maintenance, etc. into the proposed template. Our results show that these optimizations can generate up to 18x better energy efficiency compared to the accelerators generated using direct HLS, while the design effort is significantly lower.

REFERENCES

Table III: Lines of code

<table>
<thead>
<tr>
<th></th>
<th>PageRank</th>
<th>SSSP</th>
<th>SGD</th>
</tr>
</thead>
<tbody>
<tr>
<td>Application code for template</td>
<td>43</td>
<td>34</td>
<td>76</td>
</tr>
<tr>
<td>Application code for HLS</td>
<td>735</td>
<td>1333</td>
<td>1423</td>
</tr>
<tr>
<td>Template: functional model</td>
<td>753</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Template: performance model</td>
<td>38650</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>


