
Writing a Hetero-C++ Program and Compiling it with HPVM

We will write a simple Matrix Multiplication kernel to illustrate how to compile a program through HPVM to target the CPU, GPU, or FPGA. The implementation is written in Hetero-C++, a parallel dialect of C/C++ that describes hierarchical task-level and data-level parallelism and compiles through HPVM.

Writing a Program in Hetero-C++

We start with a scalar implementation of Matrix Multiplication written in C++.

#include <cstddef>
#include <heterocc.h>

void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim) {
  for (int i = 0; i < left_dim; i++) {
    for (int j = 0; j < right_dim; j++) {
      Res[i * right_dim + j] = 0;
      for (int k = 0; k < common_dim; k++) {
        // Res[i,j] += V1[i,k] * V2[k,j]
        Res[i * right_dim + j] +=
            V1[i * common_dim + k] * V2[k * right_dim + j];
      }
    }
  }
}

This implementation is a three-level loop nest: the outer two loops iterate over the index variables i and j, and the innermost loop iterates over the index variable k. The iterations over i and j are independent of each other and can therefore execute in parallel. The innermost loop, however, performs a reduction and cannot be parallelized with HPVM. We can express this information in Hetero-C++ with the __hetero_parallel_loop marker function, as shown below.

#include <cstddef>
#include <heterocc.h>

void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim) {
  for (int i = 0; i < left_dim; i++) {
    for (int j = 0; j < right_dim; j++) {
      __hetero_parallel_loop(
          /* Num Parallel Enclosing Loops */ 2,
          /* Num Input Pairs */ 6, Res, Res_Size, V1, V1_Size, V2, V2_Size,
          left_dim, right_dim, common_dim,
          /* Num Output Pairs */ 1, Res, Res_Size,
          /* Optional Node Name */ "matmul_parallel_loop");

      Res[i * right_dim + j] = 0;

      for (int k = 0; k < common_dim; k++) {
        // Res[i,j] += V1[i,k] * V2[k,j]
        Res[i * right_dim + j] +=
            V1[i * common_dim + k] * V2[k * right_dim + j];
      }
    }
  }
}

The marker declares that the two enclosing loops over i and j are parallel to each other. Additionally, it specifies the inputs and outputs of the loop body that will be parallelized.

To complete the specification of the program, we add marker calls to __hetero_section_begin/end to denote that the region of code contains at least one HPVM computational node. We additionally wrap the loop nest in __hetero_task_begin/end markers to nest the computation inside a node, which is needed to enable compilation to the GPU. A sketch of the fully annotated function is shown below.
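Putting these markers together, the fully annotated function might look roughly as follows. This is a minimal sketch, assuming the usual Hetero-C++ structure in which the task body opens its own nested section around the parallel loop nest; the node name "matmul_task" and the local variable names holding the section and task handles are chosen here for illustration.

#include <cstddef>
#include <heterocc.h>

void matmul(int *Res, std::size_t Res_Size, int *V1, std::size_t V1_Size,
            int *V2, std::size_t V2_Size, std::size_t left_dim,
            std::size_t right_dim, std::size_t common_dim) {
  // Mark the region that contains at least one HPVM computational node.
  void *Section = __hetero_section_begin();

  // Wrap the loop nest in a task so the computation becomes its own node,
  // which is what enables compilation of this kernel to the GPU.
  void *Task = __hetero_task_begin(
      /* Num Input Pairs */ 6, Res, Res_Size, V1, V1_Size, V2, V2_Size,
      left_dim, right_dim, common_dim,
      /* Num Output Pairs */ 1, Res, Res_Size,
      /* Optional Node Name */ "matmul_task");

  // The task body containing the parallel loops forms its own section.
  void *InnerSection = __hetero_section_begin();

  for (int i = 0; i < left_dim; i++) {
    for (int j = 0; j < right_dim; j++) {
      __hetero_parallel_loop(
          /* Num Parallel Enclosing Loops */ 2,
          /* Num Input Pairs */ 6, Res, Res_Size, V1, V1_Size, V2, V2_Size,
          left_dim, right_dim, common_dim,
          /* Num Output Pairs */ 1, Res, Res_Size,
          /* Optional Node Name */ "matmul_parallel_loop");

      Res[i * right_dim + j] = 0;

      for (int k = 0; k < common_dim; k++) {
        // Res[i,j] += V1[i,k] * V2[k,j]
        Res[i * right_dim + j] +=
            V1[i * common_dim + k] * V2[k * right_dim + j];
      }
    }
  }

  __hetero_section_end(InnerSection);

  __hetero_task_end(Task);

  __hetero_section_end(Section);
}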