Extending Kompute with Custom C++ Operations

Kompute provides an extensible architecture which allows the core components to be extended by building custom operations.

Building operations is intuitive, but it does require understanding some nuances around the order in which each of the operation's functions is called as a sequence is executed.

These nuances are important for more advanced users of Kompute, as they provide further intuition into the specific functions and components that the native operations (such as OpTensorCreate, OpAlgoBase, etc.) contain, and which define their behaviour.

Flow of Function Calls

The top-level class that all operations inherit from is kp::OpBase. The “Core Native Operations” such as kp::OpTensorCopy, kp::OpTensorCreate, etc. all inherit from this base operation class.

The kp::OpAlgoBase class is another base operation, built specifically to enable users to create their own operations containing custom shader logic (i.e. requiring Vulkan Compute Pipelines, DescriptorSets, etc.). The next section contains an example showing how to extend the OpAlgoBase class.

Below you can find the functions that operations can override, together with a description of when each one is called.

OpBase(…, tensors, freeTensors)
    Constructor for the class, where you can load/define resources such as shaders.

~OpBase()
    Destructor that frees Vulkan resources (if owned), and which should be used to manage any memory allocations created through the operation.

init()
    Called by the Sequence / Manager during the record step. This function allows the relevant objects to be initialised within the operation.

record()
    Called by the Sequence / Manager during the record step, after init(). In this function you can record commands directly into the Vulkan command buffer.

preEval()
    When the Sequence is evaluated, preEval() is called on every operation before the batch of recorded commands is dispatched to the GPU. This is useful, for example, if you need to copy data from local to host memory.

postEval()
    After the Sequence is evaluated, postEval() is called on every operation. When running asynchronously, postEval() is only called when you call evalAwait(), which is why it is important to always run evalAwait() to ensure the process does not end up in an inconsistent state.
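
To make the order above concrete, below is a minimal sketch (not part of Kompute itself) of an operation that stubs out each of these hooks. It derives from OpAlgoBase so that the constructor matches the example in the next section; the class name OpLifecycleSketch is purely illustrative, and forwarding each hook to the OpAlgoBase implementation is an assumption made here to preserve the default behaviour.

class OpLifecycleSketch : public kp::OpAlgoBase
{
  public:
    OpLifecycleSketch(std::shared_ptr<vk::PhysicalDevice> physicalDevice,
                      std::shared_ptr<vk::Device> device,
                      std::shared_ptr<vk::CommandBuffer> commandBuffer,
                      std::vector<std::shared_ptr<kp::Tensor>> tensors)
      : OpAlgoBase(physicalDevice, device, commandBuffer, tensors, "")
    {
        // Constructor: load/define resources such as the shader to run
        this->mShaderFilePath = "shaders/glsl/opmult.comp";
    }

    ~OpLifecycleSketch() override
    {
        // Destructor: free any resources owned by this operation
    }

    void init() override
    {
        // Called during the record step: validate tensors and initialise
        // sub-components, then defer to the default behaviour
        OpAlgoBase::init();
    }

    void record() override
    {
        // Called after init(): record commands into the command buffer
        OpAlgoBase::record();
    }

    void preEval() override
    {
        // Called on every operation before the recorded batch is submitted
        OpAlgoBase::preEval();
    }

    void postEval() override
    {
        // Called after the submission has completed (when running
        // asynchronously, this happens on evalAwait())
        OpAlgoBase::postEval();
    }
};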

Simple Operation Extending OpAlgoBase

Below we show a very simple example of creating an operation with a pre-specified shader; in this case, the multiplication shader.

#include <iostream>
#include <memory>
#include <vector>

#include <fmt/ranges.h>

#include "kompute/Kompute.hpp"

class OpMyCustom : public kp::OpAlgoBase
{
  public:
    OpMyCustom(std::shared_ptr<vk::PhysicalDevice> physicalDevice,
           std::shared_ptr<vk::Device> device,
           std::shared_ptr<vk::CommandBuffer> commandBuffer,
           std::vector<std::shared_ptr<kp::Tensor>> tensors)
      : OpAlgoBase(physicalDevice, device, commandBuffer, tensors, "")
    {
        // Perform your custom steps such as reading from a shader file
        this->mShaderFilePath = "shaders/glsl/opmult.comp";
    }
};


int main() {

    kp::Manager mgr; // Automatically selects Device 0

    // Create 3 tensors of default type float
    auto tensorLhs = std::make_shared<kp::Tensor>(kp::Tensor({ 0., 1., 2. }));
    auto tensorRhs = std::make_shared<kp::Tensor>(kp::Tensor({ 2., 4., 6. }));
    auto tensorOut = std::make_shared<kp::Tensor>(kp::Tensor({ 0., 0., 0. }));

    // Create the tensor data explicitly on the GPU with an operation
    mgr.evalOpDefault<kp::OpTensorCreate>({ tensorLhs, tensorRhs, tensorOut });

    // Run the custom Kompute operation on the tensors provided
    mgr.evalOpDefault<OpMyCustom>(
        { tensorLhs, tensorRhs, tensorOut });

    // Prints the output, which is { 0, 4, 12 }
    std::cout << fmt::format("Output: {}", tensorOut->data()) << std::endl;
}

More Complex Operation Extending OpAlgoBase

Below we show a more complex operation that performs the following:

  • Expects three tensors for an operation, two inputs and one output

  • Expects the tensors to be initialised

  • Checks that the tensors are of the same size

  • Expects output tensor to be of type TensorTypes::eDevice (and creates staging tensor)

  • Has functionality to read shader from file or directly from spirv bytes

  • Records relevant bufferMemoryBarriers

  • Records dispatch command

  • Records copy command from device tensor to staging output tensor

  • In postEval it maps data from staging tensor to output tensor’s data

For starters, the header file contains the functions that will be overridden:

#pragma once

#include <fstream>

#include "kompute/Core.hpp"

#include "kompute/Algorithm.hpp"
#include "kompute/Tensor.hpp"

#include "kompute/operations/OpAlgoBase.hpp"

namespace kp {

/**
 * Operation base class to simplify the creation of operations that require
 * right hand and left hand side datapoints together with a single output.
 * The expected data passed is two input tensors and one output tensor.
 */
class OpAlgoLhsRhsOut : public OpAlgoBase
{
  public:
    /**
     *  Base constructor, should not be used unless explicitly intended.
     */
    OpAlgoLhsRhsOut();

    /**
     * Default constructor with parameters that provides the bare minimum
     * requirements for the operations to be able to create and manage their
     * sub-components.
     *
     * @param physicalDevice Vulkan physical device used to find device queues
     * @param device Vulkan logical device for passing to Algorithm
     * @param commandBuffer Vulkan Command Buffer to record commands into
     * @param tensors Tensors that are to be used in this operation
     * @param freeTensors Whether operation manages the memory of the Tensors
     * @param komputeWorkgroup Optional parameter to specify the layout for processing
     */
    OpAlgoLhsRhsOut(std::shared_ptr<vk::PhysicalDevice> physicalDevice,
           std::shared_ptr<vk::Device> device,
           std::shared_ptr<vk::CommandBuffer> commandBuffer,
           std::vector<std::shared_ptr<Tensor>> tensors,
           const Workgroup& komputeWorkgroup = {});

    /**
     * Default destructor, which is in charge of destroying the algorithm
     * components but does not destroy the underlying tensors
     */
    virtual ~OpAlgoLhsRhsOut() override;

    /**
     * The init function is responsible for ensuring that all of the tensors
     * provided are aligned with requirements such as LHS, RHS and Output
     * tensors, and creates the algorithm component which processes the
     * computation.
     */
    virtual void init() override;

    /**
     * This records the commands that are to be sent to the GPU. This includes
     * the barriers that ensure the memory has been copied before going in and
     * out of the shader, as well as the dispatch operation that sends the
     * shader processing to the gpu. This function also records the GPU memory
     * copy of the output data for the staging buffer so it can be read by the
     * host.
     */
    virtual void record() override;

    /**
     * Executes after the recorded commands are submitted, and performs a copy
     * of the GPU Device memory into the staging buffer so the output data can
     * be retrieved.
     */
    virtual void postEval() override;

  protected:
    // -------------- NEVER OWNED RESOURCES
    std::shared_ptr<Tensor> mTensorLHS; ///< Reference to the parameter used in the left hand side equation of the shader
    std::shared_ptr<Tensor> mTensorRHS; ///< Reference to the parameter used in the right hand side equation of the shader
    std::shared_ptr<Tensor> mTensorOutput; ///< Reference to the parameter used in the output of the shader and will be copied with a staging vector
};

} // End namespace kp

Then the implementation file contains all the logic that performs the actions above:

#include "kompute/operations/OpAlgoLhsRhsOut.hpp"

namespace kp {

OpAlgoLhsRhsOut::OpAlgoLhsRhsOut()
{
    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut constructor base");
}

OpAlgoLhsRhsOut::OpAlgoLhsRhsOut(
  std::shared_ptr<vk::PhysicalDevice> physicalDevice,
  std::shared_ptr<vk::Device> device,
  std::shared_ptr<vk::CommandBuffer> commandBuffer,
  std::vector<std::shared_ptr<Tensor>> tensors,
  const Workgroup& komputeWorkgroup)
  // The inheritance is initialised with copyOutputData set to false, given that
  // this dependent class handles the transfer of data via staging buffers in
  // a granular way.
  : OpAlgoBase(physicalDevice, device, commandBuffer, tensors, komputeWorkgroup)
{
    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut constructor with params");
}

OpAlgoLhsRhsOut::~OpAlgoLhsRhsOut()
{
    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut destructor started");
}

void
OpAlgoLhsRhsOut::init()
{
    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut init called");

    if (this->mTensors.size() < 3) {
        throw std::runtime_error(
          "Kompute OpAlgoLhsRhsOut called with less than 3 tensors");
    } else if (this->mTensors.size() > 3) {
        KP_LOG_WARN(
          "Kompute OpAlgoLhsRhsOut called with more than 3 tensors");
    }

    this->mTensorLHS = this->mTensors[0];
    this->mTensorRHS = this->mTensors[1];
    this->mTensorOutput = this->mTensors[2];

    if (!(this->mTensorLHS->isInit() && this->mTensorRHS->isInit() &&
          this->mTensorOutput->isInit())) {
        throw std::runtime_error(
          "Kompute OpAlgoLhsRhsOut all tensor parameters must be initialised. "
          "LHS: " +
          std::to_string(this->mTensorLHS->isInit()) +
          " RHS: " + std::to_string(this->mTensorRHS->isInit()) +
          " Output: " + std::to_string(this->mTensorOutput->isInit()));
    }

    if (!(this->mTensorLHS->size() == this->mTensorRHS->size() &&
          this->mTensorRHS->size() == this->mTensorOutput->size())) {
        throw std::runtime_error(
          "Kompute OpAlgoLhsRhsOut all tensor parameters must be the same size "
          "LHS: " +
          std::to_string(this->mTensorLHS->size()) +
          " RHS: " + std::to_string(this->mTensorRHS->size()) +
          " Output: " + std::to_string(this->mTensorOutput->size()));
    }

    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut fetching spirv data");

    std::vector<uint32_t> shaderFileData = this->fetchSpirvBinaryData();

    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut Initialising algorithm component");

    this->mAlgorithm->init(shaderFileData, this->mTensors);
}

void
OpAlgoLhsRhsOut::record()
{
    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut record called");

    // Barrier to ensure the data is finished writing to buffer memory
    this->mTensorLHS->recordBufferMemoryBarrier(
      this->mCommandBuffer,
      vk::AccessFlagBits::eHostWrite,
      vk::AccessFlagBits::eShaderRead,
      vk::PipelineStageFlagBits::eHost,
      vk::PipelineStageFlagBits::eComputeShader);
    this->mTensorRHS->recordBufferMemoryBarrier(
      this->mCommandBuffer,
      vk::AccessFlagBits::eHostWrite,
      vk::AccessFlagBits::eShaderRead,
      vk::PipelineStageFlagBits::eHost,
      vk::PipelineStageFlagBits::eComputeShader);

    this->mAlgorithm->recordDispatch(this->mKomputeWorkgroup[0],
                                     this->mKomputeWorkgroup[1],
                                     this->mKomputeWorkgroup[2]);

    // Barrier to ensure the shader code is executed before buffer read
    this->mTensorOutput->recordBufferMemoryBarrier(
      this->mCommandBuffer,
      vk::AccessFlagBits::eShaderWrite,
      vk::AccessFlagBits::eTransferRead,
      vk::PipelineStageFlagBits::eComputeShader,
      vk::PipelineStageFlagBits::eTransfer);

    if (this->mTensorOutput->tensorType() == Tensor::TensorTypes::eDevice) {
        this->mTensorOutput->recordCopyFromDeviceToStaging(this->mCommandBuffer,
                                                           true);
    }
}

void
OpAlgoLhsRhsOut::postEval()
{
    KP_LOG_DEBUG("Kompute OpAlgoLhsRhsOut postEval called");

    this->mTensorOutput->mapDataFromHostMemory();
}

} // End namespace kp
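
As a usage sketch (not part of the library), a concrete operation can now derive from OpAlgoLhsRhsOut and only supply the shader, while the base class handles the validation, memory barriers, dispatch and staging copy shown above. The class name OpMyLhsRhsMult is hypothetical, and the shader path is simply reused from the earlier OpMyCustom example.

class OpMyLhsRhsMult : public kp::OpAlgoLhsRhsOut
{
  public:
    OpMyLhsRhsMult(std::shared_ptr<vk::PhysicalDevice> physicalDevice,
                   std::shared_ptr<vk::Device> device,
                   std::shared_ptr<vk::CommandBuffer> commandBuffer,
                   std::vector<std::shared_ptr<kp::Tensor>> tensors)
      : OpAlgoLhsRhsOut(physicalDevice, device, commandBuffer, tensors)
    {
        // Reuse the element-wise multiplication shader from the earlier example
        this->mShaderFilePath = "shaders/glsl/opmult.comp";
    }
};

int main() {

    kp::Manager mgr;

    auto tensorLhs = std::make_shared<kp::Tensor>(kp::Tensor({ 0., 1., 2. }));
    auto tensorRhs = std::make_shared<kp::Tensor>(kp::Tensor({ 2., 4., 6. }));
    auto tensorOut = std::make_shared<kp::Tensor>(kp::Tensor({ 0., 0., 0. }));

    // Initialise the tensors on the GPU, then evaluate the custom operation
    mgr.evalOpDefault<kp::OpTensorCreate>({ tensorLhs, tensorRhs, tensorOut });
    mgr.evalOpDefault<OpMyLhsRhsMult>({ tensorLhs, tensorRhs, tensorOut });

    // After postEval() the staging data has been mapped back, so
    // tensorOut->data() now holds { 0, 4, 12 }
}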