Getting Started

HazardFlow HDL allows users to modularly describe pipelined circuits with hazards, achieving cycle-level accuracy. It provides hazard interfaces and combinators to handle backward dependencies and simplify decomposition. By the end of this guide, users will be able to design and compile hardware modules into Verilog.

Installation

Install rustup. After the installation is complete, restart the terminal.

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone the HazardFlow repository:

$ git clone https://github.com/kaist-cp/hazardflow.git
$ cd hazardflow

Compiling HazardFlow Modules to Verilog

First, build the Rust macros:

$ cargo build -p hazardflow-macro

To generate the Verilog code for a 5-stage pipelined RISC-V CPU core:

# Generate a separate Verilog file for each submodule.
$ cargo run --release -- --target cpu --deadcode --wire-cache --system-task

# Generate an integrated Verilog file combining all submodules.
$ cargo run --release -- --target cpu --deadcode --wire-cache --merge --system-task

To generate the Verilog code for a systolic-array-based NPU core:

# Generate a separate Verilog file for each submodule.
$ cargo run --release -- --target gemmini --deadcode --wire-cache --system-task

# Generate an integrated Verilog file combining all submodules.
$ cargo run --release -- --target gemmini --deadcode --wire-cache --merge --system-task

The generated code is located in the build directory; each top-level module annotated with the #[synthesize] attribute is placed in its own subdirectory.

Tutorial

We will introduce some basic concepts in the HazardFlow HDL programming model and use HazardFlow HDL to describe an FIR (finite impulse response) filter and a custom FIFO which does not allow duplicated entries.

HazardFlow HDL's implementation is based on the concept of hazard interface. Here, we will provide a brief overview of the interface and combinator to help you understand how to use them for implementing hardware modules. For more details, please refer to the Concepts section.

Hazard Interface

In HazardFlow HDL, a "hazard protocol" is defined as a communication protocol that includes a payload, a resolver, and a ready condition. For a given hazard protocol H, its payload, resolver, and ready condition are represented as H::P, H::R, and H::ready, respectively.

The hazard interface is defined as I<H> for a given hazard protocol H. It will be used as an ingress or egress type for the module.

// Hazard protocol.
trait Hazard {
    type P: Copy;
    type R: Copy;

    fn ready(p: Self::P, r: Self::R) -> bool;
}

// Hazard interface.
struct I<H: Hazard>;

Forward signals: the payload of the interface and its validity.

  • The payload signal (H::P) is data sent from the sender to the receiver.
  • The valid signal (bool) represents the validity of the payload signal.

Backward signal: the resolver of the interface.

  • The resolver signal (H::R) is data sent from the receiver to the sender. It includes information used to control transferability (e.g., the receiver cannot accept the incoming payload for now).

Ready condition: the indicator whether the receiver is ready to receive the payload.

  • The ready condition (H::ready) returns whether the receiver is ready to receive, given the current payload and resolver pair.
  • Logically, we say "transfer occurs" when the valid signal is true and the ready condition is true, which means the receiver is ready to receive the valid payload from the sender.

Example: valid-ready protocol

The valid-ready protocol, which is the most commonly used protocol for communication between modules in hardware, can be represented as follows.

struct ExampleVrH;

impl Hazard for ExampleVrH {
    type P = u32;
    type R = bool;

    fn ready(p: u32, r: bool) -> bool {
        r
    }
}
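As a sanity check, the ready condition and the transfer condition described above can be exercised in plain Rust. The sketch below mirrors the trait and example from this section (it is an illustration, not the actual HazardFlow crate API):

```rust
// Plain-Rust sketch of the valid-ready hazard above (not the real crate API).
trait Hazard {
    type P: Copy;
    type R: Copy;
    fn ready(p: Self::P, r: Self::R) -> bool;
}

struct ExampleVrH;

impl Hazard for ExampleVrH {
    type P = u32;
    type R = bool;
    // The receiver is ready exactly when the backward `ready` bit is true.
    fn ready(_p: u32, r: bool) -> bool {
        r
    }
}

// "Transfer occurs" when the payload is valid and the receiver is ready.
fn transfer<H: Hazard>(payload: Option<H::P>, resolver: H::R) -> bool {
    payload.map_or(false, |p| H::ready(p, resolver))
}

fn main() {
    assert!(transfer::<ExampleVrH>(Some(0x42), true)); // valid && ready
    assert!(!transfer::<ExampleVrH>(Some(0x42), false)); // receiver not ready
    assert!(!transfer::<ExampleVrH>(None, true)); // no valid payload
}
```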

Module

A module has a state of type S and ingress and egress hazard interface types, I<IH> and I<EH>.

Conceptually, the module's behavior can be represented as a function of the following signature:

comb: impl Fn(Option<IH::P>, EH::R, S) -> (Option<EH::P>, IH::R, S)

It takes three input signals:

  • Ingress interface's forward signal (Option<IH::P>)
  • Egress interface's backward signal (EH::R)
  • Current state (S)

and returns three output signals:

  • Egress interface's forward signal (Option<EH::P>)
  • Ingress interface's backward signal (IH::R)
  • Next state (S)

Here, the forward signals are abstracted as Option<H::P>, where Some(p) indicates that the valid signal is true and the payload signal is p, while None indicates that the valid signal is false.

NOTE: A module can have zero or multiple hazard interfaces for its ingress or egress. For more details, please refer to the Interfaces section.
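To make the comb signature concrete, here is a plain-Rust sketch of a hypothetical running-sum module (the module and its policy are invented for illustration; only the function shape comes from the text):

```rust
// Plain-Rust sketch (hypothetical module, not the crate API): a running-sum
// module whose comb function follows the shape
//   Fn(Option<IH::P>, EH::R, S) -> (Option<EH::P>, IH::R, S).
// Ingress payload: u32; egress resolver: bool (receiver ready); state: u32 sum.
fn comb(ip: Option<u32>, er: bool, s: u32) -> (Option<u32>, bool, u32) {
    // Egress payload: the accumulated sum including the incoming value.
    let ep = ip.map(|p| s + p);
    // Ingress resolver: we are ready exactly when our receiver is ready.
    let ir = er;
    // Update the state only when a transfer actually occurs.
    let s_next = match ep {
        Some(sum) if er => sum,
        _ => s,
    };
    (ep, ir, s_next)
}

fn main() {
    // Valid input 3, receiver ready -> output Some(3), state becomes 3.
    assert_eq!(comb(Some(3), true, 0), (Some(3), true, 3));
    // Valid input 4, receiver NOT ready -> state must not change.
    assert_eq!(comb(Some(4), false, 3), (Some(7), false, 3));
    // No valid input -> no egress payload, state unchanged.
    assert_eq!(comb(None, true, 3), (None, true, 3));
}
```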

Combinator

A combinator is the idiomatic mechanism for attaching a module to an interface.

We define the generic combinator as a function fsm within each hazard interface in HazardFlow HDL; it computes the combinational logic each cycle:

impl<IH: Hazard> I<IH> {
    fn fsm<S: Copy, EH: Hazard>(
        self,
        init_state: S,
        comb: impl Fn(Option<IH::P>, EH::R, S) -> (Option<EH::P>, IH::R, S),
    ) -> I<EH> {
        ..
    }
}

It specifies the state type and ingress and egress interface types.

  • State type (S)
  • Ingress hazard interface type (I<IH>)
  • Egress hazard interface type (I<EH>)

Also, it accepts two arguments for representing the module's behavior:

  • Initial state when the circuit is reset (init_state)
  • Combinational logic representing the module's behavior (comb, as illustrated above)

For reusability, HazardFlow HDL provides a standard combinator library for developers. It defines many useful combinators (e.g., filter_map, mux, etc.), which are implemented as wrapper functions over the generic FSM combinator. For more details, please refer to the Interface Combinators section.

FIR Filter

In this tutorial, we will use HazardFlow HDL to describe an FIR (finite impulse response) filter, which is very commonly used in digital signal processing applications.

Specification

The FIR filter of order N performs the following operation:

It receives input signals from the stream x and outputs the weighted sum of the most recent N+1 input signals to the stream y. It can be expressed with the following formula:

\[ y[n] = b_0 x[n] + b_1 x[n-1] + ... + b_N x[n-N] = \sum^{N}_{i=0} b_i x[n-i] \]

where \( x[n] \) and \( y[n] \) represent the input and output signals, \( N \) represents the filter order, and \( b_i \) represents the i-th filter coefficient.

For example, the IO signals of a FIR filter of order 2 with coefficients [4, 2, 3] are as follows:

  n | x[n] | y[n]
  --|------|---------------------
  0 |  1   | 4·1 + 2·0 + 3·0 = 4
  1 |  4   | 4·4 + 2·1 + 3·0 = 18
  2 |  3   | 4·3 + 2·4 + 3·1 = 23
  3 |  2   | 4·2 + 2·3 + 3·4 = 26
  4 |  7   | 4·7 + 2·2 + 3·3 = 41
  5 |  0   | 4·0 + 2·7 + 3·2 = 20

For more details, please refer to Wikipedia.
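The specification above can be checked against a plain-Rust reference model (a software sketch of the formula, not the hardware description itself):

```rust
// Plain-Rust reference model of the order-2 FIR filter above
// (coefficients [4, 2, 3]); inputs before time 0 are treated as zero.
fn fir(coeffs: &[u32], inputs: &[u32]) -> Vec<u32> {
    (0..inputs.len())
        .map(|n| {
            coeffs
                .iter()
                .enumerate()
                // y[n] = sum_i b_i * x[n - i], treating x[<0] as 0.
                .map(|(i, b)| if n >= i { b * inputs[n - i] } else { 0 })
                .sum()
        })
        .collect()
}

fn main() {
    let y = fir(&[4, 2, 3], &[1, 4, 3, 2, 7, 0]);
    // Matches the table above: y = [4, 18, 23, 26, 41, 20].
    assert_eq!(y, vec![4, 18, 23, 26, 41, 20]);
}
```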

Modular Design

We could represent the FIR filter of order 2 in a modular way as follows:

As in the figure above, it can be divided into 3 submodules: window, weight, and sum.

window submodule:

  • It serves as a sliding window, always returning the latest 3 valid input signals as an array.
  • For example, if 1, 4, 3 are given as input signals, [1, 0, 0], [4, 1, 0], [3, 4, 1] will be returned as output signals.

weight submodule:

  • It keeps the weight vector [b0, b1, b2] persistent throughout the program.
  • It takes the input vector [v0, v1, v2] and returns the output vector [b0·v0, b1·v1, b2·v2].
  • This submodule can be simply represented by a map combinator.

sum submodule:

  • It takes the input vector and returns the sum of its elements.

Implementation

Based on the above submodules, we can implement the FIR filter in a concise and modular way:

fn fir_filter(input: Valid<u32>) -> Valid<u32> {
    let weight = Array::<u32, 3>::from([4, 2, 3]);

    input
        .window::<3>()
        .map(|ip| ip.zip(weight).map(|(e, wt)| e * wt))
        .sum()
}

We can describe the FIR filter with the window, map, and sum combinators in HazardFlow HDL; we assume the input interface Valid<u32> is provided. Valid<u32>, which is an alias of I<ValidH<u32, ()>>, is a valid interface whose payload is u32, whose resolver is empty ((), and whose ready function always returns true. In other words, as long as the input interface's forward signal is Some(u32) at a specific clock cycle, the receiver receives a valid payload. We can interpret this input interface as a stream of signal values flowing through the wires.

window combinator:

The window combinator is defined as follows:

impl<P: Copy + Default> Valid<P> {
    fn window<const N: usize>(self) -> Valid<Array<P, N>> {
        self.fsm_map(P::default().repeat::<{ N - 1 }>(), |ip, s| {
            let ep = ip.repeat::<1>().append(s).resize::<N>();
            let s_next = ep.clip_const::<{ N - 1 }>(0);
            (ep, s_next)
        })
    }
}

It takes a Valid<P> and returns Valid<Array<P, N>>. It tracks the latest N valid input signals. The fsm_map interface combinator is provided by the HazardFlow HDL standard library. It computes the egress payload and the next state based on the ingress payload and the current state, and updates the state when the ingress transfer happens. The initial state is defined as P::default().repeat::<{ N - 1 }>() in our example. The anonymous function is where we specify the FSM logic from the (ingress payload, current state) to the (egress payload, next state).
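The cycle-level behavior of window can be simulated in plain Rust (a sketch of the semantics described above with N = 3, not the crate implementation):

```rust
// Plain-Rust simulation of the cycle-level behavior of `window::<3>` above.
// State: the previous N-1 = 2 inputs, newest first, initialized to default (0).
fn window3(inputs: &[u32]) -> Vec<[u32; 3]> {
    let mut state = [0u32; 2];
    let mut out = Vec::new();
    for &ip in inputs {
        // Egress payload: the new input prepended to the stored window.
        let ep = [ip, state[0], state[1]];
        out.push(ep);
        // Next state: keep the newest N-1 elements, dropping the oldest.
        state = [ep[0], ep[1]];
    }
    out
}

fn main() {
    // Inputs 1, 4, 3 yield windows [1,0,0], [4,1,0], [3,4,1], as in the text.
    assert_eq!(window3(&[1, 4, 3]), vec![[1, 0, 0], [4, 1, 0], [3, 4, 1]]);
}
```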

map combinator:

The map combinator is used to represent the weight submodule.

map(|ip| ip.zip(weight).map(|(e, wt)| e * wt))

It takes a Valid<Array<u32, N>> and returns an egress hazard interface Valid<Array<u32, N>>. It transforms the i-th element of the ingress payload ip[i] into ip[i] * weight[i], and leaves the resolver untouched. The map interface combinator is provided by the HazardFlow HDL standard library. We can interpret it as a stateless version of fsm_map. In the application-specific logic in the map interface combinator, we use the zip and map methods for manipulating the ingress payload signal.

sum combinator:

The sum combinator is defined as follows:

impl<const N: usize> Valid<Array<u32, N>> {
    fn sum(self) -> Valid<u32> {
        self.map(|ip| ip.fold(0, |acc, e| acc + e))
    }
}

It takes a Valid<Array<u32, N>> and returns Valid<u32>. It transforms the ingress payload into the sum of its elements. In the application-specific logic in the map interface combinator, we use the fold method, which aggregates the data within the array.

You can find the implementation in fir_filter.rs.

You can generate the Verilog codes with the following commands:

# Generate a separate Verilog file for each submodule.
$ cargo run --release -- --target fir_filter --deadcode --wire-cache

# Generate an integrated Verilog file combining all submodules.
$ cargo run --release -- --target fir_filter --deadcode --wire-cache --merge

FIFO without duplication

In this tutorial, we will implement a custom FIFO which can be used to provide balanced transfers for multiple input streams, thereby offering balanced latency for each input stream.

Specification

The custom FIFO with N input streams and M entries performs the following operation:

It has M entries and receives payloads from N input streams. The key point is that it does not accept payloads from an input stream that already has an entry in the FIFO.

For example, in the figure above, the FIFO contains P1 and P2, which are payloads received from in_1 and in_N, respectively. Therefore, the next payload will be accepted from an input stream other than in_1 and in_N. in_1 and in_N become selectable again after P1 and P2 are removed from the FIFO via the output (the circles in front of the FIFO indicate whether each input stream can be received; a red circle means it cannot, a green circle means it can).

Modular Design

We could represent the custom FIFO in a modular way as follows (U<M> is a bitvector of width M):

masked_merge submodule:

  • It takes N valid-ready interfaces (Vr<P> = I<VrH<P, ()>>; VrH is defined here).
  • It returns a valid-ready hazard interface (I<VrH<(P, U<{ clog2(N) }>), U<N>>>).
    • The U<{ clog2(N) }> in the payload indicates the index of selected ingress interface.
    • The U<N> in the resolver indicates the mask bits for the selection.
  • It selects an ingress interface that has a valid payload and whose mask bit is false.
    • For example, if N is 4 and the mask bits are [true, true, false, true], then it will try to select from the third ingress interface.
    • If multiple ingress interfaces are selectable, the one with the smallest index is chosen.
    • In this example, the mask bits indicate the origins of the payloads currently in the FIFO.

map_resolver submodule:

  • It transforms the inner resolver signal from the FIFO state (FifoS) to the mask bits (U<N>).
    • FifoS indicates the current state of the FIFO, including elements and head/tail indices.
    • The i-th element of the mask bits (U<N>) indicates whether one of the payloads currently in the FIFO was selected from the i-th ingress interface of M0.
  • It does not touch the forward signals and the ready signal.

transparent_fifo submodule:

  • It takes a payload from the ingress interface when the FIFO is not full.
  • It returns the oldest payload in the FIFO to the egress interface.
  • It sends the FIFO's fullness (Ready) and the current FIFO state (FifoS) as the ingress resolver.

map submodule:

  • It drops the unnecessary index information in the payload.

Implementation

Based on the above submodules, we can implement the custom FIFO in a concise and modular way:

fn custom_fifo(ingress: [Vr<u32>; N]) -> Vr<u32> {
    ingress
        .masked_merge()
        .map_resolver_inner::<FifoS<(u32, U<{ clog2(N) }>), M>>(|fifo_s| {
            fifo_s.inner_with_valid().fold(false.repeat::<N>(), |acc, i| {
                if let Some((_, idx)) = i {
                    acc.set(idx, true)
                } else {
                    acc
                }
            })
        })
        .transparent_fifo()
        .map(|(ip, _idx)| ip)
}

  • We used u32 for the payload type, N for the number of ingress interfaces, and M for the FIFO size.
  • fifo_s.inner_with_valid() represents the inner elements of the FIFO.
    • It has type Array<HOption<u32>, M>, HOption represents the validity of the element.
  • We fold the inner elements of the FIFO in the map_resolver_inner combinator:
    • The initializer is a boolean array of size N with all elements false.
    • The index of the initializer array indicates the index of the ingress interfaces.
    • We iterate through all the elements within the FIFO and set the accumulator's value at each element's index to true.
    • The final result indicates which ingress interfaces are present in the current FIFO.
  • We send back this resolver to the masked_merge combinator so it can choose the next ingress interface.
  • We filter out the unnecessary index information in the last map combinator.
  • The implementation of the masked_merge() combinator will be explained in the Implementing Combinators section.
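The mask-bit fold inside map_resolver_inner can be sketched in plain Rust. Here the FIFO's inner elements are modeled as Option<(payload, source index)>, a hypothetical stand-in for Array<HOption<(u32, U<{ clog2(N) }>)>, M>:

```rust
// Plain-Rust sketch of the mask-bit computation in `map_resolver_inner` above.
const N: usize = 4; // number of ingress interfaces
const M: usize = 3; // FIFO capacity

fn mask_bits(entries: [Option<(u32, usize)>; M]) -> [bool; N] {
    entries.iter().fold([false; N], |mut acc, e| {
        if let Some((_, idx)) = e {
            // Mark the ingress interface this entry originated from.
            acc[*idx] = true;
        }
        acc
    })
}

fn main() {
    // Two entries, from ingress 0 and ingress 3: those streams are masked out.
    let m = mask_bits([Some((10, 0)), Some((20, 3)), None]);
    assert_eq!(m, [true, false, false, true]);
    // Empty FIFO: every ingress interface is selectable.
    assert_eq!(mask_bits([None; M]), [false; N]);
}
```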

You can find the full implementation in custom_fifo.rs.

You can generate the Verilog codes with the following commands:

# Generate a separate Verilog file for each submodule.
$ cargo run --release -- --target custom_fifo --deadcode --wire-cache

# Generate an integrated Verilog file combining all submodules.
$ cargo run --release -- --target custom_fifo --deadcode --wire-cache --merge

Congratulations! You finished the tutorial!

Concepts

  • Signal: The data types that can be transferred through wires.
  • Interface: Communication protocol between modules. It contains a forward signal, a backward signal, and a transfer condition.
  • Modules: Hardware module. It serves as the building block for a digital circuit.
  • Interface Combinators: The mechanism of chaining two interfaces together.

Signal

Signal is a collection of types that can be transferred through wires in HazardFlow HDL. We divide these types into two categories: scalar and compound. HazardFlow HDL reuses some data types from the Rust programming language. Normally, a data type that implements the Copy trait in Rust can be considered a signal type in HazardFlow HDL.

NOTE: The Copy trait in Rust and the signal type in HazardFlow are not in a perfect 1:1 relationship. While the signal type in HazardFlow must implement Rust's Copy trait, there are some types like function pointers (e.g., fn() -> i32) that implement the Copy trait but are not treated as signal types in HazardFlow.

For more information on the Rust data types, please refer to The Rust Programming Language Book.

Scalar Types

A scalar type represents a single value.

Boolean

We interpret the Boolean type the same as the bool type in the Rust programming language. In HazardFlow HDL, a Boolean value is encoded as a single bit that is 1 when the value is true and 0 when it is false, and it can be sent through the wires.

Unsigned Integer

In the HazardFlow HDL, we support arbitrary bit-width unsigned integers. We define it as U<N>, where N is the bit-width. We provide a handful of primitive functions for supporting unsigned integers' arithmetic operations, type conversion, and also ordering between different unsigned integers.

  • Arithmetic operations
    • Basic arithmetic operations +, -, and * are supported, as well as logical left and right shifts.
    • Some of the arithmetic operations might lead to bit-width changes in the final result.
  • Type conversion
    • We support converting Rust i32, u32, u8, usize, u128, and bool into U<N>.
    • We can convert U<N> into u32, u8, and also bool.
  • Ordering
    • We provide ordering functions for developers to easily compare two unsigned integers.

Compound Types

Compound types can group multiple values into one type.

Enum

We interpret the enum type in HazardFlow HDL the same as in Rust, and the pattern matching feature is supported. The enum type gives you a way of saying a value is one of a possible set of values.

Example: HOption

Similar to the Option type in Rust, the HOption type in HazardFlow HDL is also an enum type. We define HOption type as:

#[derive(Debug, Clone, Copy, HEq)]
enum HOption<T: Copy> {
    /// No value.
    None,
    /// Some value of type `T`.
    Some(T),
}

We provide almost the same handy functions as Rust for operating on the HOption type. For example, the map function applies a function f to the inner value of an HOption if the value is Some(T), and returns None otherwise. The and_then function is often used to chain fallible operations that may return None; it flattens the result, avoiding nested HOption.
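Since HOption mirrors Rust's Option, its map and and_then behave like the standard library versions demonstrated here:

```rust
// `HOption`'s `map` and `and_then` mirror the std `Option` methods shown here.
fn checked_halve(x: u32) -> Option<u32> {
    // A fallible operation: only even numbers can be halved exactly.
    if x % 2 == 0 { Some(x / 2) } else { None }
}

fn main() {
    // `map` applies a function to the inner value, keeping the wrapper.
    assert_eq!(Some(3).map(|x| x + 1), Some(4));
    assert_eq!(None::<u32>.map(|x| x + 1), None);

    // `and_then` chains fallible operations and flattens the result,
    // avoiding a nested Option<Option<u32>>.
    assert_eq!(Some(8).and_then(checked_halve), Some(4));
    assert_eq!(Some(3).and_then(checked_halve), None);
}
```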

Tuple

The tuple type is the same as the tuple type in Rust, and as long as the types within a tuple are signals, then we can treat the tuple type as a signal type that can be sent through the wires too. For example, (U<32>, bool, HOption<U<8>>) can be considered as a signal type.

Struct

Similar to the tuple type, if every field of a struct is a signal type, we can consider the struct as a signal type. For example, the AXI stream can be constructed as follows:

#[derive(Debug, Clone, Copy)]
struct Axis {
    data: U<512>, // TDATA signal
    keep: U<64>,  // TKEEP signal
    last: bool,   // TLAST signal
    user: bool,   // TUSER signal
}

Array

The Array type is primitive in the HazardFlow HDL. We can define an N size sequence of V type data as Array<V, N>. The Array type comes with a handful of handy functions, including indexing, clipping an array, zipping two arrays together, etc.

Example: Unsigned Integer

In HazardFlow HDL, we represent an unsigned integer as an Array of bool with bit-width N. When a bool is true, we interpret it as a bit with value 1, and false as 0.

type U<const N: usize> = Array<bool, N>;
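The bit-array interpretation can be sketched in plain Rust. We assume index 0 holds the least significant bit; the actual bit order of HazardFlow's U<N> is an implementation detail, so this is purely illustrative:

```rust
// Plain-Rust sketch of interpreting an array of bools as an unsigned integer,
// assuming index 0 is the least significant bit (an assumption for illustration).
fn to_u32<const N: usize>(bits: [bool; N]) -> u32 {
    bits.iter()
        .rev()
        // Fold from the most significant bit down: shift then add the bit.
        .fold(0, |acc, &b| (acc << 1) | (b as u32))
}

fn main() {
    // [false, true, false, true] = 0b1010 = 10 with LSB-first ordering.
    assert_eq!(to_u32([false, true, false, true]), 10);
    assert_eq!(to_u32([true; 4]), 15);
    assert_eq!(to_u32([false; 4]), 0);
}
```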

Interface

Interface is a communication protocol between modules. HazardFlow HDL provides the Hazard trait, which abstracts communication protocols with arbitrary transfer conditions (e.g., the valid-ready protocol). Like signals, interfaces also support compound types such as tuples and structs.

Hazard

Motivation: Handshake

In hardware semantics, a communication protocol is described as a handshake mechanism.

As an example, the most commonly used valid-ready protocol is described as below:

  • The sender generates a 1-bit valid signal and a payload signal every clock cycle.
  • The receiver generates a 1-bit ready signal each clock cycle.
  • A transfer occurs when both the valid and ready signals are asserted (i.e., both are true).
  • It's important to note that the payload is continuously present on the wire, regardless of the valid or ready signals.

Example waveform:

  • At cycle 1, the sender turns on the valid bit with payload 0x42.
    • Transfer does not happen because the receiver is not ready.
  • At cycle 2, the receiver turns on the ready bit.
    • Transfer happens because both the valid and ready signals are true.

Specification

In HazardFlow HDL, we abstract any arbitrary communication protocol into the Hazard trait. It describes the necessary information for communication: payload, resolver, and ready function.

trait Hazard {
    type P: Copy;
    type R: Copy;

    fn ready(p: Self::P, r: Self::R) -> bool;
}

For any hazard type H, its member type and functions have the following meaning:

  • H::P: Payload signal type.
  • H::R: Resolver signal type.
  • H::ready: Returns whether the receiver is ready to receive, given the current payload and resolver pair.

Examples

We provide a few handy primitive hazard interfaces for developers.

ValidH

ValidH represents a communication without backpressure (always ready to receive).

It has the following specification:

struct ValidH<P: Copy, R: Copy>;

impl<P: Copy, R: Copy> Hazard for ValidH<P, R> {
    type P = P;
    type R = R;

    fn ready(p: P, r: R) -> bool {
        true
    }
}

For reusability, we added additional resolver signals that simply flow from the receiver to the sender.

AndH

For a given hazard specification H, the conjunctive AndH<H> specification adds to H's resolver signal an availability bit flag. Then the receiver is ready if it is available and ready according to the internal specification H at the same time.

struct AndH<H: Hazard>;

struct Ready<R: Copy> {
    ready: bool,
    inner: R,
}

impl<H: Hazard> Hazard for AndH<H> {
    type P = H::P;
    type R = Ready<H::R>;

    fn ready(p: H::P, r: Ready<H::R>) -> bool {
        r.ready && H::ready(p, r.inner)
    }
}

The ready field of the Ready struct represents the availability of the receiver.

VrH

We defined the valid-ready hazard VrH<P, R> as AndH<ValidH<P, R>>.

type VrH<P: Copy, R: Copy> = AndH<ValidH<P, R>>;

For reusability, we added additional resolver signals that simply flow from the receiver to the sender.
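The conjunctive ready condition can be checked with a plain-Rust sketch: with ValidH as the inner hazard (its ready is always true), the ready condition of AndH<ValidH>, i.e. the valid-ready VrH, reduces to the receiver's availability bit. This is a standalone model, not the crate's code:

```rust
// Plain-Rust sketch of the conjunctive ready condition of `AndH` above.
#[derive(Clone, Copy)]
struct Ready<R: Copy> {
    ready: bool,
    inner: R,
}

// `ValidH`'s ready: always true, regardless of payload and resolver.
fn validh_ready<P, R>(_p: P, _r: R) -> bool {
    true
}

// `AndH<ValidH>`'s ready: the availability bit AND the inner ready.
fn andh_ready<P: Copy, R: Copy>(p: P, r: Ready<R>) -> bool {
    r.ready && validh_ready(p, r.inner)
}

fn main() {
    // With the always-ready inner hazard, only the availability bit matters.
    assert!(andh_ready(0u32, Ready { ready: true, inner: () }));
    assert!(!andh_ready(0u32, Ready { ready: false, inner: () }));
}
```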

Interface

An interface is an abstraction that represents the IO of a hardware module.

Typically, a single interface is composed of zero, one, or multiple hazard interfaces.

Specification

trait Interface {
    type Fwd: Copy;
    type Bwd: Copy;

    ...
}

For any interface type I, its member types have the following meaning:

  • I::Fwd: Forward signal type.
  • I::Bwd: Backward signal type.

It also contains other methods related to modules and combinators; please refer to those sections for further reading.

Hazard Interface

For an arbitrary hazard specification H, we define the hazard interface I<H, D>, where D is the dependency type. (For more information on the dependency, please refer to the dependency section.)

struct I<H: Hazard, const D: Dep>;

impl<H: Hazard, const D: Dep> Interface for I<H, D> {
    type Fwd = HOption<H::P>;
    type Bwd = H::R;
}

  • The interface's forward signal is an HOption type of the hazard payload.
  • The backward signal is the hazard's resolver.
  • When the forward signal is Some(p), the sender is sending a valid payload p; otherwise, it is not sending a valid payload.
  • When we have payload.is_some_and(|p| H::ready(p, r)), the transfer happens.

We define Valid and Vr as the hazard interface types for ValidH and VrH, respectively.

type Valid<P> = I<ValidH<P, ()>, { Dep::Helpful }>;
type Vr<P> = I<VrH<P, ()>, { Dep::Helpful }>;

Compound Interface

Compound types such as tuple, struct, and array also implement the Interface trait.

These interfaces are commonly used for IO of 1-to-N or N-to-1 combinators.

Tuple

Tuple of interfaces (If1, If2) implements Interface trait as follows:

// In practice, it is implemented as a macro.
impl<If1: Interface, If2: Interface> Interface for (If1, If2) {
    type Fwd = (If1::Fwd, If2::Fwd);
    type Bwd = (If1::Bwd, If2::Bwd);
}

  • The forward signal of the tuple interface is the tuple of the interfaces' forward signals.
  • The backward signal of the tuple interface is the tuple of the interfaces' backward signals.

Struct

#[derive(Debug, Interface)]
struct StructIf<If1: Interface, If2: Interface> {
    i1: If1,
    i2: If2,
}

By applying the Interface derive macro to a struct in which all fields are interface type, the struct itself can also become an interface type.

Array

Array of interfaces [If; N] also implements Interface trait as follows:

impl<If: Interface, const N: usize> Interface for [If; N] {
    type Fwd = Array<If::Fwd, N>;
    type Bwd = Array<If::Bwd, N>;
}

  • The forward signal of the array interface is the array of the interfaces' forward signals.
  • The backward signal of the array interface is the array of the interfaces' backward signals.

Modules

In Rust, the FnOnce trait is automatically implemented for types that can be called once. For example, the primitive function fn(i: Ingress) -> Egress and closures that might consume captured variables (|i: Ingress| -> Egress { .. }) implement the FnOnce(Ingress) -> Egress trait.

And in HazardFlow, we treat a function m that implements the FnOnce(Ingress) -> Egress trait, where Ingress and Egress implement the Interface trait, as a module with Ingress as its ingress interface and Egress as its egress interface.

The common way to construct a module is chaining interface combinators. Please refer to the Interface Combinators for more information.

Example: FIR Filter

In the tutorial, the FIR filter module was implemented as follows:

fn fir_filter(input: Valid<u32>) -> Valid<u32> {
    let weight = Array::<u32, 3>::from([4, 2, 3]);

    input
        .window::<3>()
        .map(|ip| ip.zip(weight).map(|(e, wt)| e * wt))
        .sum()
}

The fir_filter function implements FnOnce(Valid<u32>) -> Valid<u32>, so we can treat it as a module with Valid<u32> as both its ingress and egress interface.

Module Combinators

We provide some convenient module combinators that take a module, modify it, and return a new module.

seq

Generates a 1D systolic array from an array of modules.

fn seq<I: Interface, O: Interface, J: Interface, const N: usize>(
    ms: [fn(I, J) -> (O, J); N],
) -> impl FnOnce([I; N], J) -> ([O; N], J)

You can construct an array of modules explicitly from elements, or if all the modules have the same behavior, you can use the from_fn API.

flip

Flips a module's input and output.

fn flip<I1: Interface, I2: Interface, O1: Interface, O2: Interface, T>(
    f: T,
) -> impl FnOnce(I2, I1) -> (O2, O1)
where
    T: FnOnce(I1, I2) -> (O1, O2),

Interface Combinators

We can chain two interfaces by using combinators. We define a combinator as a method within each hazard interface. The combinator specifies the combinational logic of calculating the egress interface payload, ingress interface resolver, and the next state of the combinator.

Note that in this section, you will see the dependency type D in the combinators' signatures; for more information about the dependency, please refer to the dependency section.

Combine Blackbox Module to Interface

The comb method within the interface trait is used to combine the blackbox module to the given interface and return the egress interface.

trait Interface {
    ..

    fn comb<E: Interface>(self, m: impl FnOnce(Self) -> E) -> E {
        m(self)
    }
}

It is useful when we want to combine multiple modules together. For example, we can combine multiple stage modules with comb in the CPU core.

fn core(
    imem: impl FnOnce(Vr<MemReq>) -> Vr<MemRespWithAddr>,
    dmem: impl FnOnce(Vr<MemReq>) -> Vr<MemRespWithAddr>,
) {
    fetch::<START_ADDR>(imem)
        .comb(decode)
        .comb(exe)
        .comb(move |ingress| mem(ingress, dmem))
        .comb(wb)
}

  • imem and dmem are modules for instruction memory and data memory, respectively.
  • We chain the five submodules fetch, decode, exe, mem, and wb by using the comb method.
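Since comb is just function application, the chaining style can be sketched in plain Rust with integers standing in for interfaces (the trait and the "stages" below are invented for illustration):

```rust
// Plain-Rust sketch of `comb`: it is just function application (`m(self)`),
// which lets modules be chained in pipeline order.
trait Comb: Sized {
    fn comb<E>(self, m: impl FnOnce(Self) -> E) -> E {
        m(self)
    }
}
impl<T> Comb for T {}

fn main() {
    // Hypothetical "stages" on plain integers, chained like the CPU stages.
    let out = 1.comb(|x| x + 1).comb(|x| x * 10);
    assert_eq!(out, 20);
}
```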

Generic FSM Combinator

The essence of the interface combinator is the generic fsm combinator.

We provide the fsm combinator that transforms the ingress interface into the egress interface with a finite state machine. With this combinator, you can represent an arbitrary FSM.

trait Interface {
    type Fwd: Copy;
    type Bwd: Copy;

    fn fsm<E: Interface, S: Copy>(
        self,
        init_state: S,
        f: impl Fn(Self::Fwd, E::Bwd, S) -> (E::Fwd, Self::Bwd, S),
    ) -> E {
        .. // compiler magic
    }

    ..
}

It accepts two arguments which are the FSM's initial state and the combinational logic.

We represent the combinational logic as a function (f). It takes three input signals:

  • Ingress interface's forward signal (Self::Fwd)
  • Egress interface's backward signal (E::Bwd)
  • Current state (S)

and returns three output signals:

  • Egress interface's forward signal (E::Fwd)
  • Ingress interface's backward signal (Self::Bwd)
  • Next state (S)

Standard Combinator Library

We provide standard combinator library for developers to facilitate their work. We can roughly categorize the library into the following categories.

  Category    | Description
  ------------|------------------------------------------------------------------
  Mapping     | Maintains the 1-to-1 relation between the ingress/egress interface.
  1-to-N      | Splits the ingress interface into multiple egress interfaces.
  N-to-1      | Merges multiple ingress interfaces into one egress interface.
  Register    | Stores the state into registers and could delay for one or multiple cycles.
  Source/sink | Generates or consumes the interface.
  FSM         | Runs a finite state machine with an internal state.
  Conversion  | Converts ingress hazard interface into egress hazard interface.

For more details about all combinators, please refer to the rustdoc.

Mapping

These combinators either transform the payload (ingress to egress) or transform the resolver (egress to ingress).

We demonstrate the two most representative combinators: filter_map and map_resolver.

filter_map

As the name suggests, the filter_map combinator has the functionality of both filter and map. It filters out payloads not satisfying certain conditions and transforms the ingress payload P to the egress payload EP. We demonstrate the filter_map which takes Vr<P> and returns Vr<EP>.

It can be implemented using the fsm combinator like this:

impl<P: Copy> Vr<P> {
    fn filter_map<EP: Copy>(self, f: impl Fn(P) -> HOption<EP>) -> Vr<EP> {
        self.fsm::<Vr<EP>, ()>((), |ip, er, _| {
            let (ep, ir) = (ip.and_then(f), er);
            (ep, ir, ())
        })
    }
}

It is stateless, and egress transfer happens when (1) ingress transfer happens, and (2) f returns Some with given ingress payload.

For example, let's consider the following f:

fn f(i: u32) -> HOption<bool> {
    if i == 0 {
        // If `i` is zero, filters out the payload.
        None
    } else if i & 1 == 0 {
        // If `i` is even number, returns `true`.
        Some(true)
    } else {
        // If `i` is odd number, returns `false`.
        Some(false)
    }
}

Then the cycle-level behavior of filter_map is as follows:

(T and F means true and false, respectively)

  • Cycle 0: transfer happens both at the ingress and egress sides.
  • Cycle 2: transfer happens at the ingress side, but the payload is filtered out by f.
  • Cycle 5: transfer does not happen because the egress side is not ready to receive.

Note that filter_map can work with ingress interfaces I<VrH<P, R>, D>, I<ValidH<P, R>, D>, and I<H, D>, and there are variants of filter_map like filter_map_drop_with_r.
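
To make the transfer conditions concrete, here is a plain-Rust software model of filter_map's per-cycle combinational logic. This is a hypothetical sketch using std Option instead of the HazardFlow types; the name filter_map_cycle is invented for illustration.

```rust
// Hypothetical software model of one `filter_map` cycle: given the ingress
// payload and the egress ready signal, compute the egress payload and the
// ingress ready signal.
fn f(i: u32) -> Option<bool> {
    if i == 0 {
        None // filter out zero
    } else {
        Some(i & 1 == 0) // `true` for even, `false` for odd
    }
}

fn filter_map_cycle(ip: Option<u32>, er_ready: bool) -> (Option<bool>, bool) {
    (ip.and_then(f), er_ready)
}

fn main() {
    // Even payload: egress transfer happens with `Some(true)`.
    assert_eq!(filter_map_cycle(Some(4), true), (Some(true), true));
    // Zero payload: filtered out, so no egress transfer.
    assert_eq!(filter_map_cycle(Some(0), true), (None, true));
    // Egress not ready: ingress is not ready either, so no transfer happens.
    assert_eq!(filter_map_cycle(Some(3), false), (Some(false), false));
}
```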

map_resolver

This combinator transforms the egress resolver to the ingress resolver and leaves the payload untouched. We demonstrate the map_resolver with ingress interface I<VrH<P, R>, D>. Similar to filter_map, map_resolver has other variants and can also work with other ingress interfaces.

impl<P: Copy, R: Copy, const D: Dep> I<VrH<P, R>, D> {
    fn map_resolver<ER: Copy>(self, f: impl Fn(Ready<ER>) -> R) -> I<VrH<P, ER>, D> {
        self.fsm::<I<VrH<P, ER>, D>, ()>(|ip, er, _| {
            let ep = ip;
            let ir = Ready::new(er.ready, f(er));
            (ep, ir, ())
        })
    }
}
  • It is stateless.
  • It transforms the egress resolver into an ingress resolver with a given f within the same clock cycle.
  • The egress transfer condition is always the same as the ingress transfer condition.
  • It leaves the ingress payload untouched.
  • This combinator is usually used as a connector between two other combinators whose resolver types are not compatible.

For example, let's consider the following f:

fn f(i: u32) -> bool {
    // Returns the parity of `i` (`true` if even).
    i & 1 == 0
}

Then the cycle-level behavior of map_resolver is as follows:

  • It does not touch the forward signals and backward ready signal.
  • It transforms the backward inner signal to the parity.
  • Transfer happens at both sides at cycle 1 and 5.
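
The per-cycle logic can be modeled in plain Rust as below. This is a hypothetical sketch: map_resolver_cycle is an invented name, and the resolvers are modeled as plain tuples instead of the Ready type.

```rust
// Hypothetical software model of one `map_resolver` cycle. The egress
// resolver carries (ready, inner: u32); the ingress resolver carries
// (ready, parity of inner).
fn map_resolver_cycle(ip: Option<u32>, er: (bool, u32)) -> (Option<u32>, (bool, bool)) {
    let ep = ip; // payload passes through untouched
    let ir = (er.0, er.1 & 1 == 0); // ready untouched, inner mapped by `f`
    (ep, ir)
}

fn main() {
    // Forward signals are untouched; the backward inner signal becomes the parity.
    assert_eq!(map_resolver_cycle(Some(7), (true, 4)), (Some(7), (true, true)));
    assert_eq!(map_resolver_cycle(None, (false, 3)), (None, (false, false)));
}
```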

1-to-N

These combinators either duplicate a single ingress interface into multiple egress interfaces or select one of the egress interfaces to transfer the payload.

We demonstrate the two most representative combinators: lfork and branch.

lfork

This combinator delivers the ingress payload to all the egress interfaces when every egress interface is ready to receive it, and combines all the egress interfaces' resolvers into the ingress resolver. We demonstrate lfork with the ingress interface Vr<P, D>, whose resolver is (). Note that lfork can work with other ingress interfaces such as I<VrH<P, (R1, R2)>, D>, I<VrH<P, Array<R, N>>, D>, etc.

impl<P: Copy, const D: Dep> Vr<P, D> {
    fn lfork(self) -> (Vr<P, D>, Vr<P, D>) {
        self.fsm::<(Vr<P, D>, Vr<P, D>), ()>(|ip, (er1, er2), _| {
            let ep1 = if er2.ready { ip } else { None };
            let ep2 = if er1.ready { ip } else { None };
            let ir = Ready::new(er1.ready && er2.ready, ());
            ((ep1, ep2), ir, ())
        })
    }
}
  • It is stateless.
  • Ingress, first egress, and second egress transfers happen at the same cycle.

The example cycle-level behavior of lfork is as follows:

  • Cycle 1, 4, 5: transfer happens at ingress, first egress, and second egress sides.
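
The "all sides or none" behavior can be checked with a plain-Rust model of one lfork cycle (a hypothetical sketch; lfork_cycle is an invented name):

```rust
// Hypothetical software model of one `lfork` cycle with two egress interfaces.
fn lfork_cycle(ip: Option<u32>, er1: bool, er2: bool) -> ((Option<u32>, Option<u32>), bool) {
    let ep1 = if er2 { ip } else { None };
    let ep2 = if er1 { ip } else { None };
    let ir = er1 && er2;
    ((ep1, ep2), ir)
}

fn main() {
    // Both egress sides ready: all three transfers happen at the same cycle.
    assert_eq!(lfork_cycle(Some(5), true, true), ((Some(5), Some(5)), true));
    // One egress side not ready: no transfer happens anywhere.
    assert_eq!(lfork_cycle(Some(5), true, false), ((None, Some(5)), false));
}
```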

branch

This combinator splits a single ingress interface into multiple egress interfaces and only selects one of the egress interfaces to transfer the payload, also combines all the egress interfaces' resolvers into the ingress resolver. We demonstrate branch with the ingress interface Vr<P, BoundedU<N>>.

BoundedU<N> can be considered a bounded unsigned integer: if N is 3, the possible values are 0, 1, and 2. For more information about BoundedU<N>, please refer to the docs.

impl<P: Copy, const N: usize> Vr<(P, BoundedU<N>)> {
    fn branch(self) -> [Vr<P>; N] {
        self.fsm::<[Vr<P>; N], ()>(|ip, ers, _| {
            let Some((ip, sel)) = ip else {
                // Ingress ready signal is true when valid signal is false.
                return (None.repeat::<N>(), Ready::new(true, ers.map(|r| r.inner)), ());
            };

            let ep = None.repeat::<N>().set(sel.value(), Some(ip));
            let ir = Ready::new(ers[sel.value()].ready, ers.map(|r| r.inner));

            (ep, ir, ())
        })
    }
}
  • self is the ingress interface Vr<(P, BoundedU<N>)>.
  • We can interpret BoundedU<N> as the selector that chooses the egress interface for transferring the ingress payload.
  • When the selected egress interface's ready signal is true and the ingress payload is valid, both the ingress and egress transfers happen; otherwise, neither happens.
  • The ingress payload is transferred only to the selected egress interface.
  • The ingress resolver and all the egress resolvers are ().

Let's assume the ingress payload is HOption<(u32, BoundedU<2>)>. The payload is u32, and we split the ingress interface into 2 egress interfaces.

The cycle level behavior of branch:

  • Cycle 2: transfer happens at ingress and first egress side.
  • Cycle 5: transfer happens at ingress and second egress side.
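
The selection logic can be sketched in plain Rust as one combinational cycle. This is a hypothetical model: branch_cycle is an invented name, and a plain usize stands in for BoundedU<2>.

```rust
// Hypothetical software model of one `branch` cycle with N = 2. `sel` plays
// the role of `BoundedU<2>`.
fn branch_cycle(ip: Option<(u32, usize)>, ers: [bool; 2]) -> ([Option<u32>; 2], bool) {
    match ip {
        // Ingress ready is true when the valid signal is false.
        None => ([None, None], true),
        Some((p, sel)) => {
            let mut ep = [None, None];
            ep[sel] = Some(p);
            (ep, ers[sel])
        }
    }
}

fn main() {
    // The payload goes only to the selected (second) egress interface.
    assert_eq!(branch_cycle(Some((9, 1)), [false, true]), ([None, Some(9)], true));
    // The selected (first) egress interface is not ready: no transfer.
    assert_eq!(branch_cycle(Some((9, 0)), [false, true]), ([Some(9), None], false));
}
```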

N-to-1

These combinators merge multiple ingress interfaces into a single egress interface. The egress interface could contain all the ingress interfaces' payloads and resolvers, or select one of the ingress interfaces.

We demonstrate the two most representative combinators: join and merge.

join

This combinator merges the ingress interfaces' payloads and resolvers. We demonstrate this combinator with the ingress interfaces (Vr<P1>, Vr<P2>).

impl<P1: Copy, P2: Copy> JoinExt for (Vr<P1>, Vr<P2>) {
    type E = Vr<(P1, P2)>;

    fn join(self) -> Vr<(P1, P2)> {
        self.fsm::<Vr<(P1, P2)>, ()>(|(ip1, ip2), er, _| {
            let ep = ip1.zip(ip2);
            let ir1 = if ip2.is_some() { er } else { Ready::invalid(()) };
            let ir2 = if ip1.is_some() { er } else { Ready::invalid(()) };
            (ep, (ir1, ir2), ())
        })
    }
}
  • The egress payload is a tuple of all the ingress payloads.
  • The egress payload is valid only when all the ingress payloads are valid.
  • The ingress transfers happen only when all the ingress payloads are valid and the egress interface is ready to receive the payload.

The example cycle-level behavior of join is as follows:

  • Cycle 1, 4, 5: transfer happens.
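
A plain-Rust model of one join cycle makes the "all valid or nothing" condition explicit (a hypothetical sketch; join_cycle is an invented name):

```rust
// Hypothetical software model of one `join` cycle for (Vr<P1>, Vr<P2>).
fn join_cycle(ip1: Option<u32>, ip2: Option<u32>, er: bool) -> (Option<(u32, u32)>, (bool, bool)) {
    let ep = ip1.zip(ip2);
    let ir1 = if ip2.is_some() { er } else { false };
    let ir2 = if ip1.is_some() { er } else { false };
    (ep, (ir1, ir2))
}

fn main() {
    // Both payloads valid and egress ready: every transfer happens.
    assert_eq!(join_cycle(Some(1), Some(2), true), (Some((1, 2)), (true, true)));
    // Second payload missing: egress is invalid and interface 1 is not ready.
    assert_eq!(join_cycle(Some(1), None, true), (None, (false, true)));
}
```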

merge

This combinator selects one of the ingress interfaces and delivers its payload to the egress payload, while passing the inner of the egress resolver untouched to the ingress interfaces. We demonstrate the merge combinator with 2 ingress interfaces (Vr<u32>, Vr<u32>).

Note that in our code base, we implement cmerge with ingress interfaces [I<AndH<H>, D>; N]. We can consider H as ValidH<P> and N as 2.

impl<P: Copy> MergeExt for (Vr<P>, Vr<P>) {
    type E = Vr<P>;

    fn merge(self) -> Vr<P> {
        self.fsm::<Vr<P>, ()>(|(ip1, ip2), er, _| {
            let ep = ip1.or(ip2);
            let ir1 = er;
            let ir2 = if ip1.is_none() { er } else { Ready::invalid(()) };
            (ep, (ir1, ir2), ())
        })
    }
}
  • This combinator selects the first ingress interface whose payload is valid, when the egress interface is ready to receive the payload.

The example cycle-level behavior of merge is as follows:

  • Cycle 1: first ingress and egress transfer happens.
  • Cycle 3: first ingress and egress transfer happens.
  • Cycle 5: second ingress and egress transfer happens.
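
The priority given to the first ingress interface can be modeled in plain Rust (a hypothetical sketch; merge_cycle is an invented name):

```rust
// Hypothetical software model of one `merge` cycle for (Vr<P>, Vr<P>).
fn merge_cycle(ip1: Option<u32>, ip2: Option<u32>, er: bool) -> (Option<u32>, (bool, bool)) {
    let ep = ip1.or(ip2); // the first valid payload wins
    let ir1 = er;
    let ir2 = if ip1.is_none() { er } else { false };
    (ep, (ir1, ir2))
}

fn main() {
    // Both valid: the first ingress interface is selected.
    assert_eq!(merge_cycle(Some(1), Some(2), true), (Some(1), (true, false)));
    // Only the second valid: it is selected instead.
    assert_eq!(merge_cycle(None, Some(2), true), (Some(2), (true, true)));
}
```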

Register slices

These combinators maintain state in registers and can delay the payload by one or multiple cycles before sending it out.

We demonstrate the two most representative combinators: reg_fwd and fifo.

reg_fwd

We can use reg_fwd to make a pipeline, and reduce the critical path.

Let's assume we have a circuit with ingress interface Vr<P>:

Transforming the ingress payload from P to EP2 must happen within a single cycle, so the critical path becomes cpath(f1) + cpath(f2). To resolve this, we add a reg_fwd combinator between the two map combinators.

Then the critical path is reduced to max(cpath(f1), cpath(f2)).

impl<P: Copy> Vr<P> {
    fn reg_fwd(self) -> Vr<P> {
        self.fsm::<Vr<P>, HOption<P>>(None, |ip, er, s| {
            let ep = s;
            let et = ep.is_some_and(|p| er.ready);

            let ir = Ready::new(s.is_none() || et, (er.inner, s));
            let it = ip.is_some_and(|p| ir.ready);

            let s_next = if it {
                ip
            } else if et {
                None
            } else {
                s
            };

            (ep, ir, s_next)
        })
    }
}
  • The current state is the valid egress payload.
  • The ingress interface is ready to receive a valid payload whenever the current register is empty or the egress transfer happens.
  • If the ingress transfer happens, then the register stores the new valid payload as the next state.
  • If the egress transfer happens, but the ingress transfer does not happen, then the register will be empty in the next cycle.
  • If neither ingress transfer nor egress transfer happens, then the next state stays the same as the current state.
  • The only difference between pipe being true and false is that when pipe is false, the ingress transfer happens only when the current register is empty, delaying one cycle compared to the pipelined version.

Let's assume the ingress interface type is Vr<u32>.

The example cycle-level behavior of reg_fwd is as follows:

  • Cycle 0, 1, 3, 5: Ingress transfer happens.
  • Cycle 1, 2, 5: Egress transfer happens.
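
The one-cycle delay is visible in a plain-Rust model of the reg_fwd state transition (a hypothetical sketch; reg_fwd_cycle is an invented name, and the register state is a std Option):

```rust
// Hypothetical software model of one `reg_fwd` cycle: returns
// (egress payload, ingress ready, next state).
fn reg_fwd_cycle(ip: Option<u32>, er_ready: bool, s: Option<u32>) -> (Option<u32>, bool, Option<u32>) {
    let ep = s;
    let et = ep.is_some() && er_ready; // egress transfer
    let ir = s.is_none() || et;        // ready if empty or draining
    let it = ip.is_some() && ir;       // ingress transfer
    let s_next = if it { ip } else if et { None } else { s };
    (ep, ir, s_next)
}

fn main() {
    // Cycle 0: register empty, ingress transfer fills it; no egress payload yet.
    let (ep, ir, s1) = reg_fwd_cycle(Some(42), true, None);
    assert_eq!((ep, ir), (None, true));
    // Cycle 1: the stored payload comes out one cycle later, emptying the register.
    let (ep, _, s2) = reg_fwd_cycle(None, true, s1);
    assert_eq!(ep, Some(42));
    assert_eq!(s2, None);
}
```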

fifo

This combinator is a pipelined FIFO queue; it can accept a new element every cycle if the queue is not full.

impl<P: Copy> Vr<P> {
    fn fifo<const N: usize>(self) -> Vr<P> {
        self.fsm::<Vr<P>, FifoS<P, N>>(FifoS::default(), |ip, er, s| {
            let FifoS { inner, raddr, waddr, len } = s;

            let empty = len == U::from(0);
            let full = len == U::from(N);

            let enq = ip.is_some() && !full;
            let deq = er.ready && !empty;

            let ep = if empty { None } else { Some(inner[raddr]) };
            let ir = Ready::new(!full, ());

            let inner_next = if enq { inner.set(waddr, ip.unwrap()) } else { inner };
            let len_next = (len + U::from(enq).resize() - U::from(deq).resize()).resize();
            let raddr_next = if deq { wrapping_inc::<{ clog2(N) }>(raddr, N.into_u()) } else { raddr };
            let waddr_next = if enq { wrapping_inc::<{ clog2(N) }>(waddr, N.into_u()) } else { waddr };

            let s_next = FifoS { inner: inner_next, raddr: raddr_next, waddr: waddr_next, len: len_next };

            (ep, ir, s_next)
        })
    }
}

Let's assume the ingress interface type is Vr<u32> with capacity is 3.

The example cycle-level behavior of fifo is as follows:

  • The ingress ready signal is true as long as the queue is not full.
  • It does not give maximum throughput because both ingress transfer and egress transfer cannot happen simultaneously when the FIFO is full.
  • Cycle 0, 1, 2, 3, 6: Ingress transfer happens.
  • Cycle 1, 5, 6, 7: Egress transfer happens.
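
The FIFO state transition can be modeled in plain Rust for a fixed capacity of 3 (a hypothetical sketch; FifoS here is a plain struct with std arrays, and fifo_cycle is an invented name):

```rust
// Hypothetical software model of the `fifo` combinator's state transition
// for N = 3.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct FifoS {
    inner: [u32; 3],
    raddr: usize,
    waddr: usize,
    len: usize,
}

fn fifo_cycle(ip: Option<u32>, er_ready: bool, s: FifoS) -> (Option<u32>, bool, FifoS) {
    let empty = s.len == 0;
    let full = s.len == 3;
    let enq = ip.is_some() && !full;
    let deq = er_ready && !empty;
    let ep = if empty { None } else { Some(s.inner[s.raddr]) };
    let ir = !full;
    let mut next = s;
    if enq {
        next.inner[s.waddr] = ip.unwrap();
        next.waddr = (s.waddr + 1) % 3;
    }
    if deq {
        next.raddr = (s.raddr + 1) % 3;
    }
    next.len = s.len + enq as usize - deq as usize;
    (ep, ir, next)
}

fn main() {
    // Cycle 0: enqueue 1 into the empty queue; nothing to dequeue yet.
    let (ep, ir, s1) = fifo_cycle(Some(1), true, FifoS::default());
    assert_eq!((ep, ir), (None, true));
    // Cycle 1: the stored element is visible at the egress.
    let (ep, _, _) = fifo_cycle(Some(2), true, s1);
    assert_eq!(ep, Some(1));
}
```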

Source and sink

These combinators have only either ingress interface or egress interface.

We demonstrate the two most representative combinators: source and sink.

source

This combinator immediately forwards incoming resolver data to the payload. The source combinator only has an egress interface.

impl<P: Copy> I<VrH<P, P>, { Dep::Demanding }> {
    fn source() -> I<VrH<P, P>, { Dep::Demanding }> {
        ().fsm::<I<VrH<P, P>, { Dep::Demanding }>, ()>((), |_, er, _| {
            let ep = if er.ready { Some(er.inner) } else { None };
            (ep, (), ())
        })
    }
}
  • The egress resolver type is the same as the egress payload type P.
  • The egress transfer happens as long as the egress resolver ready signal is true.
  • It transfers the resolver to the payload within the same clock cycle in which the egress transfer happens.

When P is u32, the example cycle-level behavior of source is as follows:

  • Cycle 0, 1, 2, 4, 5: Egress transfer happens.
  • Cycle 3: Egress transfer does not happen because egress ready signal is false.

sink

This combinator maintains a state and calculates the ingress resolver with f, which takes the current state and ingress payload as inputs. The sink combinator only has the ingress interface.

impl<P: Copy> I<VrH<P, HOption<P>>, { Dep::Helpful }> {
    fn sink(self) {
        self.fsm::<(), ()>((), |ip, _, _| {
            let ir = Ready::valid(ip);
            ((), ir, ())
        })
    }
}

When P is u32, the example cycle-level behavior of sink is as follows:

  • The ingress resolver is calculated every clock cycle.

FSM

FSM combinators run a finite state machine described by the closure given to the combinator. They have an internal state for the FSM, and the closure describes the state transitions and how the combinator should behave in each state.

The "FSM mapping" combinators such as fsm_map or fsm_filter_map are very similar to their regular mapping combinator counterparts. The difference is that they have an internal state that can be controlled by the given closure.

The other two remaining FSM combinators are fsm_ingress and fsm_egress. Since they have more complex behavior, we demonstrate their usage here.

fsm_ingress

It allows you to accumulate successive ingress payloads into an internal FSM state, then output the resulting state once it is ready. After the resulting state is transferred, the FSM is reset and starts accumulating again.

impl<P: Copy> Vr<P> {
    fn fsm_ingress<S: Copy>(self, init: S, f: impl Fn(P, S) -> (S, bool)) -> Vr<S> {
        self.fsm::<Vr<S>, (S, bool)>((init, false), |ip, er, (s, done)| {
            let ir = Ready::new(!done, er.inner);

            let it = ip.is_some() && !done;
            let et = er.ready && done;

            let ep = if done { Some(s) } else { None };

            let s_next = if it {
                f(ip.unwrap(), s)
            } else if et {
                (init, false)
            } else {
                (s, done)
            };

            (ep, ir, s_next)
        })
    }
}

It takes the initial state and the combinational logic which calculates the next state and whether the FSM is done.

For example, let's consider the following sum_until_10 function.

It accumulates the input numbers until the sum becomes greater than or equal to 10, and outputs the result.

fn sum_until_10(input: Vr<u32>) -> Vr<u32> {
    input
        .fsm_ingress(0, |ip, sum| {
            let sum_next = sum + ip;
            let done = sum_next >= 10;

            (sum_next, done)
        })
}

The example cycle-level behavior of fsm_ingress is as follows:

  • At cycle 0-1 and cycle 3-4, ingress transfer happens (accumulates the input).
  • At cycle 2 and cycle 5, egress transfer happens (outputs the result).
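
The sum_until_10 behavior can be replayed with a plain-Rust model of the fsm_ingress transition (a hypothetical sketch; fsm_ingress_cycle is an invented name, and the state is the plain tuple (sum, done)):

```rust
// Hypothetical software model of one `fsm_ingress` cycle specialized to
// `sum_until_10`: returns (egress payload, ingress ready, next state).
fn fsm_ingress_cycle(ip: Option<u32>, er_ready: bool, (s, done): (u32, bool)) -> (Option<u32>, bool, (u32, bool)) {
    let ir = !done;
    let it = ip.is_some() && !done;
    let et = er_ready && done;
    let ep = if done { Some(s) } else { None };
    let s_next = if it {
        let sum_next = s + ip.unwrap();
        (sum_next, sum_next >= 10) // accumulate; done once the sum reaches 10
    } else if et {
        (0, false) // reset after the result is transferred
    } else {
        (s, done)
    };
    (ep, ir, s_next)
}

fn main() {
    let mut st = (0, false);
    // Feed 6 and 7: after two ingress transfers the sum reaches 13 (>= 10).
    for i in [6, 7] {
        let (_, _, next) = fsm_ingress_cycle(Some(i), true, st);
        st = next;
    }
    // Next cycle: accumulation is done, so the result 13 is output and the
    // ingress side is not ready.
    let (ep, ir, st2) = fsm_ingress_cycle(Some(99), true, st);
    assert_eq!((ep, ir), (Some(13), false));
    // After the egress transfer the FSM resets.
    assert_eq!(st2, (0, false));
}
```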

fsm_egress

It runs an FSM for each transferred ingress payload, allowing you to process an ingress payload using multiple FSM states. Only after the FSM is finished, the combinator can accept a new ingress payload.

impl<P: Copy> Vr<P> {
    fn fsm_egress<EP: Copy, S: Copy>(self, init: S, flow: bool, f: impl Fn(P, S) -> (EP, S, bool)) -> Vr<EP> {
        self.fsm::<Vr<EP>, (HOption<P>, S)>((None, init), |ip, er, (sp, s)| {
            let (ep, s_next, is_last) = if let Some(p) = sp {
                let (ep, s_next, is_last) = f(p, s);
                (Some(ep), s_next, is_last)
            } else if flow && ip.is_some() && sp.is_none() {
                let (ep, s_next, is_last) = f(ip.unwrap(), s);
                (Some(ep), s_next, is_last)
            } else {
                (None, s, false)
            };

            let et = ep.is_some() && er.ready;
            let ir = Ready::new(sp.is_none() || (et && is_last), ());
            let it = ip.is_some() && ir.ready;

            let (sp_next, s_next) = if flow && it && et && sp.is_none() {
                if is_last {
                    (None, init)
                } else {
                    (ip, s_next)
                }
            } else if it {
                (ip, init)
            } else if et && is_last {
                (None, init)
            } else if et {
                (sp, s_next)
            } else {
                (sp, s)
            };

            (ep, ir, (sp_next, s_next))
        })
    }
}
  • self is the ingress interface Vr<P>.
  • init is the initial state for the FSM.
  • flow determines whether the FSM starts immediately or from the next cycle of an ingress transfer.
  • f takes the current saved ingress payload and the current FSM state. It returns an egress payload, the next state, and whether this is the last state for the FSM.

For example, let's consider the following consecutive_3 example.

It outputs 3 consecutive numbers starting from each input number.

fn consecutive_3(input: Vr<u32>) -> Vr<u32> {
    input.fsm_egress(0, true, |p, count| {
        let ep = p + count;
        let count_next = count + 1;
        let is_last = count == 2;
        (ep, count_next, is_last)
    })
}

The example cycle-level behavior of fsm_egress is as follows:

  • At T0, an ingress transfer happens and an FSM is started immediately (since flow is true).
    • The ingress payload Some(0) is saved to sp, and will be available from the next cycle.
  • From T0 to T2, the FSM is running with the saved payload Some(0), and it outputs 0, 1, 2.
  • At T2, f returns true for is_last, signaling that this is the last state for this FSM. This sets the ingress ready signal to true to accept a new ingress payload for the next FSM.
    • Since the ingress payload Some(1) is already available, an ingress transfer happens and it is saved.
  • At T3, the FSM state is reset and a new FSM is started.
  • From T3 to T5, the FSM is running with the saved payload Some(1), and it outputs 1, 2, 3.
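
Ignoring cycle-level stalls, the end-to-end behavior of consecutive_3 can be checked with a plain-Rust behavioral model (a hypothetical sketch that assumes the egress side is always ready, so each input simply expands into three outputs; consecutive_3_model is an invented name):

```rust
// Hypothetical behavioral model of `consecutive_3`: each input `p` expands
// into the three outputs p, p + 1, p + 2.
fn consecutive_3_model(inputs: &[u32]) -> Vec<u32> {
    let mut out = Vec::new();
    for &p in inputs {
        for count in 0..3 {
            out.push(p + count); // f(p, count), with is_last at count == 2
        }
    }
    out
}

fn main() {
    // Matches the trace above: 0, 1, 2 for input 0, then 1, 2, 3 for input 1.
    assert_eq!(consecutive_3_model(&[0, 1]), vec![0, 1, 2, 1, 2, 3]);
}
```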

Implementing Your Own Combinators

The HazardFlow HDL standard library provides some primitive combinators; however, we sometimes need to implement custom combinators for specific logic. In fact, the HazardFlow HDL standard library is implemented in the same way as what we introduce in this section.

We introduced the masked_merge combinator specification in the Tutorial section, but we did not give out the concrete implementation.

The masked_merge combinator is a custom combinator which is not provided by the HazardFlow HDL standard library, thus we need to implement its logic by ourselves.

Implementation

To define a custom combinator for interface type [Vr<P>; N], we need to define a custom trait first.

/// Masked merge trait
trait MaskedMergeExt<P: Copy + Default, const N: usize>: Interface
where [(); clog2(N)]:
{
    /// Hazard type
    type EH: Hazard;

    /// Fair Mux
    fn masked_merge(self) -> I<Self::EH, { Dep::Demanding }>;
}
  • The custom trait's name is MaskedMergeExt.
    • It specifies that the generic payload type P must be constrained by the Copy and Default traits.
    • const N: usize specifies the number of ingress interfaces.
    • MaskedMergeExt needs to extend the Interface trait, since we impl this trait later when implementing our custom combinator.
    • where [(); clog2(N)]: tells the HazardFlow HDL compiler that clog2(N) is a constant.
  • We define the egress hazard as the Hazard associated type, which is a hazard protocol with a given payload, resolver, and ready function.
  • fn masked_merge(self) -> I<Self::EH, { Dep::Demanding }> defines the combinator's name masked_merge and specifies that the egress hazard is EH.

We can define the combinational logic now.

impl<P: Copy + Default, const N: usize> MaskedMergeExt<P, N> for [Vr<P>; N]
where [(); clog2(N)]:
{
    type EH = VrH<(P, U<{ clog2(N) }>), Array<bool, N>>;

    fn masked_merge(self) -> I<Self::EH, { Dep::Demanding }> {
        unsafe {
            self.fsm::<I<VrH<(P, U<{ clog2(N) }>), Array<bool, N>>, { Dep::Demanding }>, ()>((), |ip, er, s| {
                if !er.ready {
                    let ir = Ready::new(false, ()).repeat();
                    let ep = None;
                    return (ep, ir, s);
                }

                let ep_idx = ip.zip(er.inner).find_idx(|(p, selected)| p.is_some() && !selected);
                let ep = if let Some(idx) = ep_idx { Some((ip[idx].unwrap(), idx)) } else { None };

                let ir = Ready::invalid().repeat::<N>().set_cond(ep.is_some(), ep_idx.unwrap(), Ready::valid(()));
                (ep, ir, s)
            })
        }
    }
}
  • We impl the custom trait MaskedMergeExt for the compound interface type [Vr<P>; N].
  • We define the egress hazard as VrH<(P, U<{ clog2(N) }>), Array<bool, N>>
    • The egress interface is a valid-ready interface with valid-ready hazard.
    • U<{ clog2(N) }> is the bit representation of the index of the ingress interfaces.
    • The inner resolver is Array<bool, N>, which indicates which ingress interfaces' payloads are already present in the current queue.
  • The unsafe block is necessary when implementing your own custom combinator, since the fsm function must be called within an unsafe block.
  • The fsm function specifies that the egress interface type is I<VrH<(P, U<{ clog2(N) }>), Array<bool, N>>, { Dep::Demanding }>.
  • The fsm function takes two inputs and returns the egress interface.
    • The initial state of the combinator.
    • An anonymous function takes three inputs.
      • The ingress payload ip. The type is Array<HOption<P>, N>, which is all the ingress interfaces' payload.
      • The egress resolver er. The type is Ready<Array<bool, N>>, which is a ready signal with an array indicating which ingress interfaces are present in the current queue.
      • The current state s. The type is simply (), since this combinator has no state.
    • The anonymous function returns a tuple including:
      • The egress payload ep. The type is HOption<(P, U<{ clog2(N) }>)>, which is the actual data we want to transfer together with the index of the selected ingress interface. Note that the bit representation of an unsigned integer is an array of bool.
      • The ingress resolver ir. The type is Array<Ready<()>, N>, which is an array of ready signals with size N: a collection of the ready signals for each ingress interface.
  • If the egress resolver ready signal is false, meaning the egress interface is not ready to receive the payload, we do not transfer any payload.
    • We set the ingress resolvers' ready signals to false.
    • We set the egress payload to None.
  • If the egress resolver ready signal is true, the egress interface is ready to receive the payload.
    • We find the first index of the ingress interfaces whose payload is Some(P) and also not in the current queue.
      • zip function zips two arrays and returns a new array whose element is a tuple of both elements from the two arrays.
      • find_idx finds the index of first element that satisfies given condition.
    • We set the egress payload as a tuple containing the selected ingress interface's actual payload and its index.
    • We set the selected ingress resolver ready signal as true and the rest of the ingress interfaces' ready signal as false.
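
The selection logic can be sketched in plain Rust for N = 2 (a hypothetical model; masked_merge_cycle is an invented name, and the inner resolver is modeled as a plain bool array):

```rust
// Hypothetical software model of one `masked_merge` cycle with N = 2.
// `mask[i]` is true if ingress interface `i` is already in the queue.
fn masked_merge_cycle(
    ip: [Option<u32>; 2],
    er_ready: bool,
    mask: [bool; 2],
) -> (Option<(u32, usize)>, [bool; 2]) {
    if !er_ready {
        // Egress not ready: no payload, all ingress ready signals false.
        return (None, [false; 2]);
    }
    // Find the first valid payload whose index is not masked.
    match (0..2usize).find(|&i| ip[i].is_some() && !mask[i]) {
        Some(i) => {
            let mut ir = [false; 2];
            ir[i] = true; // only the selected interface is ready
            (Some((ip[i].unwrap(), i)), ir)
        }
        None => (None, [false; 2]),
    }
}

fn main() {
    // Interface 0 is masked out, so interface 1 is selected.
    assert_eq!(
        masked_merge_cycle([Some(3), Some(4)], true, [true, false]),
        (Some((4, 1)), [false, true])
    );
}
```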

Dependency Types

We defined the "dependency type" based on the dependency from the backward signal to the forward signal. The concept was inspired by the BaseJump STL.

Motivation: Combinational Loop

The following circuit is constructed by connecting source, map_resolver, and sink combinators.

fn m() {
    I::<VrH<u32, u32>>::source()
        .map_resolver::<HOption<u32>>(|p: Ready<HOption<u32>>| p.inner.unwrap_or(0))
        .sink()
}

In this circuit, we can consider the dependency between the f2 and b2 signals from two perspectives:

  • Since the source combinator forwards the backward signals to the forward signals and the map_resolver combinator preserves the dependencies between its forward and backward signals, there is a dependency b2 -> b1 -> f1 -> f2.
  • Since the sink combinator forwards the forward signals to the backward signals, there is a dependency f2 -> b2.

Thus, a cyclic dependency exists between f2 and b2. This is commonly referred to as a combinational loop, which makes the circuit never stabilize and causes the simulator to hang. A combinational loop occurs when the output of some combinational logic is fed back into the input of that same combinational logic without any intervening register.

Dependency Types

To prevent combinational loops in circuit design, we use a type system called "Dependency Type".

We categorize interfaces into two kinds of types: Helpful and Demanding.

  • Helpful: forward signals do not depend on backward signals.
  • Demanding: forward signals depend on backward signals, and they satisfy the condition that if the payload is Some, the ready condition is true.

Based on the above definition, in the motivational circuit, both the interface between source and map_resolver and the interface between map_resolver and sink are of the Demanding type.

NOTE: In our type system, we do not consider the case where the forward signals depend on the backward signals, the payload is Some, but the ready condition is false. This case is guaranteed not to occur when the circuit is designed only with the combinators defined in our standard combinator library.

In HazardFlow, we defined the dependency type as an enum Dep:

/// Dependency type of a hazard interface.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, ConstParamTy)]
enum Dep {
    /// The payload (`Fwd`) does not depend on the resolver (`Bwd`).
    Helpful = 0,

    /// The payload (`Fwd`) depends on the resolver (`Bwd`), and they satisfy the condition that if the payload is
    /// `Some`, `Hazard::ready(p, r)` is true.
    ///
    /// It is a bug to make the payload depend on the resolver but break the condition.
    Demanding = 1,
}

and we annotated the dependency type to the hazard interface.

/// Hazard interface.
#[derive(Debug)]
struct I<H: Hazard, const D: Dep>;

The benefit of using dependency types is that we can restrict the usage of combinators that introduce a combinational loop, which leads to more productive circuit design. We will explain more with the example of interface combinators.

Examples: Types for Interface Combinators

Let's look into the IO type for the interface combinators in the motivational circuit.

source

impl<P: Copy> I<VrH<P, P>, { Dep::Demanding }> {
    fn source() -> I<VrH<P, P>, { Dep::Demanding }> {
        ().fsm::<I<VrH<P, P>, { Dep::Demanding }>, ()>((), |_, er, _| {
            let ep = if er.ready { Some(er.inner) } else { None };
            (ep, (), ())
        })
    }
}

Since the forward signals (ep) depend on the backward signals (er), its egress interface has a Demanding type. Therefore, it returns an I<VrH<P, P>, { Dep::Demanding }>.

map_resolver

impl<P: Copy, R: Copy, const D: Dep> I<VrH<P, R>, D> {
    fn map_resolver<ER: Copy>(self, f: impl Fn(Ready<ER>) -> R) -> I<VrH<P, ER>, D> {
        self.fsm::<I<VrH<P, ER>, D>, ()>(|ip, er, _| {
            let ep = ip;
            let ir = Ready::new(er.ready, f(er));
            (ep, ir, ())
        })
    }
}

The egress forward signals (ep) depend on the ingress forward signals (ip), while ingress backward signals (ir) depend on the egress backward signals (er). Therefore, if the ingress interface has a Helpful type (ir -/-> ip), then the egress interface will also have a Helpful type (er -> ir -/-> ip -> ep). Similarly, if the ingress interface has a Demanding type, then the egress interface will also have a Demanding type. Consequently, it takes an I<VrH<P, R>, D> and returns an I<VrH<P, ER>, D>, where D can be any type.

sink

impl<P: Copy> I<VrH<P, HOption<P>>, { Dep::Helpful }> {
    fn sink(self) {
        self.fsm::<(), ()>((), |ip, _, _| {
            let ir = Ready::valid(ip);
            ((), ir, ())
        })
    }
}

Since the backward signals (ir) depend on the forward signals (ip), a combinational loop occurs when the ingress interface has a Demanding type (ir -> ip -> ir). To prevent this, we only allow the ingress interface to have a Helpful type. Therefore, it takes only I<VrH<P, HOption<P>>, { Dep::Helpful }>.

Now let's apply the combinator types introduced above to the motivational circuit.

The egress interface of the source combinator is I<VrH<u32, u32>, { Dep::Demanding }>, and the egress interface of the map_resolver combinator is I<VrH<u32, HOption<u32>>, { Dep::Demanding }>. However, since the sink combinator only takes an ingress interface of Helpful type, we cannot apply the sink combinator to the egress interface of map_resolver. This prevents the occurrence of combinational loops. To apply the sink combinator, we need to change the type to Helpful by using combinators like reg_fwd.

Limitations

Currently, a limitation exists in that dependency types cannot capture all types of combinational loops. This is because dependency types only check for intra-interface dependencies and do not consider inter-interface dependencies. The most common example of this is the fork-join pattern.

In the circuit described above, a combinational loop exists with f1 -> b2 -> f1 and f2 -> b1 -> f2. However, the current type system cannot detect such loops. For the lfork combinator, since there is no dependency from b1 -> f1 or b2 -> f2, if the ingress interface type is Helpful, then both egress interface types will also be Helpful.

And for the join combinator, it is designed to take two Helpful ingress interfaces (let's call them i1 and i2) and return a Helpful egress interface. However, a combinational loop occurs if there is a dependency from the backward signals of i1 to the forward signals of i2, or from the backward signals of i2 to the forward signals of i1.

Since the current type system cannot represent inter-interface dependencies, this remains as a limitation. For now, we expect that these types of combinational loops will be detected by synthesis tools.

CPU Core (5-Stage Pipelined)

Our baseline CPU implements the RV32I+Zicsr ISA using a 5-stage pipelined design. It is based on the riscv-sodor processor developed by Berkeley, with slight modifications.

Overview: Pipelined Design

The 5-stage pipelined CPU core would ideally be decomposed as follows:

At a high level, each stage performs the following tasks:

  • Fetch: Computes the next PC and fetches the instruction bytecode from memory.

  • Decode: Decodes the bytecode and reads the value of source registers.

  • Execute: Performs arithmetic and bitwise operations based on the instruction.

  • Memory: Accesses memory for load/store instructions.

  • Writeback: Writes the result back to the destination register.

Example: lw instruction

Let's consider the following program, where the next PC was computed as 0x8000012c in the fetch stage.

By the previous auipc and addi instructions, the value of x3 becomes 0x80002000, which is the start address of the data section (we omit the details of auipc and addi instructions here).

Disassembly of section .text:

...
80000124:  00002197  auipc  x3, 0x2       # x3 <- 80002124
80000128:  edc18193  addi   x3, x3, -292  # x3 <- 80002000
8000012c:  0040a703  lw     x4, 4(x3)     # x4 <- mem[x3+4]
...

Disassembly of section .data:

...
80002004:  deadbeef
...

Then the lw instruction will be processed in each stage as follows:

  • Fetch: Accesses memory with the next PC 0x8000012c and fetches the bytecode 0x0040a703.

  • Decode:

    • Decodes the bytecode 0x0040a703. It contains the operations in the later stages:

      • In the execute stage, computes the memory access address x3 + 4.
      • In the memory stage, loads data from the memory with the computed address.
      • In the writeback stage, writes the loaded data to the destination register x4.
    • Reads the value of the source register x3 (= 0x80002000).

  • Execute: Computes the memory access address x3 + 4 (= 0x80002004).

  • Memory: Accesses the memory with address 0x80002004 and gets the data 0xdeadbeef.

  • Writeback: Writes the data 0xdeadbeef to the destination register x4.
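
The register and address arithmetic in this walkthrough can be checked directly (a plain-Rust sketch of the three instructions' effects on x3 and the effective address; variable names mirror the register names):

```rust
fn main() {
    let pc: u32 = 0x8000_0124;
    // auipc x3, 0x2: x3 <- pc + (0x2 << 12) = 0x80002124
    let x3 = pc.wrapping_add(0x2 << 12);
    assert_eq!(x3, 0x8000_2124);
    // addi x3, x3, -292: x3 <- 0x80002000 (start of the data section)
    let x3 = x3.wrapping_add((-292i32) as u32);
    assert_eq!(x3, 0x8000_2000);
    // lw x4, 4(x3): effective address computed in the execute stage
    let addr = x3 + 4;
    assert_eq!(addr, 0x8000_2004);
}
```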


It is quite intuitive, but it becomes more complex due to hazards between pipeline stages.

We will look deeper into hazards and their handling in the following subsections.

Hazards

In this section, we will explain the hazards that can occur in our baseline CPU.

The hazards could be classified into: structural, data, and control.

Structural Hazards

A structural hazard occurs when multiple pipeline stages need to access the same hardware resource, such as memory, the register file, or the arithmetic logic unit (ALU), in the same clock cycle. It is resolved by stalling the stage that tries to access the resource later:

Solution: Stall

I1:  lw   x2, 0(x1)   # write to x2
I2:  sub  x4, x3, x2  # read from x2 requires several cycles in the decode stage
I3:  add  x4, x3, x2  # need to wait in the fetch stage

In the example above, I2 must wait several cycles in the decode stage to read the correct value of x2 produced by I1 (details of how I1 and I2 communicate will be explained later; for now, you can ignore I1 in this example).

As a result, a structural hazard occurs at cycle T+2 because I3 is ready to move to the decode stage while I2 still occupies it. To resolve this hazard, the decode stage asks I3 to stall at cycle T+2. This is usually done by turning off the ready bit of the valid-ready protocol.
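The ready-bit mechanism can be modeled in plain Rust (a behavioral sketch only; the `Decode` struct, its fields, and the cycle count are illustrative, not HazardFlow's implementation):

```rust
// The decode stage tracks how many cycles its current instruction still needs,
// and raises its ready bit only when it can accept a new instruction from fetch.
struct Decode {
    inst: Option<&'static str>,
    cycles_left: u32,
}

impl Decode {
    /// The ready bit of the valid-ready protocol toward the fetch stage.
    fn ready(&self) -> bool {
        self.cycles_left == 0
    }
    fn tick(&mut self) {
        if self.cycles_left > 0 {
            self.cycles_left -= 1;
        }
    }
}

fn main() {
    // I2 still needs 2 cycles in decode; I3 must wait in fetch meanwhile.
    let mut decode = Decode { inst: Some("I2"), cycles_left: 2 };
    let mut fetch = Some("I3");
    while !decode.ready() {
        assert!(fetch.is_some()); // ready bit off: I3 stays in the fetch stage
        decode.tick();
    }
    decode.inst = fetch.take(); // ready bit on: I3 moves to decode
    assert_eq!(decode.inst, Some("I3"));
}
```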

Data Hazards

A data hazard occurs when the processing of a pipeline stage depends on the result of a later stage. It is resolved either by stalling the stage until its data dependency becomes available, or by bypassing the necessary data from later stages within the same clock cycle.

For instance, a data hazard due to a read-after-write (RAW) dependency in the CPU core is resolved either by stalling the read instruction in the decode stage or by bypassing the result of the write instruction from a later stage to the read instruction in the decode stage.

Solution (1/2): Bypassing

In the decode stage, we need to read the values of the source registers. When a RAW dependency occurs, we can bypass the values from the later stages to the decode stage.

From the execute stage:

I1:  add  x3, x2, x1  # write to x3
I2:  sub  x5, x4, x3  # read from x3

At cycle T+2, I1 can bypass the value of x3 to I2.

From the memory stage:

I1:  add  x3, x2, x1  # write to x3
I2:  nop
I3:  sub  x5, x4, x3  # read from x3

At cycle T+3, I1 can bypass the value of x3 to I3.

From the writeback stage:

I1:  add  x3, x2, x1  # write to x3
I2:  nop
I3:  nop
I4:  sub  x5, x4, x3  # read from x3

At cycle T+4, I1 can bypass the value of x3 to I4.
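The three bypass cases above share one selection rule: for each source register, the youngest in-flight writer wins (execute over memory over writeback), falling back to the register file when no stage is writing that register. A sketch in plain Rust (the `Bypass` struct and `read_source` signature are illustrative, not HazardFlow's types):

```rust
#[derive(Clone, Copy)]
struct Bypass {
    rd: u32,   // destination register of the in-flight instruction
    data: u32, // its (already computed) result
}

/// Read one source register, preferring bypassed data from the youngest stage.
fn read_source(
    rs: u32,
    from_exe: Option<Bypass>,
    from_mem: Option<Bypass>,
    from_wb: Option<Bypass>,
    rf: &[u32; 32],
) -> u32 {
    for b in [from_exe, from_mem, from_wb].into_iter().flatten() {
        if b.rd == rs && rs != 0 {
            return b.data; // youngest matching stage wins
        }
    }
    rf[rs as usize] // no in-flight writer: use the register file
}

fn main() {
    let mut rf = [0u32; 32];
    rf[3] = 0xdead;
    // The execute-stage result for x3 overrides both the stale register-file
    // value and the older writeback-stage value.
    let exe = Some(Bypass { rd: 3, data: 0xbeef });
    let wb = Some(Bypass { rd: 3, data: 0x1234 });
    assert_eq!(read_source(3, exe, None, wb, &rf), 0xbeef);
    // No in-flight writer for x5: fall back to the register file.
    assert_eq!(read_source(5, exe, None, wb, &rf), rf[5]);
}
```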

Solution (2/2): Stall

When data from a later stage is not yet ready, bypassing is not possible and a stall becomes necessary. In our baseline CPU, there are two sources of stalls: load instructions and CSR instructions.

From the load instruction:

I1:  lw   x2, 0(x1)   # write to x2
I2:  sub  x4, x3, x2  # read from x2

  • At cycle T+2, I2 needs to be stalled in the decode stage because I1 has not yet reached the memory.
  • At cycle T+3, I1 gets the value of x2 from the memory and can now bypass it to I2.
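This load-use stall condition boils down to one predicate in the decode stage: stall when the instruction currently in execute is a load whose destination matches one of our source registers (x0 never matches). A sketch with illustrative names:

```rust
/// Stall the decode stage when the execute stage holds a load whose result is
/// not yet available but is needed by the decoding instruction.
fn load_use_stall(exe_is_load: bool, exe_rd: u32, dec_rs1: u32, dec_rs2: u32) -> bool {
    exe_is_load && exe_rd != 0 && (exe_rd == dec_rs1 || exe_rd == dec_rs2)
}

fn main() {
    // `lw x2, 0(x1)` in execute; `sub x4, x3, x2` in decode: must stall.
    assert!(load_use_stall(true, 2, 3, 2));
    // One cycle later the load is in the memory stage and its data can be
    // bypassed, so the execute stage no longer forces a stall.
    assert!(!load_use_stall(false, 2, 3, 2));
}
```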

From the CSR instruction:

I1:  csrr  x2, mcause    # write to x2
I2:  bgez  x2, 8000003c  # read from x2

In our baseline CPU, the CSR is located in the memory stage.

  • At T+2, I2 needs to be stalled in the decode stage because I1 has not yet reached the CSR.
  • At T+3, I1 gets the value of x2 from the CSR and can now bypass it to I2.

For more details on CSR instructions, please refer to Chapter 2.8 of the RISC-V specification.

Control Hazards

A control hazard occurs when a pipeline stage makes a wrong prediction about which instruction to execute in the next clock cycle. It is resolved, e.g., when the execute stage detects a misprediction, by discarding the mispredicted instructions in the fetch and decode stages and restarting from the correct next instruction.

Solution: Discarding and Restarting

In our baseline CPU, there are two sources of control hazards: branch mispredictions and exceptions.

Branch Misprediction

I1:  beq  x2, x1, target  # mispredicted to not taken
I2:  add  x5, x4, x3      # should be killed
I3:  lw   x5, 0(x4)       # should be killed

target:
I4:  sub  x5, x4, x3      # correct target address

At cycle T+1, the fetch stage speculates that I1's next instruction is I2, so I2 is fetched from the instruction memory. But at cycle T+2, the execute stage determines that I1's next instruction is in fact I4. As such, the mispredicted instructions I2 and I3 are discarded at cycle T+2, and the fetch stage restarts with the correct next instruction I4 at cycle T+3.
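The discard-and-restart step can be sketched behaviorally (plain Rust; `apply_redirect` and its arguments are illustrative names, not the actual implementation):

```rust
/// When the execute stage provides a redirect, kill the younger instructions
/// sitting in fetch and decode and restart the PC from the redirect target.
fn apply_redirect(
    redirect: Option<u32>,
    fetch: &mut Option<&'static str>,
    decode: &mut Option<&'static str>,
    pc: &mut u32,
) {
    if let Some(target) = redirect {
        *fetch = None;  // discard the mispredicted instruction in fetch
        *decode = None; // discard the mispredicted instruction in decode
        *pc = target;   // restart from the correct next instruction
    }
}

fn main() {
    let (mut fetch, mut decode, mut pc) = (Some("I3"), Some("I2"), 0x8000_000cu32);
    // Execute resolves `beq` as taken: redirect to the branch target (I4).
    apply_redirect(Some(0x8000_0020), &mut fetch, &mut decode, &mut pc);
    assert!(fetch.is_none() && decode.is_none());
    assert_eq!(pc, 0x8000_0020);
}
```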

Exception

I1:  unimp            # illegal instruction, redirect to trap vector
I2:  add  x4, x3, x2  # should be killed
I3:  sub  x4, x3, x2  # should be killed
I4:  lw   x4, 0(x3)   # should be killed

trap_vector:
I5:  csrr x5, mcause  # trap handling logic

The illegal instruction I1 raises an exception, so execution must be redirected to the trap vector to handle it. At cycle T+3, the illegal instruction I1 reaches the CSR in the memory stage, which returns the trap vector address. As such, the mispredicted instructions I2, I3, and I4 are discarded at cycle T+3, and the fetch stage restarts with the correct next instruction I5 at cycle T+4.

Implementation

With the existence of hazards, the 5-stage pipelined CPU core can be decomposed as follows:

with the following implementation (riscv32_5stage.rs):

const START_ADDR: u32 = 0x80000000;

fn core(
    imem: impl FnOnce(Vr<MemReq>) -> Vr<MemRespWithAddr>,
    dmem: impl FnOnce(Vr<MemReq>) -> Vr<MemRespWithAddr>,
) {
    fetch::<START_ADDR>(imem)
        .comb(decode)
        .comb(exe)
        .comb(move |i| mem(i, dmem))
        .comb(wb)
}
  • imem and dmem are modules for instruction memory and data memory, respectively.
  • We chain the 5 sub-modules fetch, decode, exe, mem, and wb by using the comb method.
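Conceptually, comb is module composition: each stage is a function from its ingress interface to its egress interface, and comb feeds one stage's egress into the next stage's ingress. The following is only a plain-value analogy (this toy Comb trait is not the HazardFlow API, and the closures stand in for real stage modules):

```rust
// A toy `comb`: apply the next "stage" (a function) to the current value, so
// chaining reads top-to-bottom like the pipeline itself.
trait Comb: Sized {
    fn comb<O>(self, f: impl FnOnce(Self) -> O) -> O {
        f(self)
    }
}
impl<T> Comb for T {}

fn main() {
    // Five toy "stages" chained like fetch -> decode -> exe -> mem -> wb.
    let out = 1u32
        .comb(|x| x + 1) // "fetch"
        .comb(|x| x * 2) // "decode"
        .comb(|x| x + 3) // "exe"
        .comb(|x| x * 4) // "mem"
        .comb(|x| x - 5); // "wb"
    assert_eq!(out, 23);
}
```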

In the following subsections, we will explain the implementation details for each stage.

Fetch stage

The fetch stage mainly does the following:

  1. Calculates the next PC for the upcoming instruction.
  2. Accesses IMEM to retrieve the instruction bytecode.

It can be decomposed into combinators as follows (code):

Input and Output

The IO interface type of the fetch stage is as follows:

Ingress

Since this is the first stage, it does not take any ingress interface.

Egress

It returns an egress interface with type I<VrH<FetEP, DecR>, { Dep::Demanding }>.

Each of FetEP and DecR is defined as a struct with the following fields:

FetEP (in fetch.rs):

  • imem_resp: IMEM response which contains the address (PC) and the data (inst bytecode).

DecR (in decode.rs):

  • redirect: Represents the redirection PC when the control hazard occurs.

Behavior

Each combinator does the following:

Computes the next PC (M0-M2)

M0 (source_drop):

  • Receives the IMEM response and redirection PC as resolver from the later modules:
    • The IMEM response comes from M2. It contains the current PC and inst bytecode.
    • The redirection PC comes from the execute stage and the memory stage.
  • Forwards the IMEM response and redirection PC into the payload to compute the next PC.

M1 (filter_map):

  • Computes the next PC based on the incoming payload:
    • If a redirection PC is provided, jump to it.
    • Otherwise, proceed to the next sequential address (current PC + 4).
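The next-PC computation in M1 reduces to one small function. A sketch (the `next_pc` name and signature are illustrative):

```rust
/// Jump to the redirection PC if one is provided; otherwise fall through to
/// the next sequential address (current PC + 4).
fn next_pc(redirect: Option<u32>, cur_pc: u32) -> u32 {
    match redirect {
        Some(target) => target,
        None => cur_pc.wrapping_add(4),
    }
}

fn main() {
    // Sequential fetch: 0x80000128 -> 0x8000012c.
    assert_eq!(next_pc(None, 0x8000_0128), 0x8000_012c);
    // Control hazard: the redirection PC wins.
    assert_eq!(next_pc(Some(0x8000_01f0), 0x8000_0128), 0x8000_01f0);
}
```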

M2 (reg_fwd_with_init):

  • Creates a pipelined stage before accessing IMEM by storing the next PC in a register.
  • When the circuit is reset, it is initialized with the designated start address (START_ADDR).

Accesses IMEM (M3)

M3 (map + comb + map):

  • Constructs the IMEM request with map combinator.
  • Accesses the external IMEM module to fetch the instruction bytecode with comb combinator.
    • We use an asynchronous memory, so it provides the response in the same cycle.
    • We use the attach_resolver module combinator to attach an additional resolver to the IMEM.
  • Deconstructs the IMEM response with map combinator.

Discards on misprediction (M4-M5)

M4 (map_resolver_drop_with_p):

  • Attaches the IMEM response to the resolver signal for the next PC calculation.
  • Turns on the ready signal when a control hazard occurs to extract the payload from M2.
    • This allows discarding the invalid PC stored in M2.

M5 (filter_map_drop_with_r_inner):

  • Filters out the payload when the redirection happens.

Decode stage

The decode stage mainly does the following:

  1. Decodes the instruction.
  2. Reads the value of source registers.

It can be decomposed into combinators as follows (code):

Input and Output

The IO interface type of the decode stage is as follows:

Ingress

It takes an ingress interface with type I<VrH<FetEP, DecR>, { Dep::Demanding }>.

You can check the explanation of FetEP and DecR here.

Egress

It returns an egress interface with type I<VrH<DecEP, ExeR>, { Dep::Demanding }>.

Each of DecEP and ExeR is defined as a struct with the following fields:

DecEP (in decode.rs):

  • wb_info: Writeback information which contains the writeback address and selector.
  • br_info: Branch information which contains the branch type and the target address' base and offset.
  • alu_input: ALU input.
  • mem_info: Memory information.
  • csr_info: CSR information.
  • is_illegal: Indicates whether the instruction is illegal.
  • pc: PC.
  • debug_inst: Instruction (for debugging purpose).

ExeR (in exe.rs):

  • bypass_from_exe: Bypassed data from the execute stage.
  • bypass_from_mem: Bypassed data from the memory stage.
  • bypass_from_wb: Bypassed data from the writeback stage.
  • stall: Destination register address of load or CSR instruction in the execute stage.
  • redirect: Redirection PC.
  • rf: Register file.

Behavior

Each combinator does the following:

M0 (map_resolver_inner):

  • Constructs the ingress resolver of the decode stage.

M1 (reg_fwd):

  • Creates a pipelined stage before decoding the instruction.
  • Sends a ready signal which indicates it will be free in the next cycle.

M2 (map):

  • Decodes the instruction.
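The decoding in M2 starts from the major opcode (the low 7 bits of the bytecode), per the RV32I base encoding. A standalone sketch covering a few instruction classes (the `InstClass` enum is our own illustration, not the actual decoded-instruction type):

```rust
#[derive(Debug, PartialEq)]
enum InstClass {
    Load,   // lw, lb, ...
    Store,  // sw, sb, ...
    OpImm,  // addi, andi, ...
    Op,     // add, sub, ...
    Branch, // beq, bne, ...
    Other,  // everything not classified in this sketch
}

/// Classify an instruction by its major opcode (inst[6:0]).
fn classify(inst: u32) -> InstClass {
    match inst & 0x7f {
        0b0000011 => InstClass::Load,
        0b0100011 => InstClass::Store,
        0b0010011 => InstClass::OpImm,
        0b0110011 => InstClass::Op,
        0b1100011 => InstClass::Branch,
        _ => InstClass::Other,
    }
}

fn main() {
    assert_eq!(classify(0xedc18193), InstClass::OpImm); // addi x3, x3, -292
    assert_eq!(classify(0x00002197), InstClass::Other); // auipc (not covered here)
}
```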

M3 (map_resolver_block):

  • Stalls until the values of the source registers are available.

M4 (filter_map_drop_with_r):

  • Reads the value of source registers and attaches them to the payload.
  • Filters out the payload when the redirection happens.

Execute stage

The execute stage mainly does the following:

  1. Executes the ALU.
  2. Resolves the branch misprediction.

It can be decomposed into combinators as follows (code):

Input and Output

The IO interface type of the execute stage is as follows:

Ingress

It takes an ingress interface with type I<VrH<DecEP, ExeR>, { Dep::Demanding }>.

You can check the explanation of DecEP and ExeR here.

Egress

It returns an egress interface with type I<VrH<ExeEP, MemR>, { Dep::Demanding }>.

Each of ExeEP and MemR is defined as a struct with the following fields:

ExeEP (in exe.rs):

  • wb_info: Writeback information which contains the writeback address and selector.
  • alu_out: ALU output.
  • mem_info: Memory information.
  • csr_info: CSR information.
  • is_illegal: Indicates whether the instruction is illegal.
  • pc: PC.
  • debug_inst: Instruction (for debugging purpose).

MemR (in mem.rs):

  • bypass_from_mem: Bypassed data from the memory stage.
  • bypass_from_wb: Bypassed data from the writeback stage.
  • redirect: Redirection PC.
  • rf: Register file.

Behavior

Each combinator does the following:

M0 (map_resolver_inner):

  • Resolves the branch misprediction based on the branch type and ALU output.
  • Constructs the ingress resolver of the execute stage.
    • Attaches the bypassed data, stall, and redirection PC for resolving data hazards.
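The branch resolution in M0 decides taken/not-taken from the branch type and the operand comparison. A standalone sketch (the `BranchType` variants mirror the RV32I branch instructions; names are illustrative, not the actual HazardFlow types):

```rust
#[derive(Clone, Copy)]
enum BranchType {
    Beq,
    Bne,
    Blt,
    Bge,
    Bltu,
    Bgeu,
}

/// Decide whether the branch is taken, given the two source operands.
fn branch_taken(ty: BranchType, a: u32, b: u32) -> bool {
    match ty {
        BranchType::Beq => a == b,
        BranchType::Bne => a != b,
        BranchType::Blt => (a as i32) < (b as i32),   // signed compare
        BranchType::Bge => (a as i32) >= (b as i32),  // signed compare
        BranchType::Bltu => a < b,                    // unsigned compare
        BranchType::Bgeu => a >= b,                   // unsigned compare
    }
}

fn main() {
    // `beq x2, x1, target` with equal operands: taken, so the fetch stage
    // must be redirected to the branch target.
    assert!(branch_taken(BranchType::Beq, 7, 7));
    // Signed and unsigned comparisons differ on "negative" bit patterns.
    assert!(branch_taken(BranchType::Blt, 0xffff_ffff, 0)); // -1 < 0 (signed)
    assert!(!branch_taken(BranchType::Bltu, 0xffff_ffff, 0));
}
```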

M1 (reg_fwd):

  • Creates a pipelined stage before executing the ALU.
  • Sends a ready signal which indicates it will be free in the next cycle.

M2 (map):

  • Executes the ALU.
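The ALU executed in M2 can be sketched as a pure function over two operands (the `AluOp` variants shown are a subset of RV32I operations; names are illustrative):

```rust
#[derive(Clone, Copy)]
enum AluOp {
    Add,
    Sub,
    And,
    Or,
    Xor,
    Sll, // shift left logical
    Srl, // shift right logical
    Sra, // shift right arithmetic
}

/// Combinational ALU: compute the result for the given operation and operands.
fn alu(op: AluOp, a: u32, b: u32) -> u32 {
    match op {
        AluOp::Add => a.wrapping_add(b),
        AluOp::Sub => a.wrapping_sub(b),
        AluOp::And => a & b,
        AluOp::Or => a | b,
        AluOp::Xor => a ^ b,
        AluOp::Sll => a << (b & 0x1f), // RV32 uses the low 5 bits as shamt
        AluOp::Srl => a >> (b & 0x1f),
        AluOp::Sra => ((a as i32) >> (b & 0x1f)) as u32,
    }
}

fn main() {
    // Address computation for `lw x4, 4(x3)` with x3 = 0x80002000.
    assert_eq!(alu(AluOp::Add, 0x8000_2000, 4), 0x8000_2004);
    // Arithmetic vs. logical right shift differ on the sign bit.
    assert_eq!(alu(AluOp::Sra, 0x8000_0000, 4), 0xf800_0000);
    assert_eq!(alu(AluOp::Srl, 0x8000_0000, 4), 0x0800_0000);
}
```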

M3 (map_resolver_block_with_p):

  • Attaches the ALU output to the resolver signal for the redirection PC calculation.
  • Stalls until the data hazards have been resolved.

M4 (filter_map_drop_with_r_inner):

  • Attaches the ALU output to the payload.
  • Filters out the payload when the redirection happens.

Memory stage

The memory stage mainly does the following:

  1. Accesses memory for load/store instructions.
  2. Accesses the CSR for illegal instructions and CSR instructions.

It can be decomposed into combinators as follows (code):

Input and Output

The IO interface type of the memory stage is as follows:

Ingress

It takes an ingress interface with type I<VrH<ExeEP, MemR>, { Dep::Demanding }>.

You can check the explanation of ExeEP and MemR here.

Egress

It returns an egress interface with type I<VrH<MemEP, WbR>, { Dep::Demanding }>.

Each of MemEP and WbR is defined as a struct with the following fields:

MemEP (in mem.rs):

  • wb_info: Writeback information which contains the writeback address and data.
  • debug_pc: PC (for debugging purpose).
  • debug_inst: Instruction (for debugging purpose).

WbR (in wb.rs):

  • bypass_from_wb: Bypassed data from the writeback stage.
  • rf: Register file.

Behavior

Each combinator does the following:

M0 (map_resolver_inner):

  • Constructs the ingress resolver of the memory stage.
    • Attaches the bypassed data and redirection PC for resolving data hazards.

M1 (reg_fwd):

  • Creates a pipelined stage before accessing DMEM or CSR.
  • Sends a ready signal which indicates it will be free in the next cycle.

M2 (map + branch):

  • Computes the branch selector with map combinator.
  • Branches the interface into three, for accessing different modules (DMEM / CSR / none).
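The branch selector computed in M2 routes each instruction to exactly one of the three sub-interfaces. A sketch (the `MemSel` enum and the boolean inputs are illustrative names; the routing of illegal instructions to the CSR follows the stage description above):

```rust
#[derive(Debug, PartialEq)]
enum MemSel {
    Dmem, // load/store instructions
    Csr,  // CSR instructions and illegal instructions (to raise the exception)
    None, // everything else passes through
}

/// Pick which of the three branched interfaces this instruction goes to.
fn select(has_mem_access: bool, has_csr_access: bool, is_illegal: bool) -> MemSel {
    if has_mem_access {
        MemSel::Dmem
    } else if has_csr_access || is_illegal {
        MemSel::Csr
    } else {
        MemSel::None
    }
}

fn main() {
    assert_eq!(select(true, false, false), MemSel::Dmem);  // lw / sw
    assert_eq!(select(false, true, false), MemSel::Csr);   // csrr
    assert_eq!(select(false, false, true), MemSel::Csr);   // unimp
    assert_eq!(select(false, false, false), MemSel::None); // add
}
```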

M3 (map + comb):

  • Constructs DMEM request with map combinator.
  • Accesses the external DMEM module with comb combinator.
    • We use an asynchronous memory, so it provides the response in the same cycle.
    • We use the attach_resolver and attach_payload module combinators to attach an additional resolver/payload to the DMEM.

M4 (map_resolver_with_p + map):

  • Attaches the DMEM response to the resolver signal for the bypassing data calculation.
  • Constructs the memory stage egress payload with map combinator.

M5 (map + comb):

  • Constructs CSR request with map combinator.
  • Accesses the CSR module with comb combinator.
    • It provides the response in the same cycle.

M6 (map_resolver_with_p + map):

  • Attaches the CSR response to the resolver signal for the bypassing data calculation.
    • It contains the redirection PC when exception happens.
  • Constructs the memory stage egress payload with map combinator.

M7 (map_resolver_with_p + map):

  • Directly attaches the payload to the resolver signal for the bypassing data calculation.
  • Constructs the memory stage egress payload with map combinator.

M8 (merge):

  • Selects one of the transferable egress interfaces of M4 (DMEM), M6 (CSR), and M7 (none).
    • In-order processing is guaranteed because at most one instruction can be in the memory stage at a time.

Writeback stage

The writeback stage mainly does the following:

  1. Writes the result back to the destination register.

It can be decomposed into combinators as follows (code):

Input and Output

The IO interface type of the writeback stage is as follows:

Ingress

It takes an ingress interface with type I<VrH<MemEP, WbR>, { Dep::Demanding }>.

You can check the explanation of MemEP and WbR here.

Egress

Since this is the last stage, it does not return any egress interface.

Behavior

Each combinator does the following:

M0 (map_resolver_inner):

  • Constructs the ingress resolver of the writeback stage.
    • Attaches the bypassed data and register file for resolving data hazards.

M1 (reg_fwd):

  • Creates a pipelined stage before accessing regfile.
  • Sends a ready signal which indicates it will be free in the next cycle.

M2 (sink_fsm_map):

  • Updates the register file.
  • Attaches the register file to the resolver for reading the values of the source registers.
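The register-file update in M2 amounts to a guarded write (a sketch with an illustrative `rf_write` helper; in RV32I, x0 is hardwired to zero, so writes to it are discarded):

```rust
/// Write the result to the destination register, keeping x0 hardwired to zero.
fn rf_write(rf: &mut [u32; 32], addr: u32, data: u32) {
    if addr != 0 {
        rf[addr as usize] = data;
    }
}

fn main() {
    let mut rf = [0u32; 32];
    // Writeback of `lw x4, 4(x3)`: x4 <- 0xdeadbeef.
    rf_write(&mut rf, 4, 0xdead_beef);
    assert_eq!(rf[4], 0xdead_beef);
    // Writes to x0 are discarded.
    rf_write(&mut rf, 0, 0x1234_5678);
    assert_eq!(rf[0], 0);
}
```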

NPU Core (Based on Systolic Array)


Contributors

Here is a list of all the people who have worked on HazardFlow:

Current Contributors

Previous Contributors

  • Seunghyeon Jeong
  • Tanapoom Sermchaiwong