Runtime Clients
A user can partition their application into multiple STM clients. Every STM client resides in a separate OS process and links to the shared library libstm_runtime.so. Clients register resources and runnables with libstm_runtime.so, which creates a mapping between runnable/resource names and their process-local addresses: function pointers for runnables and resource pointers (e.g., CUDA streams) for resources. After the registration phase, clients yield to libstm_runtime.so by calling stmEnterScheduler(). This function enters a loop that periodically schedules the process's runnables via the previously registered function pointers until termination. To exit, clients can either pre-specify a maximum number of hyperepoch frames to execute as a parameter to stm_master, or they can call stmExitScheduler() from within the application. For example, the application's SIGINT or SIGTERM handler might call stmExitScheduler() to gracefully exit from the blocking stmEnterScheduler() call.
STM clients follow this initialization sequence:
- An external entity (the init system) starts the STM Master and all clients. STM does not require that the master and clients be started in any order.
- The STM Master will load and verify the schedule that was passed to it as a command line argument.
- Each client will initialize and register with the STM master. They each follow the following process:
- STM begins executing runnables according to the schedule.
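The register-then-yield flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the STM API: `MockRuntime`, `registerRunnable`, `enterScheduler`, `mockExitScheduler`, and `onTerminate` are hypothetical names, and the frame limit is passed directly to the loop here, whereas in STM it is given to the master. The model shows a name-to-function-pointer table, a blocking scheduling loop, and an exit flag that either a runnable or a signal handler can set.

```cpp
#include <cassert>
#include <csignal>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy model only: none of these names belong to the STM API.
using RunnableFn = void (*)(void* userData);

struct Registration {
    RunnableFn fn;
    void* userData;
};

// Exit flag; sig_atomic_t makes it safe to set from a signal handler.
static volatile std::sig_atomic_t g_exitRequested = 0;

// Plays the role the text assigns to stmExitScheduler().
void mockExitScheduler() { g_exitRequested = 1; }

// Signal handlers should only do async-signal-safe work, such as setting a flag.
extern "C" void onTerminate(int) { g_exitRequested = 1; }

void installSignalHandlers() {
    std::signal(SIGINT, onTerminate);
    std::signal(SIGTERM, onTerminate);
}

class MockRuntime {
public:
    // Registration phase: map a runnable name to its process-local address.
    void registerRunnable(const std::string& name, RunnableFn fn, void* userData) {
        assert(fn != nullptr);
        table_[name] = Registration{fn, userData};
    }

    // Scheduling phase: execute the named runnables once per frame until
    // maxFrames is reached (0 means no limit) or an exit has been requested.
    // Returns the number of frames executed.
    std::uint64_t enterScheduler(const std::vector<std::string>& frame,
                                 std::uint64_t maxFrames) {
        std::uint64_t frames = 0;
        while (!g_exitRequested && (maxFrames == 0 || frames < maxFrames)) {
            for (const std::string& name : frame) {
                const Registration& r = table_.at(name);  // throws if unregistered
                r.fn(r.userData);
            }
            ++frames;
        }
        return frames;
    }

private:
    std::map<std::string, Registration> table_;
};

// Example runnables used to exercise the model.
void countTicks(void* userData) { ++*static_cast<int*>(userData); }

void exitAfterThree(void* userData) {
    if (*static_cast<int*>(userData) >= 3) mockExitScheduler();
}
```

In this model, a client would call `installSignalHandlers()`, register its runnables, and then block in `enterScheduler()`; the loop returns once the frame limit is hit or the exit flag is set, which mirrors the two exit paths described in the text.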
Requirements for modeling a workload as an STM Runnable
To be modeled as an STM runnable, the workload must satisfy the following requirements:
- Each runnable must run on an independent hardware engine. Synchronous workloads are not permitted; they must be restructured as a CPU runnable that submits another runnable (the asynchronous engine workload) to the engine's queue. The submitting runnable must not wait for the submitted workload to complete. Instead, another runnable can depend on the submitted workload and wait for its completion.
- Any synchronization introduced by the application in a runnable must be captured by the DAG fed to the STM Compiler.
- Only control-flow/timing errors will be detected/handled by STM. Data errors will be ignored - they must be handled by the application runnables.
- To ensure determinism, memory allocations and deallocations must be completed before entering the scheduling loop, i.e., before calling stmEnterScheduler().
- Host memory required for GPU transfers must be page-locked.
- All CUDA Streams used by the Application must be registered with STM, and must only be used within STM Runnables.
- The following synchronous operations are not permitted within an STM Runnable because their use can cause deadlocks/utilization bubbles:
- The use of the default CUDA stream. All CUDA streams used with STM must be created as non-blocking streams.
- The use of cudaDeviceSynchronize() and cudaStreamSynchronize()
- The use of prints and asserts within device kernel code
- The use of CUDA Unified Memory
 
- STM guarantees synchronization between a submittee runnable and its children. However, if additional synchronization is required, the use of cudaEventSynchronize() is recommended because it provides finer-grained synchronization.
- A runnable shall not launch any threads.
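The submitter/submittee split from the first requirement, together with the rule that buffers are allocated before the scheduling loop, can be sketched with a mock engine. Everything here is hypothetical and for illustration only: `MockEngine` stands in for a hardware engine queue such as a CUDA stream, and its worker thread models the engine itself, not a thread launched by a runnable. The mock queue may allocate internally; only the data buffer illustrates the pre-allocation rule.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for an asynchronous hardware engine with an in-order work queue.
class MockEngine {
public:
    MockEngine() : worker_([this] { run(); }) {}

    ~MockEngine() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();  // drains any remaining work before returning
    }

    // Enqueue work and return immediately with a ticket for later waiting.
    std::size_t submit(std::function<void()> work) {
        std::lock_guard<std::mutex> lk(m_);
        queue_.push(std::move(work));
        cv_.notify_all();
        return ++submitted_;
    }

    // Block until the work identified by the ticket has completed.
    void waitFor(std::size_t ticket) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return completed_ >= ticket; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [&] { return stop_ || !queue_.empty(); });
            if (queue_.empty()) return;  // stop requested and queue drained
            std::function<void()> work = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            work();  // execute outside the lock, like an engine running a kernel
            lk.lock();
            ++completed_;
            cv_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    std::size_t submitted_ = 0;
    std::size_t completed_ = 0;
    bool stop_ = false;
    std::thread worker_;  // declared last so all other members exist first
};

struct Pipeline {
    std::vector<float> buffer;  // allocated once, before the scheduling loop
    MockEngine engine;          // declared after buffer, so it is destroyed first
    std::size_t ticket = 0;
    float sum = 0.0f;

    explicit Pipeline(std::size_t n) : buffer(n, 1.0f) {}

    // CPU submitter runnable: enqueues the engine workload and does not wait.
    void submitRunnable() {
        ticket = engine.submit([this] { for (float& x : buffer) x *= 2.0f; });
    }

    // Dependent runnable: ordered after the submitted workload, so it may
    // wait for completion before reading the results.
    void consumeRunnable() {
        engine.waitFor(ticket);
        sum = 0.0f;
        for (float x : buffer) sum += x;
    }
};
```

In this sketch the waiting happens only in `consumeRunnable()`, never in `submitRunnable()`, which is the separation the requirement asks for. With CUDA, the analogous pieces would be a non-blocking stream for the queue and an event for the ticket.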