# WeightedControlScheduling

Abstract — This paper describes a practical technique for the opti- mal scheduling of control dominated systems minimizing the weighted average latency over all control branches. Such a weighted metric is cru- cial for control dependent scheduling to accommodate practical archi- tectural goals. In contrast to most weighting mechanisms, a non- Bayesian probabilistic measure is used to avoid assumptions of branch independence. The underlying scheduling model allows general FSM- based models for operations, captures several forms of speculative exe- cution and scales well with increasing control complexity. I. INTRODUCTION Scheduling is a well studied problem with applications in a wide variety of fields. Of particular interest are applications in High Level Synthesis (HLS), where a behavioral specification is synthe- sized to a cycle-accurate one. Scheduling is a hard problem, being NP-complete even for resource-constrained, acyclic DFG’s. Optimal scheduling techniques have been restricted to relatively simple per- formance metrics in which worst-case or total latency is optimized. For systems with multiple alternative behaviors or control paths, such metrics can dramatically fail to find sensible solutions. For example, consider a small application specific processor (ASIP) with forwarding, exception handling and queued I/O. A conventional scheduler attempts to minimize the worst case sequences – which might occur only at very low likelihood – to the exclusion of all other costs. A scheduler minimizing every control path separately may miss operation sharing opportunities and so increase the power consumption and design complexity. Instead, an intelligent perfor- mance optimization is desired, in which the performance of key instruction sequences (limited by available resources and con- straints) are balanced to meet an appropriate weighted metric. This paper deals specifically with control-dominated data-flow scheduling representing systems with weighted control paths. The data-flow model is acyclic, providing operand based fork and join controls that only affect data dependencies. Data-flow loop restruc- turing is not addressed, however, general finite-state models for operations are supported. In contrast to current branch probability models, which assume independent probabilities, a non-Bayesian weight model is used. Thus, potentially correlated weights are assigned to subsets of control alternatives. This is important in order to fit the metric to simulation and performance modeling results, which have consistently shown that branch probabilities are highly correlated for branches with temporal locality. Such a model has the issue that the set of weights must be treated as a list instead of a computation tree as in the independent case. Because of this require- ment, the problem specification can grow exponentially fast with control complexity. However, in practice, the cost of adequate speci- fication is bounded by the accuracy of the simulation analysis and so is controllable. Given an acyclic control and data dependency specification, a bounded set of FSM-based operator models which map to the speci- fied operations, and a disjoint set of control path weights (which are taken to represent probabilities), a new symbolic algorithm is pro- posed to generate all minimal weighted latency schedules of opera- tions. Specifically permitted in the analysis are all forms of speculation past branches and branch re-ordering, as well as support for run-time or non-deterministic branches. The model does not allow implicit speculation past variable join points. This technique provides a compromise between implicit scheduling complexity and classical CDFG re-write techniques such as strength reduction, join speculation, and loop unwinding which are not addressed here. Moreover, it provides a bounded state model for scheduling which greatly reduces the complexity of optimal search. The remainder of the paper is organized as follows: Section II summarises previous work on this and closely related problems. Section III provides a brief overview of the symbolic scheduling model. Section IV outlines the problem formulation. Section V elab- orates the algorithm details, and provides a proof of optimal con- struction. Section VI provides experimental results, and the paper concludes with Section VII. II. PREVIOUSWORK Boolean symbolic scheduling techniques supporting control were introduced in [2] and greatly improved in [3]. The first approach, [2], is somewhat akin to an integer linear program (ILP) formulation in that bits are assigned denoting completion of every operation at every time step. Unlike ILP, however, the boolean for- mulation allows for a relatively inexpensive exact model of control dependent behavior. The technique was inefficient for long schedule lengths, as complete histories were maintained. Ref. [3] presented an exact automata based approach which addressed the issue of lengthy schedules by encoding merely the completion status of operations rather than complete path histories. This was shown to work on a number of large problems, and further to support practical con- straints on the control and sequencing of the components. One prob- lem with this technique was that large numbers of possible schedules were equivalent using their metric, leading to poor run-time (and space) performance on under-constrained problems. Recently, Cabodi et al. [5] reformulated the problem as an instance of Bounded Model Checking (BMC) to generate single schedules. This technique gains performance at the cost of additional control depen- dency overhead. Manolache et al. [9] consider the different problem of scheduling tasks with stochastic delays, and derive efficient heu- ristics based on approximations of the distributions. The AFAP or path-based scheduling scheme in [10] minimizes individual control path lengths and tries to construct a controller with minimal control states. The technique has strong restrictions on operators types, and further requires that every path be completely ordered prior to scheduling. In [11], the preceding method is extended to allow operation re-ordering, but only within basic blocks. Neither method supports speculative execution. Minimizing each path length is better than ignoring short paths. However, all control branches need to share the same resources. Given potential control restructuring and speculation, the number of operations and opportunities for function unit sharing are not fixed. Such greedy Weighted Control Scheduling Aravind Vijayakumar and Forrest Brewer Department of Electrical and Computer Engineering University of California, Santa Barbara, U.S.A. {aravind, forrest}@ece.ucsb.edu path length reduction will find fast solutions only at the cost of excess resources. Bhattacharya et al. [7] introduced expected execution time as a metric, and proposed a heuristic scheduling technique targeted toward reducing it. Lakshminarayana et al. [8] proposed an algo- rithm with a control representation exposing more parallelism to minimize the expected execution. Both techniques are heuristic, and rely on modeling the branch weights as essentially independent (assuming Bayesian independence in the weights). Dos Santos et al. [1] use average execution time to guide a code-motion based exploration. Though their method guarantees that at least one optimal solution is always present in the search space, there is no assurance that it will be found. Finally, Kountouris and Wolinski [6] describe a list-based scheduler that uses a priority function based on probability of execution. The control model used in [6] is similar to that of [2], [3] and this work in that no explicit fork nodes are placed, avoiding the ad-hoc clustering of operations into basic blocks. III. SYMBOLICSCHEDULING Automata based symbolic scheduling was introduced in [3]. This section briefly covers the terminology and basis of this tech- nique. A detailed discussion of automata based scheduling tech- niques can be found in [4]. A. Terminology The following terms are used in the rest of this work with the stated meanings: ? A control operation is one that produces one or more binary control values (control bits) which represent a forking of the execution into independent behaviours. ? A control cube is a Boolean cube consisting of multiple control bits, representing a set of branch decisions. A cube may be asso- ciated with a weight, representing the probability of that set of decisions. ? A control cube covering exactly one behavioral sequence of the system is termed a control case. ? A trace is a scheduled instance of a control case, consisting of a complete ordering of all operations required for the case. ? An ensemble schedule is a valid, causal collection of traces cov- ering every control case. ? Compatible traces are traces for different control cases that may co-exist in a given ensemble schedule. ? A resolve label is a Boolean function associated with each con- trol operation. It is True only in the cycle when the control resolves (The value becomes available to the system and con- troller by the end of that cycle). B. Operator model Operations are modeled as small NFA (non-deterministic finite automata) which encode the operation’s temporal input and output behavior. Fig. 1 shows a few example operator models. Arcs are labeled by the input requirements, a 1 indicating that an operand must be available in that cycle for the transition to take place. A sin- gle-cycle operation that takes one input can be modeled as in Fig. 1a. Note that the non-determinism is used to model the unknown start time of the operation. After it starts, the model executes in a single cycle and ends in its final state. Typically, the operators are encoded with an all-zero start state, and the termination is modeled by a dense (close to all-1s) cube. This choice of encoding heuristically reduces the size of the representaton. A single cycle control operation is shown in Fig. 1b. Execution traces are represented by unique sequences of states of the modeling automaton. Since distinct control cases can have distinct traces, every control bifurcation has a corresponding bifurcation in the modeling state space to represent alternative traces. These traces are distinguished by a bit (boxed) whose value indicates the branch selection (true or false path) in the simple single cycle operator model of Fig. 1a. Much more complex operators may be created to model pre- scribed sequential behaviors. For example, Fig. 1c shows a non- deterministic 2/5 cycle latency cached memory access. To initiate the access, one operand is needed for the first transition (the address). Since the hit/miss value is known only at run-time, the model has to correctly account for both cases. Thus, this operator behaves exactly like a control operator producing a dynamically computed decision, and correspondingly provides bifurcating states for each case. Fig. 1d models an operation that initiates an abort sequence if it fails to receive inputs in a timely fashion – as the arc labeling indicates, it requires its second input operand to arrive on the third cycle after the first operand. Such complex operator models are required to provide abstrac- tions of subsystem behavior such as operations forced to execute on a reconfigurable pipeline which is a fixed requirement of the design. In general, an “operation” models the sequential production of a data object which is taken to be subsequently available in the system to satisfy data-flow dependency requirements. C. CDFG model The FSM inputs (the arc labels in Fig. 1) controlling the model- ing automata are derived from specification control and data-flow dependency relations. Thus, the enabling input to an operation comes from the completion states of its ancestors via data-flow dependencies. Data availability is simply modeled by the completion 1 0- - - - 11 0 1 - - 1 011 s0s1s2s3s4 abrt0 abrt1 abrt2 - -- - - -- - - - 1-- --1- - -0 s0- s1Hs2H s1Ms2Ms3Ms4Ms5M - - - ---- - 1 1 Figure 1. FSM operator models b. Single cycle control operator c. Cached Memory Access Model d. Sequential operator with abort sequence a. Single cycle operation C a0 b0 a1 b1 Figure 2. Small CDFG Soft dependence Strict dependence of a data-available state of the modeling automata. Control depen- dence is enforced only on operations whose inputs are selected from alternative branches (joins) which are distinguished by one or more control bits in the modeling state. The branch (fork) in Fig. 2 is implicitly captured using soft dependencies on the operations. Soft dependencies only modify the resource constraint mechanism of the schedule: an operation may be started as soon as its input operands (hard dependences) are valid even if it is not known to be necessary. Since control operations are, themselves, subject to this rule, arbi- trary speculation past branches and branch re-ordering are implicitly performed. The control marking of branching traces using soft dependences allows for optimal interpretation of mutual indepen- dence and resource sharing in the model. Note the strict dependence between the control operation and the join which does prevent spec- ulation past the join boundary as mentioned in the introduction. Join dependence is enforced to prevent an explosion in the number of states since an operation with a different selected operand is effec- tively a different operation. D. Scheduling and Validation The scheduling problem is solved via an implicit reachable state traversal of the modeling automata. The traversal starts from a state where only initial operands are known and ends when states encoding pre-specified termination conditions are encountered. Ini- tially, a series of forward image computations performing breadth- first search of the state space determines the quickest state sequence paths to termination states. Once terminating states are reached, backward pruning is performed to cull those states (and transitions) that do not lie on minimal time paths. During this backward prune, a validation fixed point computa- tion is performed to ensure that every trace is part of some ensemble schedule. To see why this is necessary, consider the small CDFG of Fig. 2. Suppose one resource is available to execute the control oper- ation, and one for all the others. Every schedule for this flow consists of two traces, one for each of the possible control values. As sched- uling proceeds, terminal states for both paths are reached in 2 time cycles, but there is no causal ensemble schedule of length 2. This is due to the fact that the early termination of each path relies on hav- ing speculated differently on that path. For example, to complete the 0 path in 2 cycles, the execution of operations a0and C must occur in the first cycle. Since the value of the control operation will not be available until the end of the cycle it executes in, operation a0is s