Oracle-based Optimizer#

The Oracle-based Optimizer is a hypothetical PorcE dynamic optimizer that consults an oracle for information about tasks (including future tasks). The oracle can provide at least the following information:

  • The dependencies of a task
  • The time a task will take to execute
  • The number of times a specific task will be spawned

The goal is to prevent the oracle from subsuming the optimizer by directly answering questions about the best optimization choices. Instead, the oracle provides precise information about tasks that could be measured if we neglect the fact that the task hasn't run yet. In addition, these optimizers make no attempt to reduce the overhead of the optimizer and recompilation itself; they assume the optimizer and compiler are magical, like the oracle.
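
As a concrete reference for the sketches in the rest of this page, the oracle's interface might look like the following Scala trait. The trait and every name in it are hypothetical, chosen only to mirror the three kinds of information listed above; PorcE defines no such trait.

    // Hypothetical oracle interface mirroring the list above.
    trait Task { def run(): Unit }

    trait Oracle {
      def dependencies(task: Task): Set[Task]   // the dependencies of a task
      def executionTimeNanos(task: Task): Long  // the time a task will take
      def spawnCount(task: Task): Long          // times the task will be spawned
    }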

This optimizer will serve as a goal on the horizon, but a real implementation will need to approximate the oracle using efficiently measurable data and assumptions.

Note: Optimizer parameters that may need tuning, or may need to be computed from the runtime environment, are written in monospace camelCase (e.g., inlineThreshold).

Current Optimizer (as of Nov 2017)#

Given access to the oracle, the current spawn optimizer would make decisions as follows: when a task is about to be spawned, if the task will take more than threshold nanoseconds, spawn it; otherwise, inline it into the spawner.
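
A minimal sketch of this rule, reusing the hypothetical Task and Oracle traits above; threshold is the tunable parameter named in the text:

    // Current (Nov 2017) spawn decision: spawn long tasks, inline short ones.
    class CurrentSpawnOptimizer(oracle: Oracle, threshold: Long) {
      def onSpawn(task: Task, spawn: Task => Unit): Unit =
        if (oracle.executionTimeNanos(task) > threshold)
          spawn(task)  // long enough to pay for the scheduling overhead
        else
          task.run()   // inline into the spawner, delaying it
    }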

This has turned out not to work particularly well because it inlines small tasks (thereby delaying the spawner) even if the task will be called repeatedly. This sequentializes the program entirely, which almost always hurts performance significantly.

Batching Optimizer#

This optimizer attempts to combine small tasks into batches and avoids sequentializing by running multiple batches in parallel.

When a task may be scheduled, the optimizer collects all the similar tasks (meaning tasks which have the same implementation but potentially different environments or parameters) which are not dependent on each other. This provides a set of similar tasks which can be executed concurrently.

  • If the total execution time of all the tasks is less than inlineThreshold then each task is individually inlined into its spawn site. This handles inlining sequential chains of tasks, as well as inlining very fast independent tasks where the overhead of spawning even a single batch outweighs the advantage of parallel execution.
  • Otherwise, the set of tasks is divided into a set of batches such that there are at most numberOfBatches batches and each batch takes at least minimumBatchTime seconds. Then each batch is spawned as a macro-task. numberOfBatches is probably based on the number of available processors. This handles wide fan-out, both from publications into a branch and from recursive functions which call a single task repeatedly (such as map). It partitions the tasks across processors in much the same way as parallel collections libraries. Note that this case also applies to a spawn of a single long task, with a batch size of 1.

This optimizer is effectively making two decisions: should the task be inlined, and if not, how should the tasks be collected into efficient batches? Both decisions are sketched below. It would be possible to inline the execution of a batch, but it is highly unlikely to improve performance, since it would require building the batch data structures even if there was no spawn.
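
The following sketch shows both decisions, again reusing the hypothetical Task and Oracle. inlineThreshold, numberOfBatches, and minimumBatchTime are the parameters named above (minimumBatchTime is taken in nanoseconds here for uniformity); the greedy partitioning is only one possible strategy.

    class BatchingOptimizer(oracle: Oracle,
                            inlineThreshold: Long,   // nanoseconds
                            numberOfBatches: Int,    // probably ~ processor count
                            minimumBatchTime: Long) {
      def onSchedule(tasks: Seq[Task], spawnBatch: Seq[Task] => Unit): Unit = {
        val times = tasks.map(t => t -> oracle.executionTimeNanos(t))
        val totalTime = times.map(_._2).sum
        if (totalTime < inlineThreshold) {
          // Decision 1: too little total work to pay for even one spawn.
          tasks.foreach(_.run())
        } else {
          // Decision 2: at most numberOfBatches batches, each at least
          // minimumBatchTime long. A single long task yields one batch of size 1.
          val n = math.max(1L,
            math.min(numberOfBatches.toLong, totalTime / minimumBatchTime)).toInt
          partition(times, n).foreach(spawnBatch)
        }
      }

      // Greedy longest-processing-time partition; a real optimizer could do better.
      private def partition(times: Seq[(Task, Long)], n: Int): Seq[Seq[Task]] = {
        val batches = Array.fill(n)(List.newBuilder[Task])
        val loads = Array.fill(n)(0L)
        for ((t, time) <- times.sortBy(-_._2)) {
          val i = loads.indexOf(loads.min)  // least-loaded batch so far
          batches(i) += t
          loads(i) += time
        }
        batches.toSeq.map(_.result()).filter(_.nonEmpty)
      }
    }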

Latency#

It is possible for this optimizer to delay a task so it can be put in a batch that will not exist for a significant amount of time. This doesn't affect untimed Orc semantic correctness, but it does affect timed Orc semantic correctness. To address latency, the optimizer could take into account the amount of time between the first spawn in each batch and the last, and make sure that gap is no more than a certain length. More specifically, the optimizer would collect a set of independent similar tasks as above, but would only include tasks which can be reached by a critical path shorter than additionalLatencyLimit seconds. This bounds the latency and therefore bounds the magnitude of the timed-semantics violations.
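
A sketch of the bounded collection, under the assumption that the scheduler can report each candidate's critical-path distance (in seconds) from the first spawn that began the batch; additionalLatencyLimit is the parameter named above:

    // Only batch tasks whose inclusion cannot delay them by more than the limit.
    def collectLatencyBounded(candidates: Seq[Task],
                              criticalPathSeconds: Task => Double,
                              additionalLatencyLimit: Double): Seq[Task] =
      candidates.filter(t => criticalPathSeconds(t) < additionalLatencyLimit)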

Nested Parallelism#

This optimizer may not handle nested parallelism well since the batching and spawning decisions are made for every set of similar tasks independently. Ideally, parallelism should be introduced as soon as there is enough parallelism available to fully utilize the processing resources, and no new parallelism should be introduced after that point, to avoid overhead. With ideal batching, no spawning is ever needed within a batch since all batches will run for the same amount of time. However, even with perfect knowledge, batching is not always ideal since there may not be enough tasks to utilize all resources: either because there is not enough work to overcome the overhead for a large number of processors, or because there are simply not as many tasks as processors. If we take into account non-ideal batching, we will need to somehow keep track of how much parallelism has been utilized and for how long, and then only create new parallelism to fill the gap (see the sketch below).
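
One possible shape for that bookkeeping is sketched below; the class and its names are illustrative, not part of PorcE.

    import java.util.concurrent.atomic.AtomicInteger

    // Tracks how much parallelism is live so new batches are only spawned
    // while there is a gap left to fill.
    class ParallelismTracker(processors: Int) {
      private val liveBatches = new AtomicInteger(0)
      def batchStarted(): Unit  = liveBatches.incrementAndGet()
      def batchFinished(): Unit = liveBatches.decrementAndGet()
      def gap: Int = math.max(0, processors - liveBatches.get())
    }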

System Load Sensitivity#

The optimizer could respond to system load by changing the amount of parallelism it maintains. This would require monitoring system CPU utilization, generating parallelism to fill only unused or under-used CPUs, and tracking when PorcE expects its own utilization to drop. This last part is required so that PorcE's own utilization (even the dependencies of the current task) is not treated as in-use CPU. Effectively, PorcE would need to track how many CPUs are in use by non-PorcE tasks and would not try to fill those CPUs (see the sketch below).
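
A sketch of that budget computation, assuming the JVM's com.sun.management.OperatingSystemMXBean as the source of whole-system load (a real API); how PorcE reports its own busy CPUs is an assumption here:

    import java.lang.management.ManagementFactory
    import com.sun.management.OperatingSystemMXBean

    class LoadSensitiveBudget(processors: Int) {
      private val os =
        ManagementFactory.getPlatformMXBean(classOf[OperatingSystemMXBean])

      // CPUs PorcE may use: total minus those busy with non-PorcE work.
      // porcEBusyCpus must be supplied by the runtime so PorcE's own load
      // (including the current task's dependencies) is not counted as external.
      def cpuBudget(porcEBusyCpus: Double): Int = {
        val systemBusy = math.max(0.0, os.getSystemCpuLoad) * processors
        val externalBusy = math.max(0.0, systemBusy - porcEBusyCpus)
        math.max(0, math.round(processors - externalBusy).toInt)
      }
    }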

Overall, this seems like it might be of limited usefulness, but it might provide a much more efficient implementation of, or complement to, nice.
