I recently came across the concept of Capacity and Efficiency Engineering, which is the discipline of designing and operating systems in an optimal way. It is a key pillar of system architecture and is especially important when scale matters, like in quantum computing systems.

At a glance, Capacity and Efficiency Engineering is the science or discipline of optimising production. It does this by balancing maximum output (capacity) with resource consumption (efficiency).

The good thing about Capacity and Efficiency Engineering (CEE) is that it is a well-defined domain, well covered in the literature, and supported by well-structured engineering practices. This memo attempts to extract a few key concepts used in CEE.

Capacity and Efficiency Engineering Pillars

Capacity & Efficiency Engineering is built on three core pillars: Observability, Modelling, and Planning.

Observability → Modeling → Planning

The ultimate goal is to create a plan that can be effectively simulated using a model. The model should be validated against empirical observations. By modelling growth, identifying bottlenecks, and allocating resources strategically, the plan prevents overload or underutilization. This ensures optimal system operation.

Capacity and Efficiency Engineering

It should be noted that the system resource can be anything. It could range from IT resources such as CPU or memory to staff or machinery. In fact, CEE can be applied in various areas. For example, it can help optimise people allocation in project management or reduce cloud hosting costs in infrastructure management.

Observability

Observability is the empirical analysis of a system. In computing systems, logs are often conflated with observability, but they are just one part of the picture. For example, the OpenTelemetry standard encompasses multiple observable signals: logs, metrics, traces, and baggage. For CEE, effective observability should provide:

Resource Usage: Observing and tracking resource consumption. Each observation should specify which resource is used, when, and by whom, and whether any errors occurred.
Performance Metrics: Tracking key performance indicators (KPIs) such as latency and throughput, as well as yield, waste, or error rates.
Bottleneck Identification: Pinpointing constraints in the system that degrade throughput.

The key is to be able to perform post-mortem analysis of the system under all conditions, whether during heavy load, maintenance, or idle periods. It is important to understand in detail what happened, when, and why. Without such observables, it would be impossible to identify the root causes of system failure.

Of course, observing a complete system in full is challenging: observations can be very costly, or even destructive (as in quantum computing, where measuring a system can change its state, a phenomenon known as the observer effect). This means that the simple act of observing the system can alter its behaviour and degrade its performance. The answer is sampling: collecting data from only some events instead of all. The idea is to make the sampling rate (the fraction or frequency of events collected) a tunable system parameter, as shown in the diagram below (credits: OpenTelemetry).

Observation Sampling
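To make the idea of a tunable sampling rate concrete, here is a minimal sketch of a probabilistic sampler. It is not OpenTelemetry's API: the class `ProbabilisticSampler`, the helper `observe`, and the chosen rate of 0.1 are hypothetical names and values used only for illustration.

```python
import random
from typing import Iterable, List, Optional


class ProbabilisticSampler:
    """Keeps roughly `rate` of all events; `rate` can be changed at runtime."""

    def __init__(self, rate: float, seed: Optional[int] = None):
        self._rng = random.Random(seed)
        self.set_rate(rate)

    def set_rate(self, rate: float) -> None:
        # The sampling rate is the tunable parameter: 0.0 records nothing,
        # 1.0 records everything.
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate

    def should_record(self) -> bool:
        # random() returns a float in [0.0, 1.0), so the comparison is
        # never true for rate=0.0 and always true for rate=1.0.
        return self._rng.random() < self.rate


def observe(events: Iterable, sampler: ProbabilisticSampler) -> List:
    """Return only the events that the sampler decides to record."""
    return [event for event in events if sampler.should_record()]


sampler = ProbabilisticSampler(rate=0.1, seed=42)
kept = observe(range(10_000), sampler)
print(f"kept {len(kept)} of 10000 events")
```

Lowering the rate reduces the cost of observation at the price of coarser data, which is the trade-off the sampling rate is meant to expose.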

For a non-computing system, effective observability would make it possible, for example, to explain why a project is delayed. The observable data would likely be the ISO 9001-mandated technical and non-technical documents that record the traceable decisions made during each project phase.

Modeling

With observability established, one can begin to characterise how the system behaves as a function of its parameters. This behaviour forms the foundation for the next pillar: it can be described and translated into a parametric model. Given a set of parameters or conditions, such a model should provide:

Performance: What are the maximum throughput and the latency the system can provide?
Operating Cost: What does it cost to operate the system, in terms of the number of allocated resources? This cost may vary depending on the time of day.
Reconfiguration: What is the transient impact (temporary effects) of updating the system parameters, for example when allocating a new server or QPU and configuring it properly?

The key is to simulate system behaviour and observe its performance under varying operating conditions. Since the model includes resource costs, it can show cost trade-offs under these conditions. This information is important for the planning phase.
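As a rough illustration of what such a parametric model could look like, the sketch below covers the three questions listed above in their simplest form. Everything in it is an assumption made for the example: the `SystemModel` class, its field names, the linear scaling of throughput with the number of units, and the numbers. Latency is deliberately left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class SystemModel:
    unit_throughput: float       # requests/s one unit can serve (assumed)
    unit_cost_per_hour: float    # cost of one allocated unit per hour (assumed)
    provisioning_delay_s: float  # time to bring extra capacity online (assumed)

    def max_throughput(self, units: int) -> float:
        """Performance: assumes throughput scales linearly with units."""
        return units * self.unit_throughput

    def operating_cost(self, units: int, hours: float) -> float:
        """Operating cost: number of allocated units times time."""
        return units * self.unit_cost_per_hour * hours

    def unserved_during_scale_up(self, units: int, demand: float) -> float:
        """Reconfiguration: requests lost while extra capacity is provisioned."""
        shortfall = max(0.0, demand - self.max_throughput(units))
        return shortfall * self.provisioning_delay_s


model = SystemModel(unit_throughput=100.0,
                    unit_cost_per_hour=2.0,
                    provisioning_delay_s=60.0)

# Simulate a few allocations against a demand of 450 requests/s.
for units in (2, 4, 6):
    print(units,
          model.max_throughput(units),
          model.operating_cost(units, hours=24),
          model.unserved_during_scale_up(units, demand=450.0))
```

Even a toy model like this shows the trade-off the planning phase has to resolve: more units cost more per day but lose fewer requests during reconfiguration.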

It is worth mentioning here the difference between simulation and emulation. At a high level, emulation creates a mock version of the system that accepts the same inputs as the real one, while simulation operates on an abstract model of the system. With emulation, it is therefore possible to compare emulated performance with real performance directly. I assume that in the context of CEE, the model mostly refers to a simulator.

Simulation vs Emulation

A good model for a non-computing system could predict the effects of adding more resources to a project. Indeed, adding resources sometimes decreases system performance. This counterintuitive result is known as Brooks’s Law: adding people to a late software project makes it later.

Planning

With the model in hand, you can begin planning system operations. Use the model to simulate various conditions and find the optimal ones, then use these results to strategically assign resources to demand. A good plan should provide demand forecasting, cost management, and fail-safes. Forecasting predicts future usage, traffic, or needs. Cost management limits overprovisioning while maintaining capacity for peak times. Fail-safes provide buffers to handle unexpected demands.

The key is to operate the system efficiently and reliably: handle peak loads without failure, remain cost-effective, use forecasted capacity to reconfigure proactively, and keep a sufficient buffer to absorb unexpected events.

The plan covers both the best-case scenario, where demand is as expected, and the worst-case scenarios, where demand is higher or lower than expected. Covering both is what makes a cost-optimal allocation of capacity possible.
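The planning step can be sketched as a small sizing calculation: take a forecast peak, add a buffer as a fail-safe, and divide by the capacity each unit can deliver at a target utilisation. The function `plan_units` and its parameters are hypothetical, and the default target of two-thirds simply borrows the rule of thumb this memo gives for stable performance.

```python
import math


def plan_units(forecast_peak: float,
               unit_throughput: float,
               target_utilisation: float = 2 / 3,
               buffer: float = 0.25) -> int:
    """Units needed to serve the forecast peak plus a safety buffer,
    while keeping each unit at or below the target utilisation."""
    if forecast_peak < 0 or unit_throughput <= 0 or buffer < 0:
        raise ValueError("invalid demand, throughput, or buffer")
    if not 0.0 < target_utilisation <= 1.0:
        raise ValueError("target_utilisation must be in (0, 1]")
    required = forecast_peak * (1.0 + buffer)           # demand + fail-safe
    usable_per_unit = unit_throughput * target_utilisation
    return math.ceil(required / usable_per_unit)


# Expected demand versus a higher-than-expected scenario (values assumed).
expected = plan_units(forecast_peak=1000.0, unit_throughput=100.0)
stressed = plan_units(forecast_peak=1500.0, unit_throughput=100.0)
print(expected, stressed)
```

Comparing the allocation for the expected scenario with the one for a stressed scenario gives a first idea of what the buffer actually costs.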

Fine-tuning the model

The Chicken and Egg Dilemma

Designing a good model is a chicken-and-egg problem. Unless you know what to observe and how to observe it in the system, it is hard to build an efficient and accurate model. Without an accurate model, the plan is unlikely to produce good results.

When discrepancies occur between the simulated plan and real behaviour, you try to find the missing parameters. This is usually done by analysing logs from the observed system after the fact. The goal is to identify new parameters to observe.

Capacity and Efficiency Engineering

This learning loop is represented as knowledge fine-tuning in the diagram above.
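A first step in this loop is simply to detect where the model and reality disagree. The sketch below compares simulated and observed metrics and flags those that deviate by more than a relative tolerance. The function name, the metric names, and the 10% tolerance are assumptions made for the example.

```python
from typing import Dict


def find_discrepancies(simulated: Dict[str, float],
                       observed: Dict[str, float],
                       tolerance: float = 0.1) -> Dict[str, float]:
    """Return the relative error of every metric whose observed value
    deviates from the simulated one by more than `tolerance`."""
    flagged = {}
    for name, predicted in simulated.items():
        actual = observed.get(name)
        if actual is None:
            continue  # metric is modelled but not observed yet
        # Guard against division by zero when the prediction is 0.
        rel_error = abs(actual - predicted) / max(abs(predicted), 1e-12)
        if rel_error > tolerance:
            flagged[name] = rel_error
    return flagged


simulated = {"throughput_rps": 400.0, "p95_latency_ms": 120.0}
observed = {"throughput_rps": 390.0, "p95_latency_ms": 180.0}
flagged = find_discrepancies(simulated, observed)
print(flagged)
```

Flagged metrics point to the part of the model that is missing a parameter, which is where the search for new observables would start.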

The Engineering Platform

CEE in itself would not be enough without an automated platform to support its whole workflow. This engineering platform is represented in the diagram below.

Capacity and Efficiency Engineering

The idea of the engineering platform is not only to automate the process of scaling the system up and down, which can be delegated to AIOps agents. It is also to continuously verify the adequacy of the simulated model against empirical observations and feed the discrepancies back to an ML agent that can fine-tune the model. This agent is likely a reinforcement learning-based agent, which should be governed if used in real time.

The cost of failing to scale fast enough, aka the cost of delay

For a system to be stable and robust, performance should be capped at stable levels, not maximum ones. As a rule of thumb, stable performance is around two-thirds of maximum performance. Beyond this threshold, the system degrades. The cost of downtime can then quickly, often exponentially, outweigh the cost of overprovisioning.

Cost of delay

In the image on the right, the cost of delay can be considered as the financial impact of not being able to serve the demand while scaling up the system (in real-time). This is a transient cost. When utilisation is low, the delay affects only a few demands. But when the utilisation is high, the delay impacts many demands, and the cost of delay increases exponentially.

The sweet spot is the optimal utilisation that minimises both the cost of delay and the cost of not working (overprovisioning). The green curve shows the total cost as the sum of these two factors. The local minimum of this curve, usually around 2/3 of the maximum performance, marks the optimal utilisation. There’s actually a lot more to say about the sweet spot, which I am sure some will argue is around 80%. The key is to realise that reducing the cost of delay by optimising the system enables moving the optimal utilisation up (image credits: show me the data).
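The sweet spot can be illustrated with a toy cost model. The shapes used here are assumptions chosen for the example, not taken from the figure: the cost of delay grows as u / (1 - u), which rises steeply as utilisation u approaches 1, and the cost of idle capacity falls linearly as (1 - u). The weights and function names are hypothetical.

```python
def total_cost(u: float, delay_weight: float, idle_weight: float) -> float:
    """Toy total cost at utilisation u, with 0 <= u < 1."""
    cost_of_delay = delay_weight * u / (1.0 - u)  # grows steeply near u = 1
    cost_of_idle = idle_weight * (1.0 - u)        # paid-for but unused capacity
    return cost_of_delay + cost_of_idle


def optimal_utilisation(delay_weight: float,
                        idle_weight: float,
                        steps: int = 10_000) -> float:
    """Grid search for the utilisation that minimises the total cost."""
    grid = [i / steps for i in range(steps)]  # 0.0 up to just below 1.0
    return min(grid, key=lambda u: total_cost(u, delay_weight, idle_weight))


baseline = optimal_utilisation(delay_weight=1.0, idle_weight=9.0)
improved = optimal_utilisation(delay_weight=0.36, idle_weight=9.0)
print(round(baseline, 2), round(improved, 2))
```

With these assumed weights, the minimum sits near two-thirds in the first case and near 80% once the delay weight is reduced, which is the effect described above: lowering the cost of delay moves the optimal utilisation up.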

Optimising the system

The cost of improving

If lowering the cost of failure for a system means minimising the time it takes to scale, imagine a system that can adjust itself in milliseconds, or even microseconds, to scale up and meet demand. Of course, developing such a system can be very costly (think high-frequency trading, HFT). But what if the cost of developing such a system were lower than the benefit of being able to operate it at a higher utilisation?

This challenge is also part of CEE. By modelling the cost of improving the system, it becomes possible to create a plan that balances the cost of improvement with the benefit. It is less of an operational challenge and more of a strategic investment. Yet it is still driven by the same principles of CEE applied to the business’s operational strategy, seen as a system.

Value Stream Mapping

Value Stream Mapping (VSM), in the context of Capacity and Efficiency Engineering, is commonly used as a diagnostic tool to determine where capacity is consumed, allowing for the identification of root causes of waste and bottlenecks.

A Value Stream is every step, or activity, in the workflow used by the system to deliver a service, and the map is a visual representation of this end-to-end flow. VSM distinguishes between three types of activities:

Value-Added (VA): Steps that directly contribute to the output the customer wants (e.g. actual computation).
Non-Value-Added but Necessary (NNVA): Steps required by the system but with no direct customer value (e.g. handshakes).
Waste (Muda, or 無駄, a Japanese term meaning “waste”): Steps that consume capacity without adding any value.
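As a rough sketch of how these three categories can be used, the example below tallies the time spent per category for a workflow and computes the share of value-added time. The step names, durations, and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical workflow steps: (name, category, duration in milliseconds).
Step = Tuple[str, str, float]

steps: List[Step] = [
    ("tls_handshake", "NNVA", 30.0),
    ("queue_wait", "waste", 120.0),
    ("compute", "VA", 50.0),
]


def time_by_category(workflow: List[Step]) -> Dict[str, float]:
    """Total time consumed by each category of activity."""
    totals: Dict[str, float] = defaultdict(float)
    for _name, category, duration in workflow:
        totals[category] += duration
    return dict(totals)


def value_added_ratio(workflow: List[Step]) -> float:
    """Share of the end-to-end time spent on value-added steps."""
    totals = time_by_category(workflow)
    total = sum(totals.values())
    return totals.get("VA", 0.0) / total if total else 0.0


print(time_by_category(steps))
print(value_added_ratio(steps))
```

A low ratio indicates that most of the capacity is consumed by steps the customer does not pay for, which is where the map directs attention.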

The goal is to eliminate Muda steps, and VSM excels at identifying them. The power of VSM lies in its ability to identify the right optimisations that can reduce work in progress (WIP). This matters because of Little’s Law, which states that the average Lead Time (or Cycle Time) is equal to the average Work in Progress (WIP) divided by the average Throughput:

Lead Time = Work in Progress / Throughput

This means that if WIP doubles while throughput stays the same, lead time doubles. Controlling WIP is therefore one of the most direct levers for improving delivery latency (the “delay”) and, with it, system efficiency.
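Little’s Law is simple enough to check with a few numbers. The sketch below uses an illustrative function `lead_time` and made-up values (20 or 40 items in progress, 5 items completed per hour).

```python
def lead_time(wip: float, throughput: float) -> float:
    """Little's Law: average lead time = average WIP / average throughput."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


base = lead_time(wip=20, throughput=5)     # 20 items at 5 items/hour
doubled = lead_time(wip=40, throughput=5)  # twice the WIP, same throughput
print(base, doubled)
```

The units follow from the inputs: with WIP in items and throughput in items per hour, the lead time is in hours.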

Conclusion

Voilà, this short Sunday morning memo on Capacity and Efficiency Engineering (CEE) helped me to better understand the topic. For now, what I want to remember is that, when a system yields very low efficiency, it is very likely a sign that the system, or its workflows, are overwhelmed with Muda. The answer to this challenge: invest in reducing the cost of delay rather than in over-provisioning?

Curious about the relative perspectives of various AI agents on CEE? Here is the summary:

Dimension	Gemini	ChatGPT	Claude	Copilot
Primary lens	People & teams	Systems & infra.	Industry & operations	Reliability & scale
Key methods	VSM, WIP limits, Agile, CI/CD	Load testing, stress testing, metrics	Lean, Six Sigma, continuous improvement	Stress testing, bottleneck analysis
Unique angle	Burnout & work-life balance as a metric	Over/under-provisioning as core risk	Traditional industry methods (no tech slant)	“Prevent bottlenecks before they happen”
Metrics emphasis	Cycle time, deployment frequency	Throughput, latency, cost per request	Utilization rates, cycle times	Workload forecasts, scaling thresholds