by Ian Evans, Constraint Technologies International.
Presented at the AGIFORS Airline Operations 2007 conference in Denver, Colorado, on 22 May 2007.
Constraint Technologies has been working on the schedule and crew recovery problem for over 5 years. This paper examines some of the important issues in the development of a recovery optimisation component and its application to a range of different airlines with different types of problem. These include justification, philosophy of operation, integration with crew systems and crew recovery, and software architecture.

This presentation outlines the experience that Constraint Technologies (CTI) has had with schedule and crew optimisation over the last 5 years. When we started, we attempted schedule recovery while considering only equipment, maintenance and passenger constraints. Since then we have encountered types of problem showing that, in many cases, more sophisticated techniques that handle a much greater range of constraint types are vital.
I will also be talking about some future directions that we're working on - that is, what needs to be done at the planning stages to make recovery easier.

One question is: Why do we want recovery optimisation in the first place?
The obvious driver for optimising schedule and crew recovery is the large gap between planned and actual values for key parameters such as crew and aircraft running costs. There is also a strong desire to improve indicators such as the number of departure delays and cancellations, so as to improve the quality of service as perceived by the customer. Extra running costs are a direct expense, while poor customer service costs money indirectly.

Many current operations control systems provide excellent tools to help the operations controllers decide on good strategies to recover from disruptions. Even with the best decision support tools available, however, finding good solutions to large disruptions is extremely challenging due to the complex nature of the interactions between aircraft, crew, passengers and maintenance.
Furthermore, humans tend to rely on rules of thumb when calculating costs, but these can break down in certain situations, leading to the selection of suboptimal solutions.
Lastly, it requires significant aptitude and experience to become a good operations controller, and this can expose an airline to the risk of increased costs if such experienced staff retire or leave.

Our experience to date investigating schedule recovery for a range of airlines has made us realise that some problems are harder to solve than others, and that this is also affected by the type of network.
This table is an attempt to broadly classify this difference, which is explained in more detail in the next few slides.

Simple networks make recovery easier in many ways. Hubbing operations where the crew base and maintenance are at the hub mean that it is easy to isolate a disruption to a single pair of outward and return legs, and there are many options for swapping to reduce the impact. Having a small number of aircraft types and having cross-qualified crew again increases the number of options for recovery.
As soon as there are many passenger connections, or especially crew connections, between aircraft, the situation becomes more complex. The effect is twofold: firstly, any disruption to a leg is more likely to have flow-on effects and, secondly, solutions are more constrained, so it is harder to recover.
Another effect can occur when crew have to travel away from their base for training. Any disruption to legs operated on the way to training could affect the training program, and this could cause long-term flow-on effects. This can be especially severe for long-haul operations, where the cost of paxing crew to training is high.

To give some idea of the sort of complexity that can occur with crew basing, here is a simplified example from a few years ago of a long-haul network showing the location of crew bases. The width of the lines represents the capacity on each network segment. It can be seen that there are some crew bases that are at quite inconvenient locations so that disruptions to crew pairings can be very costly to recover from.
In this example, the flying had changed markedly in the period since basing decisions had been made. This highlights the point that crew basing decisions need to be made cautiously, taking into account scenarios that may occur in the following years. Of course, in an industry as volatile as the airline industry, this is easier said than done.

Another difference that is perhaps not immediately obvious is the difference between short haul and long haul. Again, there are multiple factors that work together to make long-haul problems harder to solve.
Firstly, crew often need a long rest after a single leg, so if a leg has to be diverted due to a port closure then there is likely to be a significant disruption.
Secondly, disruption costs are high due to the larger number of passengers and crew on long-haul aircraft.
Finally, to make things worse, long haul passengers faced with significant disruptions are less likely to cancel trips than short haul passengers, so controllers can't just assume that everything will sort itself out overnight.

I'll just restate what we consider to be the definition of the schedule recovery problem. This is to return to schedule as fast as possible while minimising passenger impact, crew impact, maintenance impact, aircraft costs and the number of changes. In some situations freight impact also needs to be considered.

The definition of the problem given in the previous slide is a long way from a precise mathematical objective function. There are two main problems:
- The relationship between disruptions and costs is not clear, and can change with volatile market conditions. For example, passenger delays may matter more in the face of a competitor's advertising campaign about better on-time departure statistics.
- There is almost always some information that the optimiser doesn't know about.

CTI's concept of recovery optimisation, then, is that it is a decision support problem. Solutions must be good, but if the definition of "optimal" isn't clear then it's more important to provide a range of good solutions, and then to provide tools for the operations controller to select the most appropriate of these.
What does need to be assured, however, is that the solutions are feasible for all factors that are known to the optimiser.

The basic idea is to run the optimiser to generate several solutions. These can then be sorted by various criteria, and any whose criteria fall clearly outside the acceptable range can be discarded.
This then gives a small subset comprising the best solutions, where "best" is defined by ranking criteria selected by the operations controller.
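As a minimal sketch of this discard-then-rank step, the Python below assumes each candidate carries a few summary statistics. The `Solution` fields, the limit values and the lexicographic ranking are hypothetical illustrations, not CTI's data model; a weighted score would be an equally valid ranking choice.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    name: str
    cancellations: int
    total_pax_delay_min: int  # summed passenger delay in minutes (hypothetical metric)
    crew_cost: float          # extra crew cost (hypothetical units)


def shortlist(solutions, limits, rank_keys, size=5):
    """Discard solutions outside acceptable limits, then rank the rest.

    limits: attribute name -> maximum acceptable value.
    rank_keys: attribute names chosen by the controller, in priority order.
    """
    acceptable = [
        s for s in solutions
        if all(getattr(s, attr) <= max_value for attr, max_value in limits.items())
    ]
    # Lexicographic ranking: compare on the first key, break ties with the next.
    acceptable.sort(key=lambda s: tuple(getattr(s, k) for k in rank_keys))
    return acceptable[:size]


candidates = [
    Solution("A", 2, 5400, 12000.0),
    Solution("B", 0, 9100, 8000.0),
    Solution("C", 5, 1200, 4000.0),
    Solution("D", 1, 6000, 9500.0),
]
best = shortlist(candidates, limits={"cancellations": 3},
                 rank_keys=["total_pax_delay_min", "crew_cost"], size=2)
print([s.name for s in best])  # ['A', 'D']
```

Solution C has the lowest passenger delay but is discarded because its five cancellations exceed the (invented) limit of three.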

The remaining solutions can then be compared using various criteria - here the spread of passenger delays is shown.
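Two solutions can have the same total passenger delay but very different spreads. A sketch of such a summary using Python's standard `statistics` module follows; the chosen statistics and the 60-minute threshold are illustrative assumptions.

```python
import statistics


def delay_spread(delays_min):
    """Summarise per-passenger arrival delays (minutes) for one solution."""
    q1, median, q3 = statistics.quantiles(delays_min, n=4)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(delays_min),
        "share_over_60": sum(d > 60 for d in delays_min) / len(delays_min),
    }


# Both invented solutions total 180 passenger-minutes of delay.
even = delay_spread([30, 30, 30, 30, 30, 30])
concentrated = delay_spread([0, 0, 0, 0, 0, 180])
print(even["max"], concentrated["max"])  # 30 180
```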

Here the solutions can be ranked according to a range of user-selected criteria.
Any solution can then be loaded, evaluated, manually tweaked if necessary and then published once it is acceptable.

Given that we want several solutions, the question is how to ensure that they are good ones. One way is to use optimisation to generate the best solution, then add a constraint that disallows this solution and re-optimise to obtain the second-best solution, and so on. A range of solution flavours can be obtained by varying the weightings used to cost different disruption types.
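The sketch below illustrates this "solve, exclude, re-solve" loop on a hypothetical three-flight toy instance. Exhaustive enumeration stands in for a real optimiser, and excluding a found solution plays the role of the added constraint (in an integer programme this would be a no-good cut). Flight names, passenger counts and weights are invented.

```python
from itertools import product

# Hypothetical toy instance: passengers on three disrupted flights.
PAX = {"F1": 150, "F2": 120, "F3": 80}
FLIGHTS = list(PAX)
OPTIONS = (0, 30, 60, "CNX")  # delay in minutes, or "CNX" to cancel


def feasible(sol):
    plan = dict(zip(FLIGHTS, sol))
    # F1's inbound aircraft is 30 minutes late, so F1 cannot leave on time.
    if plan["F1"] != "CNX" and plan["F1"] < 30:
        return False
    # F2 is flown by the same aircraft straight after F1, with no slack.
    if plan["F1"] == "CNX":
        return plan["F2"] == "CNX"
    return plan["F2"] == "CNX" or plan["F2"] >= plan["F1"]


def cost(sol, weights):
    total = 0.0
    for flight, choice in zip(FLIGHTS, sol):
        if choice == "CNX":
            total += weights["cancel_per_pax"] * PAX[flight]
        else:
            total += weights["delay_per_pax_min"] * PAX[flight] * choice
    return total


def k_best(k, weights):
    """Find the best solution, exclude it, and solve again, up to k times."""
    found = []
    for _ in range(k):
        candidates = [s for s in product(OPTIONS, repeat=len(FLIGHTS))
                      if feasible(s) and s not in found]
        if not candidates:
            break
        found.append(min(candidates, key=lambda s: cost(s, weights)))
    return found


cancel_averse = {"delay_per_pax_min": 1.0, "cancel_per_pax": 200.0}
cancel_tolerant = {"delay_per_pax_min": 1.0, "cancel_per_pax": 20.0}
print(k_best(3, cancel_averse))    # [(30, 30, 0), (30, 30, 30), (30, 60, 0)]
print(k_best(2, cancel_tolerant))  # [('CNX', 'CNX', 0), (30, 'CNX', 0)]
```

Changing only the weight dictionary yields a different flavour: with cheap cancellations, the best solution cancels F1 and F2 instead of delaying them. Each extra solution needs a full re-solve, which is why this approach is slow when many solutions are wanted.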
This approach has the advantage that it can obtain good solutions to difficult problems, but it has the disadvantage that it is quite slow to obtain a large number of solutions.

A way of obtaining a large number of solutions is to perform a local search for solutions with costs within a certain bound of the current best solution.
This generates many solutions rapidly, but it relies on starting at a feasible solution. It may remain locked in a solution space near the initial solution, missing out on some better solutions with a different "flavour".
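A minimal sketch of such a bounded local search follows, assuming a hypothetical delay-only toy instance: a solution is a tuple of per-flight delays, a move changes one delay by 15 minutes, and every feasible solution within `bound` of the best cost seen is kept. Breadth-first exploration is an illustrative choice; the search also shows the stated weakness, since it needs a feasible start and only reaches solutions connected to it through affordable moves.

```python
from collections import deque

PAX = (150, 120, 80)       # hypothetical passengers on flights F1, F2, F3
MIN_DELAY = (30, 0, 0)     # F1's inbound aircraft is 30 minutes late
STEP, MAX_DELAY = 15, 120  # delays change in 15-minute steps, capped at 2 hours


def feasible(sol):
    in_range = all(lo <= d <= MAX_DELAY for lo, d in zip(MIN_DELAY, sol))
    return in_range and sol[1] >= sol[0]  # F2 follows F1 on the same aircraft


def cost(sol):
    return sum(p * d for p, d in zip(PAX, sol))  # passenger-minutes of delay


def neighbours(sol):
    """All solutions that differ from `sol` by one step on one flight."""
    for i in range(len(sol)):
        for change in (-STEP, STEP):
            moved = list(sol)
            moved[i] += change
            yield tuple(moved)


def near_optimal(start, bound):
    """Collect feasible solutions within `bound` of the best cost seen."""
    if not feasible(start):
        raise ValueError("local search needs a feasible starting solution")
    best = cost(start)
    seen, queue, found = {start}, deque([start]), []
    while queue:
        sol = queue.popleft()
        c = cost(sol)
        best = min(best, c)
        if c > best + bound:
            continue  # too expensive to report or to expand further
        found.append(sol)
        for nxt in neighbours(sol):
            if nxt not in seen and feasible(nxt):
                seen.add(nxt)
                queue.append(nxt)
    # Re-apply the bound against the final best, which may have improved.
    return sorted((s for s in found if cost(s) <= best + bound), key=cost)


results = near_optimal((60, 60, 0), bound=3000)
```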

Our current preferred approach is to combine the two previous techniques so as to get the best of both worlds.
This technique generates the initial feasible solutions using optimisation techniques, then uses local search to generate a range of solutions near these local optima.
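A generic sketch of the combination is shown below: the seeds are assumed to come from an optimisation phase (for example, one per weighting flavour), and a bounded local search expands around each one. The integer "solutions" and two-basin cost function in the usage example are purely illustrative stand-ins for two solution flavours.

```python
def hybrid_search(seeds, neighbours, feasible, cost, bound, limit=20):
    """Bounded local search around several optimiser-produced seeds.

    seeds: feasible, hashable solutions from the optimisation phase.
    neighbours, feasible, cost: problem-specific callables.
    Returns up to `limit` distinct solutions within `bound` of the best cost.
    """
    pool = set(seeds)
    best = min(cost(s) for s in pool)
    frontier = list(pool)
    while frontier:
        sol = frontier.pop()
        for nxt in neighbours(sol):
            if nxt in pool or not feasible(nxt):
                continue
            c = cost(nxt)
            if c <= best + bound:
                best = min(best, c)
                pool.add(nxt)
                frontier.append(nxt)
    return sorted((s for s in pool if cost(s) <= best + bound), key=cost)[:limit]


# Illustrative only: integer "solutions" 0..10 with two low-cost basins.
def toy_cost(x):
    return min((x - 2) ** 2, (x - 8) ** 2 + 1)


def toy_neighbours(x):
    return (x - 1, x + 1)


def toy_feasible(x):
    return 0 <= x <= 10


both = hybrid_search([2, 8], toy_neighbours, toy_feasible, toy_cost, bound=2)
one = hybrid_search([2], toy_neighbours, toy_feasible, toy_cost, bound=2)
```

Started from seed 2 alone, the search never crosses the expensive region between the basins; seeding from both optimiser solutions recovers the second flavour around 8.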

Operations controllers can only check the feasibility of a solution up to a point. Other departments have more detailed information that needs to be checked.
In our model, the operations controllers would select a small number of solutions (maybe 3) that seem to be the most promising, then circulate these for comment to other departments that are likely to be affected. The best solution that has no major objections from any other departments would then be chosen to be published.

This shows the architecture of our approach.
Data from all systems is available for consideration by the fleet operations controller and the schedule recovery optimiser, but at a reduced level of detail compared with the individual systems.
A small number of solutions is selected by the fleet operations controller and sent to the individual systems for detailed checking.
A solution with no major objections from any party is then published.
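The shortlist-and-review loop can be sketched as follows. The department checks are stand-in callables with hypothetical verdict levels; in practice each would be a query to the corresponding detailed system.

```python
from enum import Enum


class Verdict(Enum):
    OK = "ok"
    MINOR = "minor objection"
    MAJOR = "major objection"


def choose_for_publication(ranked, reviewers, shortlist_size=3):
    """Circulate the controller's top candidates to every department and
    return the highest-ranked one with no major objection.

    ranked: solutions already ordered by the controller's criteria.
    reviewers: department name -> callable(solution) -> Verdict.
    """
    shortlist = ranked[:shortlist_size]
    # Every shortlisted solution is reviewed by every department.
    reviews = [(s, {dept: check(s) for dept, check in reviewers.items()})
               for s in shortlist]
    for solution, verdicts in reviews:
        if Verdict.MAJOR not in verdicts.values():
            return solution, verdicts
    return None, {}  # no acceptable candidate; the controller must decide next steps


# Hypothetical department checks for illustration only.
reviewers = {
    "crew": lambda s: Verdict.MAJOR if s == "S1" else Verdict.OK,
    "maintenance": lambda s: Verdict.MINOR if s == "S2" else Verdict.OK,
}
chosen, verdicts = choose_for_publication(["S1", "S2", "S3", "S4"], reviewers)
print(chosen, verdicts["maintenance"])  # S2 Verdict.MINOR
```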

Our model requires that solutions are feasible, and our experience has shown that there are many checks that must be performed to ensure feasibility. Along with this, it is important to be able to model solution costs adequately in order to be able to select the lowest cost solutions.
The number of required checks is challenging for two reasons:
- The data required for the checks must be available, accurate and up-to-date. This is a considerable integration challenge.
- The optimisation algorithm chosen must be able to handle the large number of different constraints and costs while providing results in a very short timeframe.

The area of crew constraints is a good example of the difference between the level of detail handled by the schedule recovery optimiser and the crew recovery optimiser.
Broadly speaking, the schedule recovery optimiser works within the limits of each duty as currently planned (allowing extensions if the duty is in progress and disrupted), and assumes that changes to the flying within and outside that duty will not affect the limits. This is normally valid for any current duties, but complex interactions between duties within the crewing rules may mean that there are downstream effects that are not adequately modelled within the schedule recovery optimiser.
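A coarse duty check of this kind might look like the sketch below. The field names, limits and extension allowance are hypothetical; real crewing rules are far richer, and the function deliberately ignores effects on later duties, which is exactly the simplification described above.

```python
from dataclasses import dataclass


@dataclass
class Duty:
    start_min: int           # duty start, minutes from a reference time
    max_duty_min: int        # limit from the crew rules as currently planned
    in_progress: bool
    extension_min: int = 0   # extra allowance for disrupted in-progress duties (hypothetical)


def duty_still_legal(duty, new_end_min, disrupted):
    """Compare the revised duty length with the planned limit, allowing an
    extension only if the duty has started and is disrupted. Downstream
    effects on later duties are not modelled."""
    limit = duty.max_duty_min
    if duty.in_progress and disrupted:
        limit += duty.extension_min
    return new_end_min - duty.start_min <= limit


started = Duty(start_min=360, max_duty_min=600, in_progress=True, extension_min=120)
planned = Duty(start_min=480, max_duty_min=600, in_progress=False, extension_min=120)
print(duty_still_legal(started, 1020, disrupted=True))   # True: 660 <= 600 + 120
print(duty_still_legal(planned, 1140, disrupted=True))   # False: no extension before start
```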
Another simplification is to assume that standby crew all have similar qualifications, whereas in reality some might have more restrictions or qualifications than others.

Continuing with the example, the crew recovery optimiser takes detailed crew information into account to come up with accurate costs of solutions proposed by the schedule recovery optimiser.
Note that even the crew recovery optimiser may not know all information required to check feasibility. An example would be if extensions to crew duty in the presence of disruptions are only allowed at the captain's discretion. In this case, the only way to find out if a given duty is legal is to ask the captain.

Perhaps the most difficult part of recovery optimisation is convincing an airline that it is worth implementing. In the modern business environment, any new system must have a business case made to justify the cost of implementation.
Many studies of the costs of disruptions use an overly simplistic model to calculate these costs - for instance taking figures for the number of minutes of departure delays for a sector multiplied by the number of passengers and crew on that sector. This is normally due to a lack of data that could be used in a more realistic model - figures for delays and numbers of passengers on a leg are easy to come by, but the real story is much harder to work out.
Real figures for the number of passengers affected by a disruption need to take into account the number of passengers who were initially booked on that flight - or at least the proportion of those booked on that flight who were expected to arrive. It also needs to take into account passenger connections. This data needs to be captured at the time of the disruption, as final passenger figures don't include passengers who cancelled their trips or transferred to other carriers or modes of transport.
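The difference between the simplistic model and a connection-aware one can be sketched as follows. Passenger groups are assumed to come from booking data captured at the time of the disruption; the group sizes, connection slack and missed-connection delay are invented, and treating a missed connection as a fixed extra delay is a simplifying assumption.

```python
def naive_impact(delay_min, pax_on_board):
    """Simplistic model: delay minutes multiplied by passengers on the sector."""
    return delay_min * pax_on_board


def connection_aware_impact(delay_min, groups):
    """Passenger-minutes of delay at each group's final destination.

    groups: (count, slack_min, missed_connection_delay_min) tuples, where
    slack_min is None for passengers whose trip ends on this leg. A group
    whose connection slack is exceeded is assumed to arrive
    missed_connection_delay_min late instead.
    """
    total = 0
    for count, slack_min, missed_delay_min in groups:
        if slack_min is not None and delay_min > slack_min:
            total += count * missed_delay_min
        else:
            total += count * delay_min
    return total


groups = [
    (100, None, 0),   # local passengers
    (30, 60, 240),    # connecting, enough slack to absorb the delay
    (20, 25, 240),    # connecting, misses the onward flight
]
print(naive_impact(40, 150))                 # 6000
print(connection_aware_impact(40, groups))   # 10000
```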
Crew costs can be estimated separately by taking the total additional crew costs at the end of the month and subtracting those caused by sickness. This technique does not, however, show which disruptions caused the extra costs. To obtain this information, the extra crew costs must be estimated separately at the time of each disruption.

Since a schedule recovery optimiser must be able to estimate solution costs in order to find the "best" solutions, we have realised that the same functionality can be used to calculate disruption statistics and costs for all published solutions, both manual solutions and those from an optimiser.
Saving the statistics from all published solutions allows reports to be generated giving the statistics and/or costs over any desired period.
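A minimal sketch of storing per-solution statistics and reporting over a period is shown below. The record fields and the figures are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PublishedStats:
    published_on: date
    source: str            # "manual" or "optimiser"
    pax_delay_min: int
    cancellations: int
    extra_crew_cost: float


def report(history, start, end):
    """Totals for solutions published between start and end, inclusive."""
    selected = [h for h in history if start <= h.published_on <= end]
    return {
        "solutions": len(selected),
        "pax_delay_min": sum(h.pax_delay_min for h in selected),
        "cancellations": sum(h.cancellations for h in selected),
        "extra_crew_cost": sum(h.extra_crew_cost for h in selected),
    }


history = [
    PublishedStats(date(2007, 5, 1), "manual", 12000, 2, 8500.0),
    PublishedStats(date(2007, 5, 9), "optimiser", 4000, 0, 2100.0),
    PublishedStats(date(2007, 6, 2), "manual", 7000, 1, 3000.0),
]
may = report(history, date(2007, 5, 1), date(2007, 5, 31))
print(may)
```

Because every published solution is recorded in the same way, the same report covers manual and optimiser-generated solutions alike.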

Even if the method of calculating real costs from disruption statistics is open to interpretation, accurate reports of disruption statistics such as this one can provide visibility of the actual amount of disruption occurring.
Naturally, the report can only put an upper bound on the savings possible using recovery optimisation, as an optimiser will not reduce disruption to zero. However, the amount of savings can then be estimated by taking a small number of case studies to see what percentage reduction could have been achieved, and using this percentage along with the total disruption figures to provide a good estimate of the total savings.
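The extrapolation step is simple arithmetic; it is sketched below with invented case-study figures. Each case study is assumed to be a pair of (actual cost, cost of the optimised solution) for a replayed disruption.

```python
def estimated_savings(total_disruption_cost, case_studies):
    """Scale total disruption cost by the mean reduction seen in case studies.

    case_studies: (actual_cost, optimised_cost) pairs for replayed disruptions.
    """
    reductions = [(actual - optimised) / actual for actual, optimised in case_studies]
    mean_reduction = sum(reductions) / len(reductions)
    return total_disruption_cost * mean_reduction


# Invented figures: reductions of 20% and 30%, mean 25%.
savings = estimated_savings(4_000_000, [(100_000, 80_000), (50_000, 35_000)])
print(savings)
```

Averaging the per-case percentages treats every case study equally; pooling the costs instead ((150,000 − 115,000) / 150,000 ≈ 23%) would weight larger disruptions more heavily. Which is more appropriate depends on how representative the case studies are.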

One thing that our experience has shown is that it is very hard or impossible to find good solutions in some situations.
In some cases this is unavoidable, but in others it seems that it would have been much better if the schedules had been more robust in the first place.

In many airlines, there seems to be a disconnect between initial planning, which is optimised to within an inch of its life, and the schedule flown on the day of operations.
Between these two stages there is often another stage designed to maximise revenue, but often this is done without adequate consideration of the extra costs involved.

There has been ongoing work by many groups in the field of robust planning to try to ensure that planned schedules are robust in the face of disruption. Some of this work involves integration of separate planning steps to ensure that the result is better than that achievable from sequential separate optimisation. This requires changes to business procedures that may be difficult in some situations, but more importantly the robustness of the plan can be reduced by subsequent schedule changes for revenue maximisation.
A different and perhaps more powerful approach is to use feedback in all stages of the planning process via statistics of the likelihood of disruption. This does not require changes to business processes, and can also to some extent cater for "disruptions" that typically happen during the revenue maximisation processes.
No matter what is done, it is important to keep a careful watch on increased costs during the revenue maximisation stage. What is wanted is to maximise revenue minus cost, not just revenue. There are many similarities to recovery optimisation here - the main difference is that the "disruption" is intentional, and thus several different alternatives can be explored in order to minimise the resulting costs.
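The difference between maximising revenue and maximising revenue minus cost can be sketched as follows. Here the cost term includes an expected recovery cost, where the disruption probability is assumed to come from historical disruption statistics; all alternatives and figures are invented.

```python
def expected_net_value(revenue, operating_cost, disruption_prob, recovery_cost):
    """Revenue minus cost, where cost includes the expected recovery cost."""
    return revenue - operating_cost - disruption_prob * recovery_cost


def pick_alternative(alternatives):
    """alternatives: name -> (revenue, operating_cost, disruption_prob, recovery_cost)."""
    return max(alternatives, key=lambda name: expected_net_value(*alternatives[name]))


# Invented figures for illustration only.
alternatives = {
    "no_change": (0, 0, 0.0, 0),
    "extra_frequency_tight_turn": (50_000, 30_000, 0.30, 40_000),
    "larger_aircraft": (35_000, 20_000, 0.05, 40_000),
}
by_revenue = max(alternatives, key=lambda name: alternatives[name][0])
by_net = pick_alternative(alternatives)
print(by_revenue, by_net)  # extra_frequency_tight_turn larger_aircraft
```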

Another important way to minimise recovery costs is to make strategic decisions that result in schedules that inherently minimise the impact of any disruption. Important decisions such as minimising the number of aircraft types and keeping crewing simple can increase the robustness of a schedule significantly.
In most cases, complexity has been added after a business case has been made showing that the extra complexity will lead to savings. Often, however, the costs of reduced robustness have not been included in this business case. This is normally because of the difficulty in estimating the costs, rather than because they don't exist. An important long-term research strategy at CTI is to develop validated models to allow these costs to be estimated for any given scenario.

CTI is currently working with Monash and Melbourne Universities to establish a centre of excellence in optimisation technology for the TTL sector.
Current research projects are in the fields of integrated planning, robust planning and robustness cost modelling.

To summarise, the lessons that CTI has learned are that recovery optimisation should involve the generation of several good, feasible solutions followed by the selection of the best of these using information from all systems.
We are also working on strategies for reducing the likely impact of disruptions. These strategies include the use of robust planning, the evaluation of the cost of all changes made during revenue maximisation, and modelling of costs of strategic decisions to allow more accurate business case costing to be performed.