Cause and Corrective Action -- Engineering an Escape from Man-Made Disaster
by Laurence B. Winn
Unexplained field failures are the bane of new product launches. This is especially true when the product is sold on its theoretical reliability but its performance in practice falls short.
A Case Study
In one market incident, a large fraction of gas turbine engine production failed catastrophically within one year of commissioning, far short of the anticipated five- to ten-year life expectancy. Every engine in the group showed a specific burn pattern on the shaft at the location of its seized gas foil journal bearings, but the similarities ended there. Mixed in the lot were builds with parts from different manufacturers, bearings with different design details, and both remanufactured and new components. A naive physical examination of the components of all the engines failed to find a common denominator. The circumstances required a more rational approach, of the kind that good statistics can bring to the table.
Fortunately, electronic monitoring of all the engines in the field had permitted the collection of clean time-to-failure data. We analyzed that data with the Weibull method, which works with small data sets and can give information about the number and type of failure modes. When the failures were divided into groups according to the analysis, patterns emerged.
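For readers unfamiliar with the technique, a Weibull fit on a small set of failure times can be sketched as follows. This is a minimal illustration using median rank regression with Bernard's approximation, not the tool or procedure used in the investigation. The function names and the sample hours are hypothetical, and the sketch assumes every unit in the sample has failed (it does not handle units still running, which a real field analysis would have to treat as suspensions).

```python
import math


def median_ranks(n):
    """Bernard's approximation to the median rank of each ordered failure."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def fit_weibull(times_to_failure):
    """Estimate the Weibull shape (beta) and scale (eta) by median rank regression.

    Uses the linearized CDF: ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta),
    and fits a least-squares line of y on x.
    Assumption: complete data, i.e. every listed unit has failed.
    """
    t = sorted(times_to_failure)
    n = len(t)
    if n < 2:
        raise ValueError("need at least two failure times")
    x = [math.log(v) for v in t]
    y = [math.log(-math.log(1.0 - f)) for f in median_ranks(n)]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                        # slope of the fitted line
    eta = math.exp(x_bar - y_bar / beta)    # from intercept = -beta * ln(eta)
    return beta, eta


if __name__ == "__main__":
    # Hypothetical operating hours at failure, for illustration only.
    hours = [120.0, 340.0, 610.0, 950.0, 1400.0, 2100.0]
    beta, eta = fit_weibull(hours)
    print(f"shape beta = {beta:.2f}, scale eta = {eta:.0f} h")
```

On a Weibull probability plot, a single failure mode tends to fall along one straight line. A bend or dogleg in the plotted points is a common hint that more than one failure mode is mixed into the data, which is the cue for dividing the failures into groups and fitting each separately.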
Statistical Analysis Results
One group, consisting of half the failures, conformed to the pattern of infant mortality, which means that most of the failures occurred almost immediately after commissioning. The bearings that supported the high-speed rotating assembly in this group were all of the same recently-introduced type. When the manufacturing and materials teams examined bearings sampled from the factory process stream, they found cracks in some of the elements, caused by the forming process. Corrective action, at least for the moment, consisted of reversion to the former bearing type.
The analysis identified a second failure mode, accounting for about a third of all failures, in which most of the failures occurred at close to the same time in service. This pattern is usually termed a wearout failure mode; some people compared it to hitting a wall.
After additional engines had accumulated in our morgue, a third failure mode emerged from the data. It showed a more normal distribution, as if the design life had been reduced by something in the environment that affected all of the engines in the group equally.
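The three patterns described above correspond to different values of the Weibull shape parameter. The sketch below shows the standard Weibull hazard (instantaneous failure rate) function and a rough classification by shape. The function names and the tolerance band around a shape of 1 are assumptions chosen for illustration, not values from the investigation.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate of a two-parameter Weibull at time t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def describe_shape(beta, tol=0.1):
    """Rough reading of the shape parameter.

    The tolerance band around 1.0 is an illustrative assumption; in practice
    the confidence bounds on beta decide whether it differs from 1.
    """
    if beta < 1.0 - tol:
        return "infant mortality (failure rate falls with time)"
    if beta <= 1.0 + tol:
        return "random (failure rate roughly constant)"
    return "wearout (failure rate rises with time)"


if __name__ == "__main__":
    for beta in (0.5, 1.0, 3.5, 8.0):
        early = weibull_hazard(100.0, beta, 1000.0)
        late = weibull_hazard(900.0, beta, 1000.0)
        print(f"beta={beta}: {describe_shape(beta)}; "
              f"hazard at 100 h = {early:.2e}, at 900 h = {late:.2e}")
```

A shape well below 1 gives the infant mortality pattern. A large shape concentrates failures in a narrow band of service time, which is the "wall". A shape of roughly 3 to 4 produces a nearly symmetric, bell-like distribution of failure times, which is one way a Weibull fit can look approximately normal.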
At this stage, we had enough information to associate a type of failure mode with each failed engine, and we had identified the physical cause of one of them. Identifying the underlying physics of the remaining two failure modes turned out to be something of a challenge.
Cause and Corrective Action
Building on existing knowledge, we created a Failure Modes and Effects Analysis (FMEA) that incorporated both design and manufacturing potentials in a single matrix, but otherwise followed Automotive Industry Action Group (AIAG) rules. The result of an FMEA is an array which identifies the potential failures associated with each system, subsystem, and component, along with their likelihood, severity, and ease of early detection. When the process was complete, we were able to identify four leading suspects for the remaining two failure modes, not counting interactions. Pareto charting was ultimately helpful, although its use, and that of Quality Function Deployment (QFD) tools, met with much resistance from some members of the team. That, however, is another story.
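One common way to turn an FMEA array into a ranked list of suspects is the risk priority number (RPN) used in classic AIAG-style worksheets: the product of severity, occurrence, and detection ratings, each on a 1 to 10 scale. The sketch below assumes that scheme. The class name, the cause labels, and every rating in it are placeholders invented for illustration; they are not the ratings from our matrix.

```python
from dataclasses import dataclass


@dataclass
class FailureCause:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (remote) to 10 (almost certain)
    detection: int   # 1 (almost sure to be caught early) to 10 (undetectable)

    @property
    def rpn(self):
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(causes):
    """Return the causes sorted from highest to lowest RPN (Pareto order)."""
    return sorted(causes, key=lambda c: c.rpn, reverse=True)


if __name__ == "__main__":
    # Placeholder ratings, for illustration only.
    worksheet = [
        FailureCause("cause A", severity=9, occurrence=6, detection=8),
        FailureCause("cause B", severity=9, occurrence=5, detection=7),
        FailureCause("cause C", severity=7, occurrence=4, detection=9),
        FailureCause("cause D", severity=7, occurrence=3, detection=9),
    ]
    for cause in rank_by_rpn(worksheet):
        print(f"{cause.name}: RPN = {cause.rpn}")
```

Note that the detection scale runs opposite to "ease of early detection": a cause that is hard to catch before it does damage gets a high rating, so it rises in the ranking. Sorting by RPN gives the same ordering a Pareto chart of the worksheet would show.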
The fattest targets in our war plan turned out to be rotor imbalance and cooling flow issues. Rotor imbalance was a particularly thorny problem because it involved both in-house and vendor processes. The third and fourth items, the bearing dynamic characteristics of stiffness and damping, had never been measured, even approximately, in any satisfactory way.
Process mapping and repeatability and reproducibility (R&R) measurements in the balance room convinced us that the uncertainty of our balance measurements had been at least an order of magnitude greater than the design limits. Further, there was a strong dependence on assembly technique, and the onset of field failures correlated with both a transfer of balance room personnel and an increase in production.
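The kind of comparison an R&R study supports can be sketched as follows. This is a simplified variance-components estimate, assuming a balanced study (every operator measures the same parts the same number of times) and ignoring any operator-by-part interaction; it is not the procedure we followed in the balance room. The function names, the data layout, and the multiplier of 6 standard deviations are assumptions for illustration.

```python
import math
from statistics import mean, variance


def gage_rr(measurements):
    """Estimate repeatability and reproducibility variances.

    measurements[operator][part] is a list of repeated readings.
    Assumes a balanced study and no operator-by-part interaction.
    """
    cells = [trials
             for parts in measurements.values()
             for trials in parts.values()]
    # Repeatability: pooled variance of repeat readings within each cell.
    repeatability_var = mean(variance(trials) for trials in cells)

    # Reproducibility: spread of the operator averages, less the share of
    # that spread explained by repeatability alone (floored at zero).
    operator_means = [mean(v for trials in parts.values() for v in trials)
                      for parts in measurements.values()]
    first_operator = next(iter(measurements.values()))
    readings_per_operator = sum(len(trials) for trials in first_operator.values())
    reproducibility_var = max(
        0.0,
        variance(operator_means) - repeatability_var / readings_per_operator,
    )
    return repeatability_var, reproducibility_var


def precision_to_tolerance(measurements, tolerance_width, k=6.0):
    """Ratio of the measurement spread (k standard deviations) to the tolerance."""
    rep, repro = gage_rr(measurements)
    return k * math.sqrt(rep + repro) / tolerance_width


if __name__ == "__main__":
    # Hypothetical imbalance readings (arbitrary units), for illustration only.
    study = {
        "operator 1": {"rotor 1": [1.0, 1.2], "rotor 2": [2.0, 2.2]},
        "operator 2": {"rotor 1": [1.4, 1.6], "rotor 2": [2.4, 2.6]},
    }
    rep, repro = gage_rr(study)
    print(f"repeatability variance = {rep:.3f}")
    print(f"reproducibility variance = {repro:.3f}")
    print(f"precision-to-tolerance = {precision_to_tolerance(study, 1.0):.2f}")
```

A ratio well below 1 means the gauge can resolve the tolerance. A ratio of 10 or more corresponds to the situation described above, where the measurement uncertainty exceeds the design limit by an order of magnitude and a reading says little about the true balance state of the rotor.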
The physical clues that pointed us in the direction of a secondary cooling flow problem were discoloration of a heat shield adjacent to the turbine, and (by process mapping again) conceptual errors in the way the compression of an important static seal was measured. A combination of finite element analysis (FEA) and computational fluid dynamics (CFD) led us to the conclusion that leakage past the seal likely elevated the temperature of the turbine bearing coating beyond its design limits.
Follow-up testing, planned with the Design-Expert software tool by Stat-Ease, gave us 75% confidence that unintentional balance errors of the magnitude we measured would result in vibration sufficient to cause progressive bearing damage, culminating in failure (but still passing our acceptance test!). We could not check the high temperature levels predicted by our analysis due to a lack of nonintrusive instrumentation.
Because the rotor speed, at nearly 100,000 rpm, pushed the inside of the balancing envelope, the original specification had required both component and group balancing. The complicating factor was that the design of the turbine required disassembly of the rotor after balancing, with the potential for an out-of-balance condition arising from reassembly during build. Moreover, the rotor design called for a cost-effective, but potentially damaging, radial interference fit between the shaft and both rotating aerodynamic components (a compressor impeller and a turbine rotor).
Repeated imbalance measurements demonstrated that some rotors sustained more damage from reassembly than others, and the differences could be extreme. So we designed a process that eliminated the worst components from the production stream.
Conclusions and Recommendations
The final report placed each failed engine in one of three failure mode categories, with physical causes assigned. It contained three corrections, one for each of the three failure modes:
(1) Replacement of the bearing type that experienced manufacturing fatigue, producing the infant failure mode.
(2) Dynamic balancing process changes and a minor component design change to control the extremes of imbalance that produced the wearout failure mode.
(3) Heat shield quality control improvements that eliminated the hot gas leakage responsible for the normal failure mode. This was the failure mode with the most normal distribution, produced by shifting the start/stop limitation of the bearings to the left, that is, toward shorter life.
Additional recommendations addressed the possibility that some of the failure modes were actually interactions between influences we could measure and those we could not. We suggested:
(1) A magnetic-bearing-supported quasi-dynamic rig that could at least collect data correlated to bearing dynamic stiffness and damping. Our inability to measure these properties, even as an approximation, had significantly handicapped the investigation.
(2) Our investigation had made it clear that inexpensive design and fine balance requirements were incompatible with the engine assembly process, which required that the rotor be taken apart and reassembled during the build. We recommended the resurrection of an earlier project to perform in-place fine balancing of the rotor after build.
(3) Naturally, quality control procedures implemented to manage heat shield leakage could not satisfy the long-term need. As follow-up work, we suggested specific points for a redesign of the heat shield to eliminate the root cause.
In the author's experience, most projects to determine a cause and corrective action for field failures do not end well. By luck and skill, this one did. Obviously, it is best to avoid failures by scrupulous attention to design details, guided by a thorough FMEA, and capped by a well-designed test program. When engineering judgment and common sense fall to marketing necessity, however, the expert use of statistical tools, as illustrated here, can save the product. After roughly a year of work, all of the failure modes we addressed had vanished from field experience.