VLSI Design

IC reliability and failure mechanisms

Designing integrated circuits is a demanding task, because the resulting systems must work reliably for billions of cycles. Extreme temperatures, external electromagnetic fields, and other changing operating conditions all make reliable operation challenging.

In order to construct robust integrated circuits that will operate for years, several steps need to be taken.
The factors that determine the behaviour of integrated circuits are:

  • Temperature. The operating temperature of an integrated circuit may vary significantly. For example, the transistor junction temperature is the sum of the ambient temperature and the temperature rise caused by the power dissipated in the package, which depends on the power consumption and the thermal resistance of the package. Most devices are specified to operate at junction temperatures of up to 125°C.
  • Voltage. Devices normally operate at a nominal supply voltage, which may vary slightly within a tolerance range.
  • Process. Process variations, which affect doping concentration, dimensions, and layer thickness, can be broadly classified as inter-die or intra-die. For devices, the most important variations are in channel length and threshold voltage. For interconnects, they are in lateral dimensions such as line width and spacing, in thickness, and in other parameters. Process variations are classified as:
    • Lot-to-lot;
    • Wafer-to-wafer;
    • Die-to-die (inter-die);
    • Intra-die.

A lot is a batch of wafers that are processed together.
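The junction temperature relationship described in the temperature item above can be sketched in a few lines. This is a minimal illustration, assuming the usual steady-state model T_j = T_a + θ_ja × P; the function names and the example values (70°C ambient, 2 W, 20°C/W) are hypothetical and not taken from any datasheet.

```python
def junction_temperature(ambient_c: float, power_w: float, theta_ja: float) -> float:
    """Steady-state junction temperature in °C.

    ambient_c: ambient temperature in °C
    power_w:   power dissipated by the device in W
    theta_ja:  junction-to-ambient thermal resistance in °C/W (assumed constant)
    """
    return ambient_c + theta_ja * power_w


def max_power(ambient_c: float, theta_ja: float, tj_max_c: float = 125.0) -> float:
    """Largest power (W) that keeps the junction at or below tj_max_c."""
    return (tj_max_c - ambient_c) / theta_ja


# Hypothetical example values, chosen only for illustration.
tj = junction_temperature(ambient_c=70.0, power_w=2.0, theta_ja=20.0)
budget = max_power(ambient_c=70.0, theta_ja=20.0)
print(f"Junction temperature: {tj:.1f} °C, power budget: {budget:.2f} W")
```

With these assumed numbers the junction sits below the 125°C limit mentioned above; a higher thermal resistance or ambient temperature shrinks the power budget linearly.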

Integrated circuit design involves understanding and analysing potential failures. The most frequent causes of failure are:

  • Latchup
  • Overvoltage
  • Oxide wearout
  • Interconnect wearout

A device failure is a persistent deviation of the device's operation from its original specification. Faults are failures of subsystems. Faults can be caused by external disturbances, device wearout, manufacturing defects, and design bugs.

MTBF (mean time between failures) can be defined as (number of devices × hours of operation) / (number of failures).

FIT (failures in time) can be defined as the number of failures that occur per 1,000 hours of operation per million devices, which is equivalent to the number of failures per 10^9 device-hours. The standard reliability curve of devices is depicted in Figure 1.
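The MTBF and FIT definitions above reduce to simple arithmetic on device-hours. The sketch below assumes every device in the population runs for the same number of hours; the function names and the sample population (1,000 devices, 1,000 hours, 2 failures) are hypothetical.

```python
def mtbf_hours(num_devices: int, hours_each: float, num_failures: int) -> float:
    """MTBF = (number of devices x hours of operation) / number of failures."""
    if num_failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return num_devices * hours_each / num_failures


def fit_rate(num_devices: int, hours_each: float, num_failures: int) -> float:
    """FIT = failures per 1,000 hours per million devices = failures per 1e9 device-hours."""
    device_hours = num_devices * hours_each
    return num_failures * 1e9 / device_hours


# Hypothetical test population, for illustration only.
devices, hours, failures = 1000, 1000.0, 2
print("MTBF (hours):", mtbf_hours(devices, hours, failures))
print("FIT:", fit_rate(devices, hours, failures))
```

The two figures are reciprocal views of the same data: under these definitions, FIT equals 10^9 divided by the MTBF in hours.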

Figure 1. Failures curve.

The early infant mortality region is the region where devices fail early because of manufacturing defects and other process-related problems.

The constant failure region is characterised by a low, roughly constant failure rate.

The wearout failure region is the region where the failure rate rises again as devices wear out late in life.
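The three regions described above can be illustrated numerically. The sketch below is one possible model, not the one behind Figure 1: it assumes the failure rate is the sum of a decreasing Weibull hazard (infant mortality), a constant rate, and an increasing Weibull hazard (wearout). All parameter values are made up for illustration.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Illustrative failure rate (failures per hour) at time t in hours."""
    infant = weibull_hazard(t, shape=0.5, scale=1e6)   # shape < 1: rate falls with time
    constant = 1e-6                                    # random failures, time-independent
    wearout = weibull_hazard(t, shape=4.0, scale=1e5)  # shape > 1: rate rises with time
    return infant + constant + wearout


# Sample one point from each region (times in hours, chosen arbitrarily).
for label, t in [("early", 10.0), ("mid-life", 1e4), ("late", 2e5)]:
    print(f"{label:9s} t={t:>9.0f} h  rate={bathtub_hazard(t):.3e} /h")
```

Under these assumed parameters the rate is high at the start, drops to a low plateau, and climbs again late in life, which is the qualitative shape the three regions describe.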

Devices are usually subjected to accelerated life testing, which simulates the device ageing process within a short amount of time.
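To give a feel for how a short test can stand in for years of operation, the sketch below uses the Arrhenius model, a common choice for temperature-accelerated testing. The text above does not say which acceleration model is used, so the model itself, the activation energy of 0.7 eV, and the two temperatures are all assumptions.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(use_temp_c: float, stress_temp_c: float,
                           activation_energy_ev: float = 0.7) -> float:
    """Acceleration factor AF = exp((Ea/k) * (1/T_use - 1/T_stress)), T in kelvin.

    activation_energy_ev is mechanism-dependent; 0.7 eV is only a placeholder.
    """
    t_use_k = use_temp_c + 273.15
    t_stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# Hypothetical conditions: 55 °C in the field, 125 °C in the stress chamber.
af = arrhenius_acceleration(use_temp_c=55.0, stress_temp_c=125.0)
print(f"Each stress hour represents about {af:.0f} hours at use conditions")
```

Because the factor depends exponentially on the activation energy, a real test plan would take that value from characterisation of the specific failure mechanism rather than from a placeholder.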

Latchup is the phenomenon in which parasitic bipolar transistors, formed by the substrate, well, and diffusion regions, turn on and create a low-resistance path between V_DD and GND, which can lead to catastrophic meltdown. Latchup can be avoided by minimising the substrate and well resistances. This is usually achieved with a thin, lightly doped epitaxial layer on top of a heavily doped substrate.

Overvoltage is a failure process triggered by excessive power supply transients or electrostatic discharge. If the overvoltage appears on a gate, it can cause oxide breakdown.

Oxide wearout is a failure process caused by stress on the gate oxide. The circuit fails because the transistors become too slow and the mismatches between them become too large. Oxide wearout mechanisms include hot carriers, negative bias temperature instability, and time-dependent dielectric breakdown.

Interconnect wearout is a failure process caused by high currents flowing through the wires. For wires carrying DC currents the main cause of failure is electromigration; for wires carrying AC currents it is self-heating.
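The sensitivity of electromigration to current density can be illustrated with Black's equation, MTTF = A × J^(-n) × exp(Ea / kT). The text above does not give this model; it is brought in here as a widely used empirical one. The exponent n = 2 and the activation energy of 0.8 eV are assumed placeholder values. The sketch computes only a lifetime ratio, so the process-dependent constant A cancels out.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def relative_em_lifetime(j_ref: float, j_new: float,
                         temp_ref_c: float, temp_new_c: float,
                         n: float = 2.0, activation_energy_ev: float = 0.8) -> float:
    """Ratio MTTF_new / MTTF_ref according to Black's equation.

    j_ref, j_new: current densities in the same (arbitrary) unit
    n, activation_energy_ev: assumed, process-dependent fitting parameters
    """
    t_ref_k = temp_ref_c + 273.15
    t_new_k = temp_new_c + 273.15
    current_term = (j_new / j_ref) ** (-n)
    thermal_term = math.exp((activation_energy_ev / BOLTZMANN_EV_PER_K)
                            * (1.0 / t_new_k - 1.0 / t_ref_k))
    return current_term * thermal_term


# Doubling the current density at the same temperature (hypothetical case).
ratio = relative_em_lifetime(j_ref=1.0, j_new=2.0, temp_ref_c=105.0, temp_new_c=105.0)
print(f"Lifetime relative to reference: {ratio:.2f}x")
```

With n = 2, doubling the current density at a fixed temperature cuts the estimated lifetime to a quarter, which is why wire widths are typically sized against a maximum DC current density.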