Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing
Subhasish Mitra, Subho Banerjee, Martin Dixon, Rama Govindaraju, Peter Hochschild, Eric Liu, Bharath Parthasarathy, and Parthasarathy Ranganathan
IEEE Design & Test
Abstract
Too many defective compute chips are escaping today’s manufacturing tests – at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach outlining future directions for overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is critical for gaining insights into why so many defective chips escape existing manufacturing testing. (b) In-field detection of defective chips. (c) New test experiments to understand the effectiveness of new techniques for detecting defective chips. These experiments must overcome the drawbacks and pitfalls of previous industrial test experiments and case studies.
Citation
@Article{Mitra2025,
author={Mitra, Subhasish and Banerjee, Subho and Dixon, Martin and Fuller, Mike and Govindaraju, Rama and Hochschild, Peter and Liu, Eric X. and Parthasarathy, Bharath and Ranganathan, Parthasarathy},
journal={IEEE Design \& Test},
title={Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing},
year={2025},
volume={},
number={},
pages={1-1},
doi={10.1109/MDAT.2025.3602741}
}