Subho Sankar Banerjee

Software Engineer @ Google

I am a Software Engineer at Google, working on reliability, analytics, and performance evaluation for large-scale AI systems. I work across different layers of the stack to detect, localize, and mitigate fail-stop, fail-wrong, and fail-silent issues within Google’s infrastructure, spanning CPUs, TPUs, GPUs, and NICs.

My research interests lie at the intersection of AI and systems, specifically applying AI methods to improve system reliability and performance. I completed my Ph.D. in Computer Science at the University of Illinois at Urbana-Champaign, advised by Prof. Ravishankar K. Iyer. My dissertation research focused on establishing a reinforcement-learning-based framework for the control, management, and optimization of large-scale heterogeneous computer systems.

News

Aug 28, 2025 Our paper on silent data corruption from defective chips has been accepted at IEEE Design & Test.
Oct 20, 2021 Our paper on characterizing latency variation in serverless FaaS has been accepted at WoSC 2021.
Aug 20, 2021 Our paper on accelerating PairHMM computations on GPUs has been accepted at ICCD 2021.
Nov 19, 2020 Our paper on correcting CPU-performance counter sampling errors has been accepted at ASPLOS 2021.
Sep 5, 2020 Our SC 2020 paper has been nominated for the best paper and best student paper awards.

Selected Publications

2025

Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing.
Subhasish Mitra, Subho Banerjee, Martin Dixon, Rama Govindaraju, Peter Hochschild, Eric Liu, Bharath Parthasarathy, and Parthasarathy Ranganathan.
IEEE Design & Test.

2021

BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics.
Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
ASPLOS 2021.

2020

Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems.
Saurabh Jha, Shengkun Cui, Subho S. Banerjee, Tianyin Xu, Jeremy Enos, Mike Showerman, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
Supercomputing 2020.
- Best Paper & Best Student Paper Finalist
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices.
Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
OSDI 2020.
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters.
Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
ICML 2020.
ML-driven Malware that Targets AV Safety.
Saurabh Jha, Shengkun Cui, Subho S. Banerjee, James Cyriac, Timothy Tsai, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
DSN 2020.

2019

ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection.
Saurabh Jha, Subho S. Banerjee, Timothy Tsai, Siva K. S. Hari, Michael B. Sullivan, Zbigniew T. Kalbarczyk, Stephen W. Keckler, and Ravishankar K. Iyer.
DSN 2019.
AcMC²: Accelerated Markov Chain Monte Carlo for Probabilistic Models.
Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
ASPLOS 2019.
CAUDIT: Continuous Auditing of SSH-Servers To Mitigate Brute-Force Attacks.
Phuong M. Cao, Yuming Wu, Subho S. Banerjee, Justin Azoff, Alex Withers, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
NSDI 2019.

2018

ASAP: Accelerated Short Read Alignment on Programmable Hardware.
Subho S. Banerjee, Mohamed el-Hadedy, Jong B. Lim, Steve Lumetta, Zbigniew T. Kalbarczyk, Deming Chen, and Ravishankar K. Iyer.
IEEE Transactions on Computers.
Hands Off the Wheel in Autonomous Vehicles? A Systems Perspective on over a Million Miles of Field Data.
Subho S. Banerjee, Saurabh Jha, James Cyriac, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer.
DSN 2018.