I am a Software Engineer at Google, working on reliability, analytics, and performance evaluation for large-scale AI systems. I work across different layers of the stack to detect, localize, and mitigate fail-stop, fail-wrong, and fail-silent issues within Google’s infrastructure, spanning CPUs, TPUs, GPUs, and NICs.
My research interests lie at the intersection of AI and systems, specifically applying AI methods to improve system reliability and performance. I completed my Ph.D. in Computer Science at the University of Illinois at Urbana-Champaign, advised by Prof. Ravishankar K. Iyer. My dissertation research focused on establishing a framework (using reinforcement learning) for the control, management, and optimization of large-scale heterogeneous computer systems.
News
- Aug 28, 2025 Our paper on silent data corruption from defective chips has been accepted at IEEE Design & Test.
- Oct 20, 2021 Our paper on characterizing latency variation in serverless FaaS has been accepted at WoSC 2021.
- Aug 20, 2021 Our paper on accelerating PairHMM computations on GPUs has been accepted at ICCD 2021.
- Nov 19, 2020 Our paper on correcting CPU-performance counter sampling errors has been accepted at ASPLOS 2021.
- Sep 5, 2020 Our SC 2020 paper has been nominated for the best paper and best student paper awards.
Selected Publications
2025
Silent Data Corruption by 10× Test Escapes Threatens Reliable Computing.
IEEE Design & Test.
2021
2020
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems.
Supercomputing 2020. Best Paper & Best Student Paper Finalist.
FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices.
OSDI 2020.
Inductive-bias-driven Reinforcement Learning for Efficient Schedules in Heterogeneous Clusters.
ICML 2020.
2019
ML-based Fault Injection for Autonomous Vehicles: A Case for Bayesian Fault Injection.
DSN 2019.
AcMC²: Accelerated Markov Chain Monte Carlo for Probabilistic Models.
ASPLOS 2019.
CAUDIT: Continuous Auditing of SSH-Servers To Mitigate Brute-Force Attacks.
NSDI 2019.
