About Me
Staff Site Reliability Engineer with 20+ years of experience scaling mission-critical infrastructure supporting millions of users. Expert in building fault-tolerant, secure systems that enable rapid innovation while maintaining high reliability.
Passionate about leveraging automation and modern cloud-native technologies to solve infrastructure problems at scale. Thrive in high-ambiguity environments, bringing clarity and execution focus to complex infrastructure challenges.
Experience
Staff Engineer
Apple
- Architected and led the migration of 280 services from legacy infrastructure to a cloud-native platform, improving reliability, maintainability, and security while lowering support costs and team toil.
- Migrated critical encryption key distribution from a legacy system to a centralized Key Management System, automating rollouts, eliminating manual toil, and significantly improving security posture.
- Established an SLO framework across 280 services and 16 partner teams, bringing critical focus to reliability gaps.
Engineering Manager
Apple
- Refactored the configuration management system to remove duplicate and conflicting systems.
- Grew the site reliability engineering team from 4 to 15 engineers across 3 sub-teams.
- Designed and implemented a new interview process for site reliability engineers, adopted by 50 peer managers.
- Used SLOs with product engineering partners to measure and maintain service reliability above 99.99%.
- Created a new post-incident review process and drove adoption across the larger organization.
- Maintained high team morale and productivity during numerous reorganizations.
Site Reliability Engineer
Stripe
- Drove our "carrot" effort, a tool that scanned Stripe services for reliability risks and made suggestions for teams to improve their reliability.
- Built an internal dashboard to show which code paths were struggling with availability and followed up with the responsible teams.
- Designed and implemented a production readiness review process for new services.
- Reorganized the Stripe post-mortem and incident review process to make sure that we could track metrics from incidents and drive follow-up for remediation items.
Senior Site Reliability Engineer
Netflix
- Drove our "production ready" effort, a checklist of important reliability steps for development teams, and developed a tool that scanned Netflix microservices for reliability risks and made suggestions for teams to improve their reliability.
- Consulted with microservice teams to drive reliability efforts, including monitoring, alerting, deployment practices, application tuning, and chaos resiliency. This effort significantly increased the percentage of critical infrastructure teams that had effective monitoring and alerting.
- Led post-mortem incident reviews to identify root causes and ensure remediation, and created a new "after-incident report" process so that single-page write-ups of incident learnings could be communicated broadly throughout Netflix.
- Participated in the on-call rotation and incident response, handling initial triage, ensuring communication, and keeping engineering teams focused on mitigating customer impact during major outages.
Lead Production Engineer
Quantcast
- Built and maintained bare-metal, high-performance AI/ML clusters supporting large-scale data processing and machine learning workloads.
- Expanded Quantcast's systems, supporting 10x growth while containing costs and keeping response times under 100ms.
- Mentored junior operations engineers and new graduates, from onboarding through training in large-scale system operations and troubleshooting.
- Managed Quantcast's capacity planning and budget, building out 17 edge PoPs worldwide while maintaining capacity and performance goals.
Early Career (2000 - 2010)
Technical Skills
Cloud & Infrastructure
AWS, GCP, Kubernetes, Docker, Terraform, Helm, IaC
Programming
Python, Go, AI-Assisted Development (Claude Code)
Observability
Prometheus, Grafana, OpenTelemetry
SRE Practices
SLI/SLO, Error Budgets, Chaos Engineering, Incident Response
Automation
Jenkins, Spinnaker, CI/CD pipelines, Infrastructure automation
Talks & Publications
Ten Persistent SRE Antipatterns
Pitfalls on the Road to a Successful SRE Program.
Talk SCaLE 15x (Mar 2017)
Configuration Management is an Anti-pattern
Immutable Infrastructure.
Talk Velocity Ignite 2016 (Jun 2016)
The Cloud Will Not Save You
From your technical debt.
Talk USENIX SREcon16 (Apr 2016)
Netflix: 190 Countries and 5 CORE SREs
How does Netflix scale SRE?
Talk SCaLE 14x (Jan 2016)
From Sysadmin to SRE
How Netflix views the Site Reliability Engineer role.
Monitoring with Ganglia
Quantcast case study on metrics scaling and Holt-Winters aberrance detection.
Education
University of Cincinnati
Bachelor of Science in Computer Engineering
Cincinnati, OH
2005