Jonah Horowitz

About Me

Hands-On Engineering Leader with 20+ years of experience scaling mission-critical infrastructure supporting millions of users. I am a deeply technical leader who thrives in management roles, guiding teams to build fault-tolerant, secure systems that enable rapid innovation while maintaining high reliability.

Passionate about leveraging automation and modern cloud-native technologies to solve complex infrastructure challenges at unprecedented scale. I excel in high-ambiguity environments, bringing clarity, strategic direction, and execution focus to engineering organizations.

Experience

Staff Engineer

Apple

March 2023 - Present Cupertino, CA

Architected and led the migration of 280 services from legacy infrastructure to a cloud-native platform, improving reliability, maintainability, and security while lowering support costs and team toil.
Migrated critical encryption key distribution from a legacy system to a centralized Key Management System, automating rollouts, eliminating manual toil, and significantly improving security posture.
Established an SLO framework across 280 services and 16 partner teams, bringing critical focus to reliability gaps.

Engineering Manager

Apple

January 2018 - March 2023 Cupertino, CA

Refactored the configuration management system to remove duplicate and conflicting systems.
Grew the site reliability engineering team from 4 to 15 engineers across 3 sub-teams.
Designed and implemented a new interview process for site reliability engineers, adopted by 50 peer managers.
Leveraged SLOs with product engineering partners to measure and maintain service reliability over 99.99%.
Created a new post-incident review process and drove adoption across the larger organization.
Maintained high team morale and productivity during numerous reorganizations.

Site Reliability Engineer

Stripe

March 2017 - January 2018 San Francisco, CA

Drove our "carrot" effort, a tool that scanned Stripe services for reliability risks and made suggestions for teams to improve their reliability.
Built an internal dashboard to show which code paths were struggling with availability and followed up with the responsible teams.
Designed and implemented a production readiness review process for new services.
Reorganized the Stripe post-mortem and incident review process so that we could track metrics from incidents and drive follow-up on remediation items.

Senior Site Reliability Engineer

Netflix

April 2015 - January 2017 Los Gatos, CA

Drove our "production ready" effort, a checklist of important reliability steps for development teams, and developed a tool that scanned Netflix microservices for reliability risks and made suggestions for teams to improve their reliability.
Consulted with microservice teams to drive reliability efforts, including monitoring, alerting, deployment practices, application tuning, and chaos resiliency. This effort significantly increased the percentage of critical infrastructure teams that had effective monitoring and alerting.
Led post-mortem incident reviews to identify root causes and ensure remediation, and created a new "after-incident report" process so single-page write-ups of incident learnings could be communicated broadly throughout Netflix.
Participated in the on-call rotation and incident response, handling initial triage, coordinating communication, and keeping engineering teams focused on mitigating customer impact during major outages.

Lead Production Engineer

Quantcast

January 2011 - April 2015 San Francisco, CA

Built and maintained bare-metal, high-performance AI/ML clusters supporting large-scale data processing and machine learning workloads.
Expanded Quantcast's systems, supporting 10x growth while containing costs and keeping response times under 100ms.
Mentored junior operations engineers and new graduates, from onboarding through training in large-scale system operations and troubleshooting.
Managed Quantcast's capacity planning and budget, building out 17 edge PoPs worldwide while maintaining capacity and performance goals.

Early Career (2000 - 2010)

Engineering Manager & Software Engineer @ Looksmart (Aug 2007 - Dec 2010)

Software Engineer @ Gemini Mobile Technologies (Jan 2006 - Aug 2007)

Software Engineer @ MediaMaster (Jun 2004 - Dec 2005)

System Administrator @ Walmart.com (Jun 2000 - Sep 2002)

Technical Skills

Cloud & Infrastructure

AWS, GCP, Kubernetes, Docker, Terraform, Helm, IaC

Programming

Python, Go, AI-Assisted Development (Claude Code)

Observability

Prometheus, Grafana, OpenTelemetry

SRE Practices

SLI/SLO, Error Budgets, Chaos Engineering, Incident Response

Automation

Jenkins, Spinnaker, CI/CD pipelines, Infrastructure automation

Talks & Publications

Talk SRECon 17 Americas (Mar 2017)

Monitoring with Ganglia

Quantcast case study on metrics scaling and Holt-Winters aberrance detection.

Education

University of Cincinnati

Bachelor of Science in Computer Engineering

Cincinnati, OH

2005

Talks & Publications

Ten Persistent SRE Antipatterns

Configuration Management is an Anti-pattern

The Cloud Will Not Save You

Netflix: 190 Countries and 5 CORE SREs

From Sysadmin to SRE
