Site Reliability Engineer

Palo Alto, California, United States | Infrastructure | Full-time


We build advanced Network Analytics services that are hosted in public clouds, mainly AWS and very soon GCE. We provide a beautiful, fast and informative interface for visualizing hundreds of thousands of nodes and millions of events per second. We have a tradition of celebrating elegantly designed and well-tested code for every part of our product while making it easy to deploy. We take pride in being able to push our latest code to production environments multiple times a day.

Unusually for an operations team, we work very closely with the development team and have the chance to influence decisions about the product with respect to long-term maintenance and deployability. A strong candidate will have both a systems and a software background.

As an integral member of the Deployment, Infrastructure and Operations team, you will be closely involved in product deployment and troubleshooting. You will be embedded within cross-functional teams that include big data engineers, data scientists and platform engineers. The infrastructure you develop and maintain will enable data center network visibility and management at unprecedented scale.

We hope that you are self-motivated, passionate and curious, someone who multiplies the team's capabilities and enjoys developing libraries and tools. You embrace agile methodologies that involve rapid prototyping driven by feedback gathered from customers, peers and technical support engineers.


  • Architect, implement, test and deliver software and tools to improve the resiliency, availability, scalability, and efficiency of Tetration's services.
  • Troubleshoot and solve the many and varied problems related to keeping Tetration's services running reliably.
  • Build automation to prevent recurrence of problems.
  • Participate in a roster of escalation engineers for customer issues related to Tetration services running in the public cloud
  • Participate in and influence architectures and designs for highly available systems and services


  • BS degree in Computer Science or related field.
  • Understanding of algorithmic complexity.
  • Well versed in algorithms in general, databases, and software design.
  • Experience in one or more programming languages: Python, Java, C++.
  • Can-do, will-do attitude

Preferred Skills

  • Experience in troubleshooting and debugging distributed systems, particularly in AWS
  • In-depth understanding of Linux, and experience at the SysAdmin level
  • CloudFormation
  • Amazon Networking
  • Auto Scaling
  • In-depth knowledge of EC2 features and functionality
  • Good understanding of networking fundamentals (L2; L3 addressing, subnetting and routing; TCP/UDP/ICMP; client/server)
  • Superb troubleshooting skills and root cause analysis, tied together with a sense of ownership
  • Working experience with Hadoop, Druid, ZooKeeper