Senior (5+ years of experience)
Summary of the Role
As a Senior Site Reliability Engineer, you will be responsible for ensuring high availability, performance, security, and scalability of our production systems. You will work closely with development teams to integrate infrastructure builds with application deployment processes. With a strong background in software engineering and systems administration, you will bridge the gap between development and operations by applying a software engineering mindset to system administration tasks.
Required Skills
Systems analysis and troubleshooting in a complex environment.
Coding/scripting to automate systems and infrastructure tasks.
Deep understanding of monitoring solutions and application performance monitoring (APM) tools.
Collaboration skills and ability to work effectively in a team environment.
Experience with continuous integration and continuous deployment (CI/CD) pipelines and DevOps practices.
Qualifications
Bachelor's or master's degree in Computer Science, Information Systems, or a related field, or equivalent experience.
5+ years of experience in a Site Reliability Engineering role or similar.
Experience with cloud services (AWS, GCP, Azure, etc.), container and orchestration technologies (Docker, Kubernetes), and infrastructure-as-code tools such as Terraform.
Strong understanding of networking, security, and database architectures.
Proficiency in at least one of the following languages: Go, Python, Ruby, Java, or C++.
Responsibilities
Design, write, and maintain software to improve the availability, scalability, latency, and efficiency of services.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation; evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.