PRIMARY PURPOSE OF THIS POSITION:
The Lead Site Reliability Engineer (SRE) leads a Site Reliability Engineering program that builds and runs large-scale, distributed, fault-tolerant systems with reliability and uptime appropriate to users’ needs. The Lead SRE is also responsible for the performance, training, discipline, and development of assigned personnel, and provides input and assistance with budgeting, financial management, and technical system design and selection.
ESSENTIAL FUNCTIONS: (other duties may be assigned)
- Coordinate and manage the activities of a Site Reliability Engineering (SRE) team responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning for their assigned system(s).
- Work with appropriate stakeholders to identify Service Level Objectives (SLOs) for critical systems. Identify the Service Level Indicators (SLIs) required to effectively measure SLOs and work with the appropriate technical stakeholders to implement and/or instrument the required systems and processes to measure and monitor the SLIs.
- Build, administer, and participate in a program to minimize change disruptions by identifying, developing, and implementing automation that performs progressive rollouts, detects problems quickly and accurately, and rolls back changes safely when problems arise.
- Identify, pursue, and implement systems and tools to eliminate toil, that is, work tied to running a production service that tends to be manual, repetitive, automatable, tactical, and devoid of enduring value, and that scales linearly as the service grows.
- Administer and participate in, as needed, a postmortem program for all significant incidents. This program should include an investigation to establish what happened in detail, find all root causes of the event, and assign actions (via User Stories) to correct the problem or improve how it is addressed next time.
- Work with the appropriate service owners/stakeholders to develop playbooks containing best practices for troubleshooting steps and tips for the most likely and impactful failure modes. Exercise the playbooks through group tabletop exercises.
- Collaborate with the Systems Monitoring Lead to ensure that monitoring systems, including the corresponding people, processes, and tools, are appropriately defined and implemented.
- Identify and facilitate individual development plans (IDPs), to include both formal and informal development opportunities, for direct reports.
- Assist in the development of annual budgets for assigned area of responsibility and monitor spend and performance to optimize organizational profitability.
- Provide evening and weekend support to the team as needed.
REQUIREMENTS: (Equivalent combinations of education, licenses, certifications and/or experience may be considered)
- A four-year degree in Computer Science, Management Information Systems, or Computer Engineering; or a four-year degree in another field of study that includes courses in computer programming, systems analysis, system development, or systems engineering; or relevant work experience is required.
- 6 years of applicable experience in a technology environment, preferably with time spent in an engineering capacity, is required.
- 2 years of multi-person team management experience, including task assignment, performance coaching and reviews, hiring and firing, and conflict management, is required.
- 2+ years working in a cloud environment, with AWS preferred.
- 4+ years working with performance monitoring tools such as Datadog or Dynatrace.
- Coding experience beyond simple scripts is required.
- Experience with distributed storage technologies such as NFS, HDFS, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN), is preferred.
- SRE Practitioner, SRECP, or equivalent certification is preferred.
TOOLS & EQUIPMENT:
- General Office Equipment