Site Reliability Engineering Manager

Job Purpose:   
As Site Reliability Engineering Manager, you will lead and support a team of site reliability engineers who identify challenges, analyze causes, and apply corrective action to ensure that our systems are reliable, scalable, and performant in line with agreed service level objectives.
Reporting to the Principal SRE (Site Reliability Engineering) Lead, you will be part of the team responsible for supporting 24×7 uptime and availability of mission-critical production services within the Bank. You will help to create more consistent, automated environments across all applications and services; proactively test and tune all aspects of the platforms; streamline CI/CD processes; monitor and respond to system notifications and alerts; and continually work to optimize and improve the performance, security, and reliability of our systems.
Job Responsibilities/Accountabilities

Lead SRE (Site Reliability Engineering) initiatives in your areas of focus
Mentor and support the members of the team to achieve high levels of performance
Lead the identification and establishment of service level indicators (SLIs) to support service level objectives (SLOs)
Take ownership of system and service availability, stability, resilience, and health
Provide technical leadership in initiatives to improve the availability, stability, and resilience of our services
Lead incident response activities to restore services
Collaborate with Dev teams to improve services through rigorous testing and release procedures
Participate in architecture design, platform management, and capacity planning exercises
Create sustainable systems and services through automation and uplifts

Qualifications and Skills

Bachelor’s degree in computer science or equivalent
5+ years’ experience as an SRE/DevOps Lead
Experience in managing SRE/DevOps/Software engineers
Strong oral and written communication skills
Attention to detail and strong troubleshooting skills
Demonstrable experience with containerization (Docker) and orchestration (Kubernetes)
Demonstrable experience with CI/CD tools such as Azure DevOps, CircleCI, and Jenkins
Good understanding of Infrastructure as Code (Terraform, CloudFormation, Ansible)
Familiarity with Linux and UNIX systems and with command-line system administration tools such as Bash, Vim, and SSH (Secure Shell)
Basic scripting skills (preferably Golang, Bash, or other shell scripting)
Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools such as Dynatrace, Azure Application Insights, Prometheus, and SolarWinds
Good understanding of networking concepts (e.g., routing, load balancing, and network protocols), including a basic knowledge of TCP/IP and an understanding of HTTP and DNS
Experience in programming (structured and OOP) with one or more high-level languages, such as Python, Java, .NET, or JavaScript
Knowledge of and proven hands-on experience with large-scale databases and distributed technologies, such as Kafka and Redis, will be an added advantage

Apply via:

equitybank.taleo.net