Brief Description
Reporting to the Senior Manager – Systems Engineering, the SRE Team Lead will champion operational excellence by driving the adoption of SRE best practices and ensuring system availability, performance, efficiency, change management, monitoring, emergency response, security, and capacity planning.
Key Responsibilities:
Oversee and lead the implementation of SRE frameworks and practices within the organization using the systems operations toolchain. Foster a collaborative and inclusive team culture that emphasizes reliability, innovation, and continuous improvement.
Team Management: Manage team performance while fostering an environment of trust, learning, and collaboration, and cultivate a culture of high performance.
Build, recruit, retain, manage, and develop a world-class SRE team.
Operational Excellence – Define, measure, monitor, and report key SRE performance indicators, and escalate breaches and violations. These indicators inform the team’s maturity level as well as the backlog and related decisions. Collaborate with cross-functional teams to identify, prioritize, and address reliability issues.
Stakeholder Engagement: Engage business teams and promote a culture of participation and collaboration to support effective, informed decision-making.
Define, measure, monitor, and report key systems reliability performance indicators, and escalate breaches and violations.
Problem and Incident Management – Lead incident response efforts, ensuring that incidents are resolved quickly and effectively while minimizing downtime and customer impact. Conduct post-incident reviews to identify root causes and implement preventive measures.
Capacity Planning – Monitor system resource utilization and plan for capacity upgrades as needed to support business growth. Optimize resource allocation and cost-efficiency.
Security and Compliance: Collaborate with security teams to ensure the reliability and security of systems and applications. Ensure compliance with relevant industry standards and regulations.
Drive continuous improvement of applications through planned chaos simulations, AIOps, automation, and proactive alerting strategies.
Document “tribal” knowledge and keep playbooks and runbooks up to date so that teams get the information they need right when they need it.
Champion and lead the implementation of machine learning and self-healing capabilities, and drive the organization towards a no-ops model.
Qualifications:
Bachelor’s degree in Computer Science, Information Technology, or a related field (Master’s degree preferred).
Several years of experience in SRE or a related field, with a proven track record of improving system reliability.
Strong leadership and team management skills.
Proficiency in programming/scripting languages (e.g., Python, Go, Ruby).
Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
Knowledge of cloud computing platforms (e.g., AWS, Azure, Google Cloud).
Familiarity with monitoring and alerting tools (e.g., Prometheus, Grafana, ELK Stack).
Excellent problem-solving and communication skills.
Ability to work in a fast-paced, dynamic environment and handle high-pressure situations effectively.