Service Reliability Engineer

Detailed Description
Reporting to the SRE Lead, the Service Reliability Engineer will be responsible for stabilizing production systems, improving system availability and reliability, and ensuring automation of operational tasks, change management, system monitoring, incident response and capacity planning. In addition, this role will be responsible for:

Ensuring operational excellence by proactively building and implementing services, including end-to-end monitoring, scripting and automation, modern tooling, and maintenance of software;
Providing software-related operations support, including level two and level three incident and problem management;
Defining, measuring, monitoring and reporting key SRE performance indicators and escalating breaches and violations;
Documenting “tribal” knowledge and keeping playbooks and runbooks up to date so that teams get the information they need right when they need it; and
Implementing machine learning and self-healing capabilities to drive the organization towards a no-ops model.

Key Accountabilities:

Run the production environment by monitoring availability and taking a holistic view of system health.
Create sustainable systems by driving continuous improvement of the applications through chaos experiments, automation, ML/AIOps and proactive alerting strategies.
Build and set up new development tools and infrastructure.
Automate and improve development and release processes.
Implement SRE frameworks and practices within the organization using the systems operations tool chain.
Operational Excellence – ensure systems availability, performance, efficiency, change management, monitoring, emergency response, security, and capacity planning.
Stakeholder Engagement – engage the business teams and promote a culture of participation and collaboration to support effective and informed decision making.
Define, measure, monitor and report key systems reliability performance indicators and escalate breaches and violations, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
Continually improve skills and competencies by proactively participating in various internal and external training opportunities and stretch assignments.
Research new fit-for-future technologies and actively implement viable solutions.

Job Qualifications:

Bachelor’s Degree in Computer Science, Information Systems, Software Engineering, IT, or another related field
More than three years of work experience in programming and/or systems analysis, applying agile frameworks
Strong familiarity with web servers and load balancing technologies
Experience using SRE tools such as Ansible, Rundeck, Terraform
Experience using monitoring tools such as Dynatrace/ELK/Splunk
Experience working with multiple programming and markup languages, such as Java, XML, JSON, YAML, Python
Experience with Unix/Linux/AIX operating systems and application security technologies, e.g., SSL
Experience using code versioning and collaboration tools such as Git and Bitbucket
Strong knowledge of software architecture principles
Strong analytical and problem-solving skills
Experience working with agile methodologies, such as Scrum and Kanban
Professional experience and knowledge of the telecommunications industry preferred
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks

Apply via:

safaricom.taleo.net