Engineer, Reliability

Job Purpose
To create a bridge between development and operations by applying a software engineering mindset to system administration. To focus on operations/on-call duties and developing systems and software that help increase site reliability and performance. To build self-service tools for users that rely on such services; to collaborate with product developers to ensure that the designed solution responds to non-functional requirements such as availability, performance, security, and maintainability.
Key Deliverables 

Automate CI/CD pipeline for both legacy architecture and containerized platforms using infrastructure as code and software development skills so as to increase the speed and quality of software delivery.
Automate the provision of, and modifications to infrastructure of production and non-production environments to minimize configuration drift and maintain consistency across environments.
Build dashboards to improve visibility of the build and release processes, system performance, availability, latency, throughput and error rate.
Conduct and document post-mortems and incident reviews, and take action on outcomes to maximise learnings so as to prevent repeat incidents and improve future responses.
Continuously improve monitoring, incident response, and the optimisation of service availability and performance, and suggest methodical approaches for implementation. Communicate proposed changes across the organisation to ensure efficient and structured production support and emergency response.
Define and communicate the scale, capacity, security, performance attributes, and requirements of the service and technology stack.
Define and implement mechanisms to monitor service-level indicators for the underlying service: set units of measurement that define the service level customers can expect of the system, define the desired outputs of the system in terms of availability, and communicate the expected reliability of the service to customers, in order to increase the speed at which the business can release new features and services.
Design and implement monitoring solutions in order to identify performance errors and maintain service availability.
Develop software to automate manual processes to expedite problem detection and mitigation.
Drive collaboration between people, processes and technology to create a proactive system of incident response and remediation.
Drive the improvement of service performance metrics such as latency, page load speed and ETL by proactively identifying performance issues across the system so that customers are enabled to make full use of the system.
Ensure an efficient system for incident response by making the appropriate information available in order to quickly identify and fix problems.
Identify and automate manual and repetitive work to reduce toil.
Identify and implement mechanisms to reduce the noise in alerting and maximise the signal, so that notifications are sent only for problems that need human intervention and are directly related to a defined and agreed SLO.
Identify opportunities and implement solutions to optimise service monitoring, availability and performance.
Provide insight and guidance on the end-to-end performance and operability of a service. Partner with development teams to define and implement improvements in service architecture.
Provide insights into the design and implementation of services with a focus on security, resiliency, scale, and performance by having a rich understanding of the end-to-end configuration, technical dependencies, and overall behavioural characteristics of the production service(s).
Validate recovery and failover strategies by performing rigorous system failure testing.

QUALIFICATIONS
Minimum Qualifications

Type of Qualification: First
Field of Study: Information Technology

Experience Required
Software Engineering

Technology
5-7 years
Proven experience in IT software development with at least one programming language, and experience building scalable systems with service-oriented architectures.

ADDITIONAL INFORMATION
Behavioral Competencies:

Adopting Practical Approaches
Articulating Information
Checking Details
Developing Expertise
Documenting Facts
Embracing Change
Examining Information
Interpreting Data
Managing Tasks
Producing Output
Taking Action
Team Working

Technical Competencies:

Application Knowledge for Support
Business Continuity and Disaster Recovery Planning
Information Technology Architecture
Infrastructure and Platforms Support
IT Design Driven Development
Service Management Processes
Use of Build and Test Automation
Use of Version Control

Apply via:

www.standardbank.com
