We seek a Site Reliability Engineer with 3 to 5 years of hands-on experience to uphold the reliability, performance, and scalability of our cloud infrastructure and services. Ideal candidates will demonstrate a strong commitment to monitoring, automation, and the development of resilient, observable, and highly available systems.
What You’ll Do
Develop, execute, and optimize CI/CD pipelines to ensure seamless and high-performance software delivery.
Use Docker to containerize applications and manage deployments on AWS services, including ECS, EC2, and ALB.
Monitor system performance metrics, build and maintain dashboards, and set up automated alerts so anomalies are flagged promptly. Review and interpret system logs to detect issues early and correct them before they escalate.
Oversee the infrastructure architecture to ensure it scales efficiently, optimizes costs, and maintains high availability at all times.
Lead incident response, perform thorough root cause analyses, and implement corrective actions to prevent similar issues from recurring.
Use Python and Bash to automate operational workflows, improving efficiency and reliability.
Work collaboratively with development teams to enhance deployment workflows and strengthen application instrumentation.
Develop and implement disaster recovery strategies, encompassing backup procedures, failover systems, and resilience assessments to ensure operational continuity.
What We’re Looking For
Three to five years of hands-on experience in DevOps, Site Reliability Engineering (SRE), or cloud operations.
Proficiency in AWS services, including ECS, EC2, and ALB, as well as expertise in managing cloud infrastructure, is required.
Proficiency with monitoring and observability tools such as Prometheus, Grafana, and Loki or the ELK stack.
Proven expertise in designing, implementing, and sustaining continuous integration and continuous delivery (CI/CD) pipelines is required.
Demonstrated expertise in Docker and container orchestration platforms is required.
Proficient in developing scripts and automating processes with Python and Bash.
Strong problem-solving skills and the ability to diagnose and resolve complex production issues efficiently.
Nice to Have
Proficiency in Infrastructure as Code (IaC) tools, specifically Terraform.
Experience managing Kubernetes (EKS) environments.
Professional experience with MongoDB Atlas operations.
Skilled in optimizing cloud expenditures and enhancing system performance through targeted tuning strategies.
What Success Looks Like
A successful candidate consistently delivers measurable results aligned with organizational objectives. They apply strong analytical and problem-solving skills to identify inefficiencies and implement data-driven solutions, collaborate smoothly across teams, adapt to changing conditions, and stay committed to continuous improvement. Proficiency in the relevant tools and methodologies and a results-oriented mindset distinguish top performers.
Highly dependable, adaptable systems that run with streamlined operational efficiency.
Comprehensive visibility into system health and performance metrics across all services.
Improved system reliability through fewer incidents and faster recovery.
Automation integrated into operational workflows, making deployments highly efficient.
Qualifications
BA/BSc/HND
Experience Required
3 - 5 years