Description:
We are seeking a talented and motivated Site Reliability Engineer to join our growing team. As an SRE at Cloudbeds, you will be responsible for ensuring the reliability, availability, and performance of our systems and applications. You will collaborate with cross-functional teams to design and implement scalable and resilient solutions, leveraging automation and best practices in site reliability engineering. You will have ample opportunities for architecture design and implementation within AWS cloud infrastructure, in a largely bottom-up team culture that values healthy debate. As an SRE, you will help us provide the highest-quality full-stack management solution for hotels, B&Bs, hostels, and vacation rentals all over the world.
Location: REMOTE - Europe
What You Will Do
- Design and implement reliable, scalable, and efficient systems to meet the needs of the organization.
- Maintain and support highly loaded Kubernetes (EKS) clusters and infrastructure-related components.
- Develop and continuously improve product monitoring and logging systems based on the Prometheus, Datadog, and Loki stacks.
- Respond to and resolve incidents, ensuring minimal impact on services.
- Collaborate with development teams to establish Service Level Objectives (SLOs) and ensure systems meet or exceed reliability targets. Optimize system performance and troubleshoot issues as they arise.
- Support development teams by sharing SRE best practices and expertise, and assist with environment and application configuration from a resiliency perspective.
- Collaborate with security teams to implement and maintain security best practices.
- Support the release process via CI/CD pipelines.
- Automate the platform with infrastructure-as-code and configuration management.
- Maintain clear and comprehensive documentation for systems, processes, and procedures. Share knowledge with team members to enhance overall understanding.
- Participate in the on-call rotation for production environment outages.
You’ll Succeed With
- Bachelor’s degree in Computer Science or related field, or equivalent experience.
- 3+ years of experience as a DevOps Engineer or SRE, working with AWS.
- Exceptional skills in Linux system administration.
- 2+ years of strong experience with Kubernetes, Docker, and Helm charts.
- Experience implementing and scaling Amazon Elastic Kubernetes Service (EKS) platforms.
- Strong experience with application containerization methodologies and delivery.
- Strong experience with monitoring, logging, and alerting technologies (any of ELK, Datadog, Loki, AWS CloudWatch).
- Experience with infrastructure-as-code tools such as Terraform.
- Experience designing, building, and supporting CI/CD pipelines (GitHub Actions, Bitbucket Pipelines, and Argo CD).
- Experience with web application servers (NGINX, Ingress controllers, traffic load balancing), databases (MySQL, PostgreSQL, Aurora), cache technologies (Redis or Memcached), and queue technologies (SQS).
- Ability to write Bash/Python scripts.
- Good networking skills.
- Good written and verbal communication in English.
- Good team player qualities.
- Ability to work remotely and manage your own time in a global team.