Senior Staff Site Reliability Engineer

Description:

The SRE org at Matillion is made up of multiple teams that together own the operation and efficiency of our cloud platforms and services. It covers everything from the build, provisioning and maintenance of our cloud infrastructure to the reliability, capability management, observability, monitoring and metrics of our SaaS platform.

Reporting to the Director of SRE and Observability, you will utilise your experience across all pillars of Site Reliability Engineering to drive best practice aimed at enhancing our ability to build truly reliable, observable and performant infrastructure for all our core services. Your experience building modern, multi-cloud platforms will play a pivotal role as we continue to modernise our stack and implement a wide range of new tools around logging, monitoring, metrics and alerting.

What You’ll Be Doing:

Lead the design of major software components, systems, and features to improve the availability, scalability, latency, and efficiency of Matillion’s SaaS services
Drive the design, implementation and management of our expanding observability infrastructure, keep up to date with new tools and technologies, and be a recognised member of the broader observability community
Lead sustainable incident response, blameless postmortems, and production improvements that result in direct business opportunities for Matillion
Define and document best practices across all pillars of SRE
Provide guidance and mentorship to other team members on managing the end-to-end availability and performance of critical services, design techniques and coding standards, to cultivate innovation and collaboration across the business
Balance competing priorities as you manage a range of individual projects, deadlines, and deliverables

What We’re Looking For:

A passion for everything performance, observability, availability, scalability and security
Extensive experience with Kubernetes and its surrounding ecosystem, including tools such as Linkerd, Traefik and ArgoCD
An adopter and champion of core SRE principles including SLAs, SLOs, automation, proactive monitoring, release and deployment
Exposure to working with high-traffic, large-scale web operations in AWS
Ability to manage and provision infrastructure as code with Terraform or CloudFormation, and to build internal tooling in languages such as Go or Python
A solid understanding of networking systems and protocols

Organization	Matillion
Industry	Engineering
Occupational Category	Senior Staff Site Reliability Engineer
Job Location	Dublin, Ireland
Shift Type	Morning
Job Type	Full Time
Gender	No Preference
Career Level	Intermediate
Experience	2 Years
Posted at	2023-12-11 10:59 am
Expires on	2025-05-29