Description:
We are seeking a highly motivated Site Reliability Engineer to join our client's Business Operations team.
You will support their developers during the application build phase by applying software engineering principles to operations, including operational design, automation, capacity planning, and monitoring, to deliver fault-tolerant, scalable products.
Responsibilities:
- Support daily operations with a strong focus on triage and root-cause analysis, understanding the business impact of their products and performing blameless post-mortems.
- Manage risk by tying all activities together under an overarching responsibility for compliance and risk mitigation across all environments.
- Align product and customer-focused priorities with operational needs by providing continuous feedback throughout the lifecycle.
- Serve as the primary contact responsible for ensuring application scalability, performance, and resilience.
- Practice sustainable incident response and blameless post-mortems while taking a holistic approach to problem-solving and reducing time to recover.
- Automate data-driven alerts to proactively escalate issues. Work with development teams to establish SLOs and improve reliability.
- Tackle complex development, automation, and business process problems. Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead our client in DevOps automation and best practices.
- Increase automation and tooling to reduce toil and manual intervention.
- Analyze the platform's ITSM activities and provide a feedback loop to development teams on operational gaps or resiliency concerns.
Requirements:
- Coding experience in one or more of the following: C++, Java, Python, Go
- Experience with algorithms, data structures, scripting, pipeline management, and software design.
- Experience with industry-standard CI/CD and build tooling such as Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, and Chef.
- Experience working across development, operations, and product teams to prioritize needs and build relationships is a must.
- Experience in an SRE role or related field.
- Background in cloud-native tooling and orchestration technologies (Kubernetes preferred).
- Experience with monitoring tools such as Splunk and Dynatrace.
- Experience with Java, J2EE, web services (SOAP/REST), and Spring/Spring Boot is a plus.
- Experience in production support environments and ITIL processes.