Job Description
What you get to do in this role:
- Provide relief and sustainable resolution to issues within our infrastructure.
- Use your experience in software development, systems engineering and networking to proactively prevent repeatable issues.
- Drive initiatives with partner teams to improve the reliability and performance of the infrastructure through improved system design.
- Contribute to Configuration Management and Infrastructure as Code for our global private cloud (Puppet).
- Develop tools in Ansible, Python, Bash, and JavaScript to replace manual work and improve the customer maintenance experience.
- Drive enhancements and bug fixes for large-scale automation projects such as patching and provisioning.
- Design and implement procedures for maintenance tasks that automation and tooling cannot handle; drive resolution of root causes with internal team members.
- Prepare new ServiceNow products and services for production readiness with design review, feedback to engineering teams, training, and testing.
- Use broad knowledge and experience of systems administration and networking principles to proactively prevent and address incidents while constantly improving documentation.
- Participate in escalations and Root Cause Analysis of issues in the global ServiceNow infrastructure.
- Troubleshoot database backup and restore failures and perform database migrations.
- Support operation of a wide variety of infrastructure services including Machine Learning and Prediction, Cloudera Big Data clusters, Kafka and RabbitMQ messaging, database encryption, email infrastructure at scale, DNS, Puppet, Elasticsearch, F5 BIG-IP, and more.
Qualifications
To be successful in this role you have:
- A strong background in Linux systems administration (CentOS/Red Hat) and engineering, with an understanding of the components of cloud infrastructure, including hardware platforms, OS, applications, databases (MariaDB), networks, and web and application servers (Apache/Tomcat).
- Prior experience in Site Reliability Engineering/DevOps/System Administration and managing large-scale server infrastructure in a cloud computing or MSP setting is highly desirable.
- Solid experience with Linux (Red Hat and/or CentOS).
- Working-level knowledge of one of: Python, Bash, or JavaScript.
- ServiceNow development experience is desirable.
- Strong experience with service troubleshooting in a production environment covering web front-end/application, systems, databases, and networks.
- Previous direct exposure to administering fundamental internet services (DNS, Mail, Apache/Tomcat) with a good understanding of the LAMP stack.
- Familiarity with administering MySQL, Oracle, MariaDB, or similar technologies; proficiency preferred.
- Familiarity with networking technologies such as routing, switching, and load balancing (VPN exposure is a huge plus).
- Experience with monitoring and analysis of systems and network performance and availability, as well as with configuration management platforms (Nagios/Icinga, Puppet, Ansible, Splunk), is desirable.
- Understanding of the ITIL v3 framework and how it applies to incidents, problems, and changes.
- Good communication skills and the ability to work well in a collaborative team environment.