Description:
We are seeking a highly skilled Technical Support Engineer specializing in Machine Learning (ML) operations, Kubernetes, container technologies, and Run:AI. In this role, you will provide technical and operational support to customers who leverage GPU computing platforms to optimize and manage AI/ML workloads, particularly in Kubernetes-based environments. The ideal candidate has deep expertise in Kubernetes orchestration and GPU management, along with a solid understanding of how these technologies support AI/ML operations at scale.
Key Responsibilities
- Kubernetes Orchestration & Resource Management: Serve as the subject matter expert for Kubernetes and container orchestration. Guide customers through the design and deployment of Kubernetes clusters tailored for AI/ML use cases, helping them manage workloads effectively through Run:AI. Ensure optimal resource allocation, including GPU sharing, node management, and job scheduling across clusters (see the GPU scheduling sketch after this list).
- Cluster Monitoring & Optimization: Monitor and tune Kubernetes clusters so they stay optimized for AI/ML workloads. Guide customers in managing Kubernetes autoscaling, resource quotas (see the quota sketch after this list), and performance monitoring of distributed ML models running on Kubernetes clusters via the Run:AI platform.
- GPU Troubleshooting & Incident Response: Diagnose and resolve complex issues involving GPU driver and software dependencies, NVIDIA toolkit errors, and GPU hardware failures (see the GPU health-check sketch after this list).
- Run:AI Platform Support: Provide expert support for the Run:AI platform, assisting customers with the deployment, configuration, and management of Kubernetes clusters that handle AI/ML workloads. This includes setting up the platform, configuring resource pools (GPU, CPU), and optimizing Kubernetes namespaces to ensure proper orchestration of workloads.
- Workload Optimization on Kubernetes: Assist customers in optimizing dynamic resource allocation for their AI/ML workloads by using the Run:AI scheduler alongside Kubernetes-native tooling (see the scheduler sketch after this list). Help manage job preemption, scheduling priorities, and horizontal scaling of workloads across clusters.
- Kubernetes Troubleshooting & Incident Response: Diagnose and resolve complex issues related to Kubernetes cluster management, including pod failures, node connectivity issues, and namespace misconfigurations (see the triage sketch after this list). Provide support in handling incidents such as job contention, GPU misallocation, and failed containerized workloads, ensuring smooth operation across the entire Kubernetes environment.
- Integration Support: Help customers integrate Run:AI into their existing Kubernetes-based ML infrastructure. Ensure seamless operation of AI/ML pipelines, covering data flow, distributed training, and model deployment. Troubleshoot issues arising from the interaction between Run:AI, Kubernetes, and other ML tools (e.g., TensorFlow, PyTorch, Kubeflow).
- Security and Best Practices in Kubernetes: Advise customers on security best practices for Kubernetes clusters handling sensitive ML workloads, including secure pod communications, role-based access control (RBAC; see the RBAC sketch after this list), and resource isolation for multi-tenant clusters. Ensure Kubernetes and containerized environments are secure and compliant with organizational policies.
- Collaboration with HQ: Work closely with the engineering and product teams in HQ, providing feedback on Kubernetes-related issues, cluster optimization features, and improvements to the Run:AI platform. Escalate complex issues and contribute to ongoing platform development.
- Training & Documentation: Develop training materials and deliver technical workshops on using Run:AI in Kubernetes environments. Maintain up-to-date documentation on best practices for configuring and managing Kubernetes clusters for AI/ML workloads, focusing on high availability, performance, and security.
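The sketches that follow illustrate, with the official Kubernetes Python client, the kinds of tasks referenced in the list above. All names, namespaces, and images are hypothetical placeholders, and each snippet is a minimal illustration under stated assumptions, not a production recipe. First, the GPU scheduling sketch: NVIDIA's device plugin exposes GPUs to Kubernetes as the extended resource nvidia.com/gpu, which is the primitive behind GPU allocation and sharing.

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="ml-team"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="pytorch/pytorch:latest",  # placeholder image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # GPUs are an extended resource; setting a limit on an
                    # extended resource implicitly sets the request too.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-team", body=pod)
```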
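The quota sketch: ResourceQuota objects are one of the native tools for keeping multi-tenant namespaces within budget; the limits shown here are arbitrary.

```python
from kubernetes import client, config

config.load_kube_config()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="ml-team-quota", namespace="ml-team"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "32",
            "requests.memory": "128Gi",
            "requests.nvidia.com/gpu": "8",  # extended resources are quota'd as requests.<name>
        }
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(namespace="ml-team", body=quota)
```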
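The GPU health-check sketch: a first-line probe that shells out to nvidia-smi, useful for separating driver/toolkit issues from scheduler-level ones.

```python
import subprocess

# Query driver version and per-GPU load; field names come from `nvidia-smi --help-query-gpu`.
result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=index,name,driver_version,utilization.gpu,memory.used",
        "--format=csv,noheader",
    ],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # A non-zero exit frequently signals a driver/toolkit version mismatch
    # or a GPU that has dropped off the bus, rather than a scheduling problem.
    print("nvidia-smi failed:", result.stderr.strip())
else:
    print(result.stdout)
```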
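The scheduler sketch: per Run:AI's documentation, a workload is handed to the Run:AI scheduler by setting the pod's schedulerName; the gpu-fraction annotation and the runai-&lt;project&gt; namespace convention shown here are assumptions to verify against the installed platform version.

```python
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="frac-train",
        namespace="runai-ml-team",  # assumption: Run:AI projects map to runai-<project> namespaces
        annotations={"gpu-fraction": "0.5"},  # assumption: request half of one GPU
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",  # route the pod to Run:AI's scheduler, not the default
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="tensorflow/tensorflow:latest-gpu",  # placeholder image
                command=["python", "train.py"],
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="runai-ml-team", body=pod)
```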
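The triage sketch: listing non-running pods and recent Warning events covers many of the incident types above, such as failed scheduling, OOM kills, and image pull errors.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# First pass: surface pods stuck outside Running/Succeeded.
for pod in v1.list_namespaced_pod(namespace="ml-team").items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(pod.metadata.name, pod.status.phase, pod.status.reason)

# Second pass: recent Warning events (FailedScheduling, OOMKilled, pull errors, ...).
for ev in v1.list_namespaced_event(namespace="ml-team", field_selector="type=Warning").items:
    print(ev.last_timestamp, ev.involved_object.name, ev.reason, ev.message)
```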
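The RBAC sketch: a namespaced read-only role that lets data scientists inspect their jobs without modifying shared resources.

```python
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# Read-only access to pods and their logs within one namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace="ml-team"),
    rules=[
        client.V1PolicyRule(
            api_groups=[""],
            resources=["pods", "pods/log"],
            verbs=["get", "list", "watch"],
        )
    ],
)

binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="pod-reader-binding", namespace="ml-team"),
    subjects=[client.RbacV1Subject(kind="User", name="data-scientist@example.com")],  # named V1Subject in older client releases
    role_ref=client.V1RoleRef(
        api_group="rbac.authorization.k8s.io", kind="Role", name="pod-reader"
    ),
)

rbac.create_namespaced_role(namespace="ml-team", body=role)
rbac.create_namespaced_role_binding(namespace="ml-team", body=binding)
```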
Minimum Qualifications
- 4+ years of IT-related work experience with a Bachelor's degree.