DevOps Engineer (AI Infrastructure)
OOm Pte Ltd
Job Description
Role Overview
You will be responsible for setting up and managing the CI/CD pipelines, infrastructure automation, and cloud environments that power our AI/ML workflows. This role is ideal for someone who thrives in fast-paced environments and is excited by the challenge of enabling scalable AI product delivery.
Key Responsibilities
- Design, implement, and maintain CI/CD pipelines for AI models, APIs, and supporting applications.
- Set up and manage cloud infrastructure (AWS, GCP, or equivalent) with a strong focus on scalability, cost optimization, and security.
- Support containerized environments using Docker and Kubernetes (EKS, GKE, etc.).
- Work closely with AI engineers and software developers to automate data pipelines, model training/deployment, and monitoring.
- Implement and maintain infrastructure as code (IaC) using tools like Terraform or Pulumi.
- Monitor system performance, troubleshoot production issues, and ensure system reliability and uptime.
- Enforce best practices in DevOps, security, versioning, and documentation.
Job Requirements
- 3+ years of DevOps, Site Reliability Engineering, or relevant infrastructure experience.
- Strong hands-on experience with cloud providers (AWS and GCP preferred).
- Solid understanding of CI/CD principles and experience with tools like GitHub Actions, GitLab CI, or Jenkins.
- Experience with Docker, Kubernetes, and container orchestration.
- Familiarity with IaC tools such as Terraform, CloudFormation, or Pulumi.
- Working knowledge of networking, security, and access control in cloud environments.
- Exposure to machine learning or AI deployment workflows is a strong plus.
- Comfortable collaborating with cross-functional teams including data scientists, backend engineers, and product managers.
Nice to Have
- Experience deploying AI/ML pipelines with tools like MLflow, Airflow, or Kubeflow.
- Understanding of GPU/TPU setup and auto-scaling strategies for training/inference workloads.
- Experience with monitoring and logging using Prometheus, Grafana, CloudWatch, or similar tools.
Why Join Us
- Work on real AI products with tangible impact.
- Autonomy to shape and optimize our AI infrastructure.
- A collaborative and ambitious team, with leadership open to innovation and experimentation.
- Opportunities for growth and cross-disciplinary exposure across AI, web, and product development.
Work Location