
Site Reliability Engineer, AI/ML Platforms
- San Jose, CA
- Permanent
- Full-time
- Identify and implement methodologies and solutions to increase reliability, scalability, security, and efficiency.
- Ensure the highest uptime and Quality of Service (QoS) for Adobe’s customers through operational excellence.
- Define service level objectives (SLOs) and indicators (SLIs) to represent and measure service quality.
- Support and maintain globally distributed, multi-cloud (public and/or private) environments.
- Automate common, repeatable tasks at a large scale to streamline operational procedures.
- Identify areas to improve service resiliency through techniques such as chaos engineering and performance/load testing.
- Coordinate with other Adobe platform teams and service providers (primarily AWS) to innovate on Generative AI as a Service.
- A Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related field, and 5+ years of relevant industry experience.
- You excel in undefined environments and get excited about finding pragmatic solutions to complex technical or organizational challenges.
- You keep up with industry trends and continually grow your knowledge and skills to solve technical problems.
- Experience in building and scaling distributed systems, as well as experience with containerization and orchestration technologies like Kubernetes.
- Production-level expertise with container orchestration engines (e.g., Kubernetes) and a proven understanding of modern continuous development techniques and pipelines (IaC, CI/CD, ArgoCD, Git).
- Fundamental programming skills, ideally with practical experience in one (and preferably both) of the following languages: Python, Go.
- Good knowledge of infrastructure configuration management tools like Ansible and Terraform.
- Experience in using observability and tracing-related tools like InfluxDB, Prometheus, and Elastic Stack.
- An understanding of AI/ML, including ML frameworks, public cloud, and commercial AI/ML solutions. Familiarity with PyTorch, SageMaker, Hugging Face, NVIDIA TensorRT, or OpenAI Triton is a plus.