Staff SRE Engineer
Circles.Life
- India
- Permanent
- Full-time
- Own infrastructure architecture and non-functional requirements, ensuring they fit into a cohesive vision aligned with the rest of the platform's Technology roadmap for launch.
- Propagate Site Reliability Engineering culture across the organization by sharing industry best practices, standards, approaches, documentation, and code with other engineering teams
- Design, test, and troubleshoot CI/CD pipelines for containerized applications from build through deployment.
- Set up a continuous delivery and deployment pipeline integrated with the release workflow to support release orchestration.
- Troubleshoot multi-layer and containerized applications deployed in cloud infrastructure.
- Apply automation and software to any manual or mechanical tasks, or parts of the system, that would benefit from it.
- Troubleshoot complicated cross-platform issues spanning OS, networking, and database layers in a cloud-based SaaS environment; handle live production incidents, debug application and infrastructure issues, and follow and implement SRE best practices.
- Conduct system discovery and analysis, and develop improvements to system software performance, availability, and reliability.
- Design, write, ship, and motivate the implementation of solutions that increase observability, product reliability, and organizational efficiency.
- Collaborate closely with software engineers and testers to ensure the system meets non-functional requirements such as performance, security, and availability.
- Document system knowledge as you acquire it over time, create runbooks, and ensure critical system information is readily available to those who need it.
- Maintain and monitor the deployment and orchestration of servers, Docker containers, Kubernetes, and general backend infrastructure.
- Keep up to date on security and proactively identify, diagnose, and solve complex security issues.
- Participate in the on-call roster to provide weekend support.
- 10+ years of working experience in infrastructure support and CI/CD platforms, leveraging DevOps, SRE, and Agile methodologies.
- 5+ years of experience designing, testing, and implementing CI/CD pipelines to automate build, deployment, and code promotion.
- 5+ years of experience writing automation scripts and CI/CD pipelines, and automating routine tasks using Groovy or Python to eliminate human dependencies.
- Prior experience troubleshooting CI/CD pipeline issues for containerized and multi-layer applications deployed in GCP or AWS.
- Ability to dive deep into a problem statement and apply structured troubleshooting to identify the root cause and deliver strategic solutions.
- Experience with CI/CD in cloud environments, container technology (Docker, Kubernetes, Docker Swarm, Helm), and DevOps tooling (Git + CI/CD pipelines).
- Experience as a Linux systems administrator (e.g. Ubuntu, Red Hat) and with command-line system administration tools such as Bash, Vim, and SSH.
- Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools such as Grafana/Prometheus, Datadog, Nagios, and New Relic.
- Extensive expertise in core infrastructure components: storage, systems, and/or networking.
- Adaptable to change and able to work independently with a one-team attitude.
- Ability to communicate clearly with different stakeholders.
- Strong presentation skills, including preparing PowerPoint presentations and architecture diagrams.
- Capable of delivering multiple initiatives concurrently while maintaining a high level of attention to detail.
- Manage and prioritize work effectively with minimal supervision.
- Provide timely and relevant stakeholder updates, project status, and vital data points.
- Ability to learn new technologies as needed to provide the best solutions.
- Strong problem-analysis skills to dive deep, understand root causes, and provide strategic or interim solutions.
- Sound analytical skills to come up with supporting data points.
- Solid mathematical skills to validate results programmatically.
- Understanding of TCP/IP networking, including familiarity with concepts such as the OSI model.
- Understanding of Internet protocols and applications such as SMTP, DNS, HTTP, SSH, and SNMP.
- Understanding of ELK, Redis, RabbitMQ, Kafka, and etcd.
- Hands-on experience in writing infrastructure as code (IaC), configuration management as code (CMaC), and policy as code (PoaC) is a plus.
- Kubernetes CKA or CKAD certification is nice to have.
- AWS or GCP DevOps-related certifications are nice to have.
- GCP or AWS cloud architecture certification (associate or professional) is nice to have.