Staff Site Reliability Engineer - Performance Engineering
Circle Internet Financial
- Toronto, ON
- Permanent
- Full-time
- Support multiple development teams with an agile, responsive CI/CD platform to deliver high-quality builds with measurable performance and quality;
- Build, maintain, improve, scale, and secure cloud infrastructure and resources using IaC tools (Terraform, CloudFormation, Ansible);
- Automate operational tasks via Go, Python, and serverless solutions (AWS Lambda, Kubernetes Jobs);
- Design, manage, and monitor Kubernetes clusters for multiple production workloads;
- Drive forward our blockchain infrastructure by creating and managing nodes across a wide variety of blockchains, including Algorand, Ethereum, Hedera, Flow, Solana, Stellar, and Tron;
- Participate in an on-call rotation to mitigate disruptions to production systems and conduct root cause analysis;
- Plan and test disaster recovery scenarios for a highly available microservices architecture;
- Collaborate with the Security team to create and maintain security-focused tools and frameworks and uphold a top-class security posture;
- Engage and mentor team members and help grow and scale the team.
- 4+ years in DevOps or SRE roles, with a focus on tooling, automation, and infrastructure on a major public cloud provider;
- Proficiency in coding and/or scripting in Go, Python, and Shell;
- At least 3 years of combined experience building and maintaining CI/CD platforms and supporting agile engineering teams in building microservices;
- Experience with:
  - Building Docker images and deploying containers in Kubernetes clusters;
  - Any modern CI/CD platform with complex gates and workflows;
  - Blue-Green, Canary, and A/B Testing deployment strategies;
  - Distributed blockchain systems, running and maintaining blockchain full nodes;
  - Database technologies (PostgreSQL, Redis, OpenSearch);
  - Migrating and transforming large, complex datasets from diverse sources, structures, and formats;
  - Data warehousing tooling and services (Apache Airflow, AWS DMS, Snowflake);
  - Network routing, DNS, load balancing, and edge networking;
  - APM, RUM, monitoring, and telemetry tools;
  - Helm charts and deploying and maintaining Kubernetes clusters;
  - Authoring and maintaining IaC with Terraform and using IaC to deploy resources in AWS, Azure, GCP, or any other public cloud provider;
- Strong skills around observability, troubleshooting, and performance solutions;
- Ability and eagerness to dive deep into understanding, debugging, and improving any layer of the tech stack;
- Strong communication skills and the ability to explain technical concepts to peers and stakeholders.
- 7+ years in DevOps or SRE roles, with a focus on tooling, automation, and infrastructure on a major public cloud provider;
- Experience leading teams technically on architecture and system design;
- Deep understanding of and experience with:
  - API design and REST principles;
  - Cloud services (AWS, Google Cloud, Microsoft Azure, etc.);
  - Containers and Kubernetes;
  - SQL databases and designing schemas;
- Deep focus on coding standards and code quality, with a desire for excellent test coverage.