Site Reliability Engineer

Microsoft

Prague
Permanent employment
Full-time

Posted 14 days ago

We are seeking a Site Reliability Engineer with a robust background in systems engineering, data analytics, software development, and AI/ML to join our dynamic team. The ideal candidate will be responsible for ensuring the reliability, performance, and efficiency of our services, with a focus on continuous improvement and operational excellence.

Team Overview: Within the vast framework of M365 Office Engineering Direct (OED), our SRE team is instrumental to the success of Exchange Online. With the service spanning hundreds of components, our goal is clear: ensure unmatched service availability and continually elevate user satisfaction.

What We Do & Our Impact: Our approach is layered and precise. By implementing proactive engineering solutions, we identify and tackle incidents head-on, ensuring limited disruptions. Monitoring, both comprehensive and nuanced, remains our cornerstone, adeptly capturing anomalies beyond the scope of conventional systems. As swift diagnostics steer our course, we channel our efforts towards automation, efficiently managing the incident lifecycle from detection to resolution. Additionally, with a commitment rooted in understanding our users, we meticulously prioritize and execute Design Change Requests, ensuring Exchange Online's evolution aligns with user expectations.

The Future – Artificial Intelligence (AI) & Machine Learning (ML) in Focus: As we look to the horizon, the fusion of AI and ML with our SRE practices beckons a transformative era for Online Cloud Services in M365. We are in the initial stages of integrating predictive analytics to anticipate issues before they manifest, allowing us to stay a step ahead. Customized ML models are being developed to intelligently sift through vast data lakes, identifying patterns and correlations previously overlooked.
Our journey with AI and ML is not just about enhancement; it is about redefining reliability, precision, and the user experience in the M365 suite.

This role can be fully remote; the person can work from home 100% of the time.

Responsibilities:

Technical Knowledge and Domain-Specific Expertise
A core attribute we are looking for is an individual with strong aptitude for problem-solving, coupled with a proven track record in debugging complex systems and applications, showcasing an analytical mindset and meticulous attention to detail.
Researches and maintains deep knowledge of industry trends as well as advances in large-scale distributed systems and cloud technologies; identifies opportunities to create, implement, and/or optimally utilize new tools, technologies, and/or processes to solve ambiguous problems and improve product availability, reliability, efficiency, observability, and/or performance.
Drives the adoption of innovative solutions across engineering teams working with related products within an organization.
Applies advanced statistical and machine learning techniques to analyze large datasets and extract meaningful insights.
Has experience working with all service aspects of high-throughput, multi-tenant services; is able to understand and design workflows carefully, handle errors properly, and write clean, well-factored code with good tests and good maintainability.

Contributions to Development and Design
Engages with product engineering teams by partaking in code/design reviews, participating in on-call rotations and incident responses throughout product development and operations cycles; leverages end-to-end technical expertise on underlying systems/platforms and insights from engagements with product engineering teams and telemetry analyses to propose scalable improvements in code and designs with attention to customer/business objectives and incident prevention.

Driving Operational Excellence
Develops code, scripts, systems, or platforms that automate moderately complex but repetitive operations processes (e.g., monitoring, alerting, deploying products and updates, debugging) at scale; reviews existing automation code and scripts to evaluate reusability, extendibility, and scalability within an organization.
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of systems, platforms, or products operating at scale.
Contributes to the development of new tooling and/or predictive models to identify and test potential improvements in product development and/or operations and monitors the impact of changes on operations metrics (e.g., Time-to-X) within an organization.
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting complex issues, and deploying appropriate fixes to resolve root cause(s); alerts product teams, owners, and leadership to issues with major customer/business impact and escalates resolution of the overly complex, ambiguous, and impactful issues to include other engineering teams and/or subject matter experts as needed.
Shares details related to incidents and their resolution through post-mortem reports and during regular review meetings with more experienced Site Reliability Engineers (SREs) and members of product engineering teams.
Mentors and coaches less experienced engineers to help them identify and propose relevant solutions.

Qualifications:
Bachelor's or Master's degree in Computer Science, Data Science, AI, or a related field.
Software development experience: automation-related experience is most valued. Scripting languages such as Bash, Python, and PowerShell, or compiled languages such as C and C#, are most relevant, but others are acceptable.
Awareness of, and ability to reason about, modern software & systems architectures, including load-balancing, queueing, caching, distributed systems failure modes, microservices, and so on.
Associated troubleshooting skills, including the ability to follow RPC (Remote Procedure Call) call-chains across arbitrary network steps. Consequent understanding of monitoring in distributed systems.
Deep understanding of operating system level concepts such as processes, memory allocation, and the network stack; understanding of how applications are affected by these, and the ability to debug the resulting issues.
Experience with working in a team, including coordinating large projects, communicating well, and exercising initiative when presented with problems.
Practical experience running large scale online systems is always an advantage.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations, and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the .
