
Platform SRE Engineer
- Singapore
- Permanent
- Full-time
- Develop monitoring and onboarding guidelines for various applications using the observability platform stack, ensuring accurate monitoring and data collection.
- Drive observability standards, best practices, operations, and processes for the enterprise across AppDynamics and other observability tools.
- Automate routine tasks and reporting processes in AppDynamics and other observability tools using APIs and scripting, reducing manual effort and improving efficiency.
- Identify and resolve performance issues through detailed analysis of transaction traces, application logs, and system metrics.
- Collaborate with stakeholders to define performance metrics and monitoring requirements aligned with business goals.
- Contribute to internal knowledge bases, create documentation, and share insights with the team to promote a culture of learning and collaboration.
- Design and implement monitoring solutions to track application performance, identifying bottlenecks and optimising system efficiency.
- Conduct performance tuning and capacity planning to ensure applications meet scalability and reliability requirements.
- Develop custom dashboards and reports to provide actionable insights and drive decision-making processes.
- Collaborate with development and operations teams to integrate the observability platform stack with CI/CD pipelines and other DevOps tools.
- Configure and fine-tune alerts to proactively detect and address performance issues before they impact end-users.
- Continuously review and enhance monitoring processes and methodologies to improve efficiency and effectiveness.
- Work with application teams to develop long-term monitoring strategies that align with business goals and technology roadmaps.
- Create data retention policies and role-based access controls (RBAC) to manage user permissions.
- Perform application maintenance and patching, upgrade controller and agent versions, and ensure components stay within their EOS/EOL support windows.
- Ensure continuous uptime of applications and services.
- Ensure the platform has no outstanding security or audit issues.
- Cover all areas in application and infrastructure operations of the platform.
- Strong communication skills and the ability to explain protocols and processes to the team and management.
- A passion for learning and using new technologies from the open-source community.
- A passion for coding.
- Minimum 10 years of IT work experience.
- Working knowledge of AppDynamics, the ELK Stack, Grafana, and OpenTelemetry (OTel).
- In-depth experience with Unix/Linux and with Shell/Python scripting, with attention to quality, scalability, and extensibility.
- Experience in quickly triaging and troubleshooting application problems in monitoring tools, using techniques such as transaction snapshots, diagnostic sessions, and data collectors.
- Knowledgeable and experienced in SRE (Site Reliability Engineering) practices covering monitoring, observability, performance management, automation, and resiliency.
- Knowledge of Confluent Kafka, Prometheus, and other APM tools (Dynatrace, Datadog, New Relic, Splunk) is a plus.
- Knowledge of AI/ML capabilities to automate RCAs and shorten MTTR when issues arise.
- Good understanding of network routing, load balancing, and networking protocols, including a baseline knowledge of TCP/IP, HTTP, and DNS.
- Ability to contribute to discussions on design and strategy.
- Adequate knowledge of database systems (RDBMS, MariaDB, SQL, NoSQL), object-oriented programming, and web application development.
- Good problem diagnosis and creative problem-solving skills.
- Experience with Node.js or Spring Boot is a plus.