
Site Reliability Engineer (SRE)

Job List Number: OP_02

Full-time / Permanent

Petaling Jaya, Selangor (On-site / Hybrid)

RM 7,000 – RM 12,000

About the Role

As a Site Reliability Engineer, you will be responsible for ensuring the reliability, performance,
and scalability of the company’s key services.

You will design resilient system architectures, define SLOs/SLIs, automate operations, troubleshoot
complex issues, and collaborate with DevOps and development teams to improve system health
and reduce operational toil.

Your work will directly enhance service uptime, user experience, and engineering productivity.

Requirements

  • Strong experience with Linux systems & distributed computing

  • Skilled in troubleshooting application performance, connectivity & system issues

  • Familiar with networking concepts and diagnostic techniques

  • Hands-on experience with Bash/Shell scripting; Python/Go/Java is a plus

  • Understanding of system architecture, scalability, and reliability design

  • Knowledge of SRE principles: SLOs, SLIs, toil reduction, post-mortems

  • Experience with AWS / Azure / GCP and cloud operations

  • Strong problem-solving skills and a proactive mindset

  • Ability to work independently & collaborate within multi-team environments

  • Open to shift rotations

  • Mandarin proficiency is an added advantage

Preferred Skills

  • Observability: Prometheus, Grafana, Alertmanager, Loki, Jaeger, OpenTelemetry

  • Kubernetes, Helm, service mesh (Istio/Linkerd)

  • Apache Kafka, Flink, Spark, large-scale data pipelines

  • Terraform, Ansible, CI/CD pipelines, IaC

  • Multi-region cloud deployments

  • Disaster recovery, incident response, chaos engineering

Responsibilities

- Monitor, maintain & improve system performance, reliability & uptime
- Design scalable and resilient architectures for mission-critical services
- Build automation tools to reduce manual tasks and operational toil
- Define, track & analyze SLOs and SLIs
- Lead post-mortem processes and implement long-term reliability improvements
- Collaborate with teams to establish system reliability best practices
- Troubleshoot issues involving databases, networks & deployment failures
- Ensure SLAs are met through timely incident resolution
- Identify performance bottlenecks & recommend enhancements
- Maintain documentation for incident handling and processes
- Improve monitoring solutions to detect issues proactively
- Support deployment and configuration of new services
- Participate in on-call rotations and respond to critical incidents
- Analyze logs & metrics to identify trends and improve systems

Salary and Benefits

- Competitive salary
- Medical & insurance coverage
- Opportunities to grow into Senior SRE / Cloud Architect roles
- Cloud certifications & training support
- Exposure to modern cloud architecture & high-reliability systems
- Collaborative, engineering-focused culture

About the Company

We are a cloud-focused engineering organization dedicated to platform reliability,
infrastructure scalability, and high-performance operations.

Our teams ensure mission-critical applications remain stable, resilient, and fast, supporting
enterprises across various industries.
We emphasize SRE best practices, collaboration, and continuous improvement to create
a culture where engineers can innovate, automate, and elevate system reliability.

If you enjoy solving complex platform issues, improving systems, and driving operational excellence,
this environment is built for you.
