Site Reliability Engineer (SRE)

Job List Number: OP_02

🏢 Full-time / Permanent

📍 Petaling Jaya, Selangor (On-site / Hybrid)

💰 RM 7,000 – RM 12,000

About the Role

As a Site Reliability Engineer, you will be responsible for ensuring the reliability, performance,
and scalability of the company’s key services.

You will design resilient system architectures, define SLOs/SLIs, automate operations, troubleshoot
complex issues, and collaborate with DevOps and development teams to improve system health
and reduce operational toil.

Your work will directly enhance service uptime, user experience, and engineering productivity.

Requirements

- Strong experience with Linux systems & distributed computing
- Skilled in troubleshooting application performance, connectivity & system issues
- Familiar with networking concepts and diagnostic techniques
- Hands-on experience with Bash/Shell scripting; Python/Go/Java is a plus
- Understanding of system architecture, scalability, and reliability design
- Knowledge of SRE principles: SLOs, SLIs, toil reduction, post-mortems
- Experience with AWS / Azure / GCP and cloud operations
- Strong problem-solving skills and a proactive mindset
- Ability to work independently & collaborate within multi-team environments
- Open to shift rotations
- Mandarin proficiency is an added advantage

Preferred Skills

- Observability: Prometheus, Grafana, Alertmanager, Loki, Jaeger, OpenTelemetry
- Kubernetes, Helm, service mesh (Istio/Linkerd)
- Apache Kafka, Flink, Spark, large-scale data pipelines
- Terraform, Ansible, CI/CD pipelines, IaC
- Multi-region cloud deployments
- Disaster recovery, incident response, chaos engineering

Responsibilities

- Monitor, maintain & improve system performance, reliability & uptime
- Design scalable and resilient architectures for mission-critical services
- Build automation tools to reduce manual tasks and operational toil
- Define, track & analyze SLOs and SLIs
- Lead post-mortem processes and implement long-term reliability improvements
- Collaborate with teams to establish system reliability best practices
- Troubleshoot issues involving databases, networks & deployment failures
- Ensure SLAs are met through timely incident resolution
- Identify performance bottlenecks & recommend enhancements
- Maintain documentation for incident handling and processes
- Improve monitoring solutions to detect issues proactively
- Support deployment and configuration of new services
- Participate in on-call rotations and respond to critical incidents
- Analyze logs & metrics to identify trends and improve systems

Salary and Benefits

- Competitive salary
- Medical & insurance coverage
- Opportunities to grow into Senior SRE / Cloud Architect roles
- Cloud certifications & training support
- Exposure to modern cloud architecture & high-reliability systems
- Collaborative, engineering-focused culture

About the Company

We are a cloud-focused engineering organization dedicated to platform reliability,
infrastructure scalability, and high-performance operations.

Our teams keep mission-critical applications stable, resilient, and fast, supporting
enterprises across various industries.
We emphasize SRE best practices, collaboration, and continuous improvement to create
a culture where engineers can innovate, automate, and improve system reliability.

If you enjoy solving complex platform issues, improving systems, and driving operational excellence,
this environment is built for you.
