Job description
Help Us Build The Future of Travel
At Airalo, we're making it easier for people to stay connected wherever they travel. As the world's first eSIM store, we help millions of travelers access affordable mobile data in 200+ countries and regions around the world.
Today, we're a team of 400+ people across 60+ countries, building a product used by travelers every day. We've grown quickly, but we've worked hard to keep what matters: trust, ownership, and the freedom for people to do great work without unnecessary layers or bureaucracy.
We're fully remote by design, genuinely global, and united by a shared mission to make travel simpler for everyone.
Your Next Destination
- Location: Remote, anywhere in Spain or the UK.
- Contract:
  - Spain: Full-time, permanent (contrato indefinido) via Deel (our employer of record in Spain)
  - UK: Full-time, permanent
- Benefits: Learn more about our benefits at https://airalo-public.notion.site/Benefits-25396a97ffca81fb9bc1f0be479f1be3?pvs=74
- Languages: English is our main working language day to day, so you'll need to be comfortable communicating in it both in meetings and async.
We are looking for a Senior Site Reliability Engineer to join our growing engineering team.
We are a company that values SRE principles and practices. We believe in empowering our SREs to make data-driven decisions, automate operational tasks, and continuously improve the reliability of our systems. We foster a blameless culture where everyone is encouraged to learn from mistakes and share knowledge. If you are passionate about building and maintaining highly reliable systems, we would love to hear from you!
What you’ll do:
- Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
- Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
- Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them.
- Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response.
- Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights.
- Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
- Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC.
- Continuously evaluate and optimize system performance, capacity, and cost efficiency.
- Participate in the on-call rotation and refine the on-call experience to reduce alert fatigue, improve MTTR, and keep the rotation sustainable.
Must-haves:
- Bachelor’s degree in Computer Engineering or a similar discipline.
- 5+ years of experience as a Site Reliability Engineer or in a similar role.
- 3+ years of experience with AWS services, including strong knowledge of container orchestration.
- 2+ years of Kubernetes experience.
- Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry.
- Experience leading incident management and complex postmortem analysis.
- Experience with, and interest in, managing infrastructure as code (Terraform).
- Experience with chaos engineering and other techniques for testing system resilience.
- Experience with CI/CD tools such as GitHub Actions for automated delivery.
- Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
- Experience with event-driven architectures (SNS, SQS, etc.).
- Ability to work independently and collaboratively in a fast-paced environment.
- Team player and open to new ideas.
- Good communication skills and fluency in English.
Good to have:
- Prior experience with Scrum and other agile methods.
- Certification in relevant areas such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or similar.
- Prior experience with Telco Core Networks (e.g., 5G/LTE Packet Core, IMS, Signaling) and low-latency networking.
- Experience with AI-driven SRE tools for anomaly detection and improvements.
- Contributions to open-source SRE projects or communities.
- Prior work experience in telecommunications.
- Deep understanding of eSIM and GSMA related technologies and services.
If you are interested in this position, please apply via the link.
This job post has been translated by AI and may contain minor differences or errors.