Job Description
Program Manager – Site Reliability Engineering (Cloud Native Platform Team)
Role Summary
The Program Manager will drive day-to-day operations of the Site Reliability Engineering (SRE) team, ensuring alignment with organizational goals for reliability, scalability, and operational excellence. This role requires a strong technical background in SRE practices and proven program management expertise to lead cross-functional initiatives, optimize processes, and deliver measurable business and operational value.
Key Responsibilities
1. Operational Leadership
o Drive adoption of SRE best practices such as error budgets, SLIs/SLOs, and automation to reduce toil.
o Ensure compliance with security, privacy, and regulatory standards in all reliability initiatives.
2. Program Management
o Define program scope, objectives, and success criteria for reliability initiatives.
o Develop and maintain quarterly roadmaps for SRE projects in collaboration with platform engineering teams.
o Track progress, risks, and dependencies across multiple projects using tools like JIRA and Confluence.
o Facilitate communication between SRE, development, and leadership teams to ensure transparency and alignment.
3. Performance Measurement
o Establish and monitor KPIs for reliability and operational efficiency.
o Prepare executive dashboards and reports to translate technical metrics into business impact narratives.
o Lead continuous improvement initiatives based on data-driven insights.
4. Stakeholder Engagement
o Act as the primary liaison between SRE and other teams (Product, Engineering and Delivery-SOC).
o Influence decision-making at all levels through clear communication and structured reporting.
Performance Measurement Parameters
Incident Metrics:
o Mean Time to Detect (MTTD)
o Mean Time to Respond (MTTR)
o Mean Time to Recovery (MTTR)
o Incident Frequency and Severity
Change Management:
o Change Failure Rate
o Change Success Rate
Reliability Metrics:
o System Uptime / Availability
o Service Level Objective Achievement Percentage
Operational Efficiency:
o Automation Rate
o On-call Burden Reduction
Measurement Matrix for Leadership Presentation
Use a dashboard approach combining:
o Latency, Traffic, Errors, Saturation.
o Monthly/Quarterly trends on SLOs.
o Incident Heatmaps: Highlighting root causes and resolution times.
o Business Impact Metrics: Cost savings, risk reduction, and ROI from reliability improvements.
Tools: Datadog
Experience Requirements
Technical Background:
o Prior hands-on experience as a Site Reliability Engineer or in DevOps roles.
o Strong understanding of cloud-native architectures (Kubernetes, microservices, distributed systems).
Program Management Expertise:
o 5+ years in program or technical project management.
o Proven ability to manage cross-functional initiatives in fast-paced environments.
o Familiarity with Agile methodologies and tools (JIRA, Confluence).
Leadership & Communication:
o Experience presenting technical and operational metrics to executive leadership.
o Strong stakeholder management and negotiation skills.
Certifications: PMP, SAFe, or SRE Foundation – SAFe preferred
This job posting was translated by AI and may contain some minor differences or errors.