Domain Lead - Site Reliability Management (REF4372N) (Budapest)


 

The largest ICT employer in Hungary, Deutsche Telekom IT Solutions (formerly IT-Services Hungary, ITSH), is a subsidiary of the Deutsche Telekom Group. Established in 2006, the company provides a wide portfolio of IT and telecommunications services with more than 5000 employees. ITSH received the Best in Educational Cooperation prize from HIPA in 2019, was recognized as one of the most attractive workplaces in PwC Hungary’s independent survey in 2021, and was named the Most Ethical Multinational Company in 2019. The company continuously develops its four sites in Budapest, Debrecen, Pécs and Szeged and is looking for skilled IT professionals to join its team.

Domain Lead - Site Reliability Management (REF4372N)

The Domain Lead - Site Reliability Management is a senior leadership role responsible for the end-to-end reliability, resilience, and operational excellence of all IT systems across a large-scale telecommunications organization. This executive will lead a distributed team of Site Reliability Managers (SRMs) embedded throughout the company, setting the strategic direction for reliability engineering and ensuring the stability of critical business services. The role is pivotal in driving a culture of continuous improvement, proactive risk management, and blameless learning throughout the IT organization.

Purpose of the role:

  • Serve as the organization's chief stability and reliability authority, accountable for the availability, performance, and recoverability of all IT services.
  • Lead the design and execution of a comprehensive reliability strategy, aligning with business objectives and regulatory requirements.
  • Foster a company-wide culture of resilience, incident prevention, and operational transparency.

Key Responsibilities

  • Strategic Leadership: Define and champion the company’s reliability vision, policies, and maturity roadmap. Set and monitor organizational SLOs, SLIs, and error budgets.
  • Team Management: Direct and mentor a distributed team of SRMs, ensuring consistent standards, knowledge sharing, and professional growth across domains.
  • Reliability Governance: Oversee domain-wide stability programs, coordinate cross-functional reliability initiatives, and ensure alignment with business impact priorities.
  • Incident Command: Act as the executive escalation point during major incidents, ensuring effective incident response, root cause analysis, and implementation of systemic fixes.
  • Observability & Monitoring: Ensure comprehensive observability across all platforms, driving adoption of modern monitoring tools and practices to enable proactive detection and resolution.
  • Infrastructure & Deployment: Oversee the reliability of CI/CD pipelines, infrastructure as code practices, and deployment strategies (e.g., canary releases, blue-green deployments).
  • Resilience Engineering: Lead organization-wide initiatives in chaos engineering, failure testing, and capacity planning to minimize blast radius and prevent outages.
  • Change Management: Guide risk assessment and approval of major releases and configuration changes, potentially replacing legacy Change Challenger models.
  • Stakeholder Collaboration: Partner with engineering, product, and business leaders to align reliability goals, communicate risk, and drive adoption of best practices.
  • Culture & Learning: Promote a blameless postmortem culture, facilitate reliability workshops, and ensure continuous learning and improvement.

Key Qualifications:

  • Proven executive experience in SRE, IT operations, or large-scale infrastructure leadership within complex, distributed environments.
  • Deep technical expertise in SRE principles, incident management, observability, and cloud/hybrid architectures (e.g., AWS, Azure, GCP).
  • Demonstrated success in leading cross-functional teams, driving organization-wide stability programs, and managing high-stakes incidents.
  • Strong familiarity with modern observability tools (Prometheus, Grafana, ELK, Datadog) and deployment frameworks (Kubernetes, Terraform, Ansible).
  • Exceptional communication skills, with the ability to influence senior stakeholders and coach both technical and non-technical teams.
  • Experience with ITIL, DevOps, and structured Change, Incident, and Problem Management frameworks.

Success Metrics:

  • Reduction in critical incidents, IBIs, and Mean Time to Repair (MTTR).
  • Measurable improvements in observability, monitoring coverage, and SLO adherence.
  • Implementation and tracking of preventive actions and systemic fixes.
  • Organization-wide visibility and mitigation of stability risks.
  • Delivery and execution of a reliability roadmap, with clear progress metrics.

Core Knowledge Areas:

  • SRE principles (error budgets, toil reduction, SLOs/SLIs)
  • Incident lifecycle and blameless postmortems
  • Observability and monitoring (metrics, logging, alerting)
  • Infrastructure as code, CI/CD, deployment best practices
  • Chaos engineering, load and failure testing
  • Cloud and hybrid system design, geo-redundancy
  • Governance, communication, and cross-domain collaboration

Place of work

Budapest

