Leading AI Reliability Engineering Specialist

2 days ago


Dublin, Dublin City, Ireland beBeeReliability Full time €150,000 - €208,920
Job Title

We are seeking an experienced engineering leader to manage our Reliability Engineering team.

This team includes Software Engineers and Systems Engineers focused on defining and achieving reliability metrics for all of our internal and external products and services.

Responsibilities:
  • Lead and grow a team of reliability engineers responsible for large language model serving and training systems, ensuring high-availability infrastructure capable of supporting millions of external customers and high-traffic internal workloads.
  • Develop service level objectives (SLOs) that balance availability/latency with development velocity across the organization.
  • Oversee the design and implementation of comprehensive monitoring systems for availability, latency, and other critical metrics.
  • Guide your team in architecting high-availability language model serving infrastructure and lead the strategy for automated failover and recovery systems across multiple regions and cloud providers.
  • Establish and manage incident response processes for critical AI services, ensuring rapid recovery and systematic improvements.
  • Direct cost optimization initiatives for large-scale AI infrastructure, focusing on accelerator utilization and efficiency.
  • Partner with cross-functional teams to align reliability engineering efforts with broader company objectives.
  • Build a strong engineering culture focused on reliability, operational excellence, and innovation.
Requirements:
  • Bachelor's degree in a related field or equivalent experience.
  • Experience managing and scaling reliability or infrastructure engineering teams.
  • Deep technical knowledge of distributed systems observability and monitoring at scale.
  • Ability to effectively lead technical discussions while translating between ML engineers and infrastructure teams.
  • Excellent leadership and communication skills, with the ability to influence at all levels.
Preferred Qualifications:
  • Managed teams operating large-scale model training or serving infrastructure (>1,000 GPUs).
  • Hands-on experience with ML hardware accelerators (GPUs, TPUs, Trainium, etc.).
  • An understanding of ML-specific networking optimizations and their operational implications.
  • Led teams through major reliability transformations or infrastructure migrations.
Benefits:

You will have the opportunity to work with a highly skilled team and contribute to the development of cutting-edge AI technology.

Others:

Please submit your application if you meet the requirements and have the necessary skills and experience.


  • AI Agent Engineer

    1 week ago


    Dublin, Dublin City, Ireland Naptha AI Full time

    About the Role: We are seeking an AI Agent Engineer to join our team at Naptha AI, where you'll help build and test AI agents using our interoperability platform. This role is perfect for developers who are passionate about AI and eager to gain hands-on experience with the latest agent frameworks and...


  • Dublin, Dublin City, Ireland Anthropic Full time

    Engineering Manager, AI Reliability Engineering. Dublin, IE. About Anthropic: Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working...


  • Dublin, Dublin City, Ireland beBeeReliability Full time €100,000 - €200,000

    Job Description: We are seeking a talented Reliability Engineer to join our team. In this role, you will be responsible for developing and implementing monitoring systems, designing high-availability infrastructure, and leading incident response efforts. The ideal candidate will have extensive experience with distributed systems observability and monitoring at...


  • Dublin, Dublin City, Ireland beBeeEngineering Full time €90,000 - €120,000

    Reliability Engineering Leader. This position is a unique opportunity to lead a team of engineers focused on defining and achieving reliability metrics for internal and external products and services. As a reliability engineering leader, you will oversee the development of service level objectives that balance availability/latency with development velocity...


  • Dublin, Dublin City, Ireland Anthropic Full time

    Staff Software Engineer, AI Reliability Engineering. About...


  • Lead AI Engineer

    1 week ago


    Dublin, Dublin City, Ireland Cpl Full time

    Cpl have an exciting opportunity for a Lead AI Engineer to shape the future of intelligent systems. This role sits at the intersection of AI research and large-scale engineering, focusing on the development of agentic AI platforms. You'll be driving technical vision, mentoring talent, and building scalable systems that push the boundaries of what's possible...


  • Dublin, Dublin City, Ireland Anthropic Full time

    Staff Software Engineer, AI Reliability Engineering. Dublin, IE. About Anthropic: Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working...


  • Dublin, Dublin City, Ireland beBeeEngineer Full time €60,000 - €90,000

    Senior Site Reliability Engineer. We are building the world's leading AI-first cloud infrastructure company. Our vertically integrated, purpose-built AI infrastructure solutions are trusted by Fortune 500 companies to power their most advanced AI applications. We are redefining AI cloud infrastructure with a mission to align computing with the future of the...