
AI Infrastructure Reliability Manager
2 days ago
This position is a unique opportunity to lead a team of engineers focused on defining and achieving reliability metrics for internal and external products and services.
As a reliability engineering leader, you will oversee the development of service level objectives that balance availability/latency with development velocity across the organization.
You will be responsible for guiding your team in architecting high-availability language model serving infrastructure capable of supporting millions of external customers and high-traffic internal workloads.
The ideal candidate will have experience managing and scaling reliability or infrastructure engineering teams, possess deep technical knowledge of distributed systems observability and monitoring at scale, and understand the unique challenges of operating AI infrastructure.
This role requires excellent leadership and communication skills, with the ability to influence at all levels and drive adoption of SLO/SLA frameworks across the organization.
The team is expected to make significant improvements in reliability for Anthropic's services while pioneering the use of modern AI capabilities to reengineer how we approach reliability engineering.
We are seeking an experienced engineer who can effectively lead technical discussions, translate between ML engineers and infrastructure teams, and build a strong engineering culture focused on reliability, operational excellence, and innovation.
Responsibilities:
- Lead and grow a team of reliability engineers
- Drive the development of service level objectives (SLOs)
- Oversee the design and implementation of comprehensive monitoring systems
- Guide the team in architecting high-availability infrastructure
- Develop automated failover and recovery systems
- Establish incident response processes
- Direct cost optimization initiatives
- Partner with cross-functional teams
- Build a strong engineering culture
Experience:
- Managing and scaling reliability or infrastructure engineering teams
- Distributed systems observability and monitoring at scale
- Operating AI infrastructure
Technical Knowledge:
- Distributed systems
- Observability
- Monitoring
- AI infrastructure
We offer competitive compensation and benefits, optional equity donation matching, generous vacation and parental leave, flexible working hours, and a lovely office space in which to collaborate with colleagues.
Additional Information: We value impact – advancing our long-term goals of steerable, trustworthy AI – rather than work on smaller and more specific puzzles.
We believe that the highest-impact AI research will be big science, and we're seeking exceptional candidates who collaborate thoughtfully with Claude to realize this vision.
Please provide either your LinkedIn profile or your resume; we require at least one of the two.
-
Dublin, Dublin City, Ireland Anthropic Full time
Engineering Manager, AI Reliability Engineering
Dublin, IE
About Anthropic
Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working...
-
AI Infrastructure Engineer
5 days ago
Dublin, Dublin City, Ireland beBeeArtificial Full time €100,000 - €150,000
Build the Future of AI
Kaseya is a leading provider of complete IT infrastructure and security management solutions for MSPs and internal IT organizations worldwide powered by AI. You will design, build and maintain the automated backbone of our AI platform, driving CI/CD, Infrastructure as Code (IaC), and observability to enable secure, scalable, and rapid...
-
AI Reliability Engineer
5 days ago
Dublin, Dublin City, Ireland beBeeReliability Full time €100,000 - €200,000
Job Description
We are seeking a talented Reliability Engineer to join our team. In this role, you will be responsible for developing and implementing monitoring systems, designing high-availability infrastructure, and leading incident response efforts. The ideal candidate will have extensive experience with distributed systems observability and monitoring at...
-
Dublin, Dublin City, Ireland Anthropic Full time
Staff Software Engineer, AI Reliability Engineering
Dublin, IE
About Anthropic
Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working...
-
Dublin, Dublin City, Ireland Anthropic Full time
Staff Software Engineer, AI Reliability Engineering
About...
-
AI Agent Developer Evangelist
2 days ago
Dublin, Dublin City, Ireland Naptha AI Full time
Overview
AI Agent Developer Evangelist | We are seeking an exceptional AI Agent Developer Evangelist to build and nurture relationships with frontier AI developers and shape the future of AI agent development. This is a rare opportunity to influence the future of AI agent infrastructure at a massively ambitious scale, backed by industry veterans and technical...
-
AI Agent Engineer
6 days ago
Dublin, Dublin City, Ireland Naptha AI Full time
About The Role
We are seeking an AI Agent Engineer to join our team at Naptha AI, where you'll help build and test AI agents using our interoperability platform. This role is perfect for developers who are passionate about AI and eager to gain hands-on experience with the latest agent frameworks and...
-
Senior Site Reliability Engineer
4 weeks ago
Dublin, Dublin City, Ireland Epoch Biodesign Full time
Crusoe is building the World's Favorite AI-first Cloud infrastructure company. We're pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the...
-
Senior Site Reliability Engineer
2 weeks ago
Dublin, Dublin City, Ireland Crusoe Full time
Crusoe is building the World's Favorite AI-first Cloud infrastructure company. We're pioneering vertically integrated, purpose-built AI infrastructure solutions trusted by Fortune 500 companies to power their most advanced AI applications. Crusoe is redefining AI cloud infrastructure, with a mission to align the future of computing with the future of the...