Observability Engineer

Role Overview 

We’re looking for a Reliability Engineer to help design, build, and scale our observability platform and drive improvements through our SRE program. In this hands-on role, you’ll contribute directly to the systems that monitor the health and performance of our applications, helping engineers understand system behavior, troubleshoot issues quickly, and continuously improve service reliability. 

You’ll be instrumental in shaping how we observe, measure, and improve our systems, helping define a culture of accountability, ownership, and operational excellence. We’re a team that values curiosity, collaboration, and a growth mindset. This is a great opportunity to influence both the tooling and practices that support large-scale, reliable software delivery.

Another key deliverable for this role is improving the reliability of our critical applications. The reliability engineer will collaborate with application teams to determine what it takes to deliver high levels of uptime and performance for our customers.

Key Responsibilities:

  • Build and scale observability systems: Design and maintain infrastructure for collecting, aggregating, and analyzing telemetry data (metrics, logs, and traces). 
  • Enable actionable insights: Develop dashboards, alerts, and visualizations that turn raw data into clear, meaningful information for engineers, SREs, and business stakeholders. 
  • Collaborate across teams: Partner with engineering, operations, and SRE teams to define SLIs/SLOs and improve visibility into system performance and health. 
  • Drive best practices: Advocate for and support consistent instrumentation, effective alerting, and strong observability practices across engineering teams. 
  • Optimize systems and tools: Continuously assess performance, usage, and cost of observability tools, identifying opportunities for improvement and efficiency. 
  • Automate: Engineer capabilities that build SRE principles and best practices into what is deployed in the environment.
  • Improve: In collaboration with engineering teams, develop plans to improve the reliability of applications and infrastructure, and assist these teams in engineering the improvements.
  • Support incident response: Participate in and help improve the incident response process, reducing MTTR and contributing to post-incident reviews and root cause analysis. 

 

Required Skills & Experience:

Technical Skills 

  • Programming: Experience in languages such as Go, Python, Java, or JavaScript (Node.js), with the ability to contribute tools and advise on application-level instrumentation improvements.
  • Observability tooling: Expertise with the following tools:
      • LGTM (Loki, Grafana, Tempo, Mimir)
      • Datadog
      • CloudWatch
      • Prometheus
      • PagerDuty
      • ClickStack
      • VictoriaMetrics
      • Groundcover
      • Libre
      • Zabbix
  • Cloud: Experience with AWS and services such as EC2, EKS, ECS, and VPC networking.
  • Containers & orchestration: Familiarity with Docker and Kubernetes. 
  • Infrastructure as Code & automation: Experience with tools like Terraform, Ansible, Chef, or SCCM to manage observability infrastructure at scale. 
  • Linux systems knowledge: Strong understanding of Linux, shell scripting, and the storage/networking stack. 
  • Tracing: Deep understanding of tracing technology and OpenTelemetry.
  • SRE practices: Working knowledge of SLIs, SLOs, error budgets, and failure domains.

Soft Skills 

  • Strong analytical skills for interpreting data and identifying trends or anomalies. 
  • Clear and effective written and verbal communication for working with technical and non-technical stakeholders.
  • Ability to influence teams on monitoring standards and on how to improve the reliability of their applications.

Learn more about Provate

Provate is a global Digital Solutions Company serving Fortune 500 and fast-growing organizations around the world. Learn who we are and why we are different.