Observability Engineer

Full Time
Hyderabad, India

Role Overview

We’re looking for a Reliability Engineer to help design, build, and scale our observability platform and drive improvements through our SRE program. In this hands-on role, you’ll contribute directly to the systems that monitor the health and performance of our applications, helping engineers understand system behavior, troubleshoot issues quickly, and continuously improve service reliability.

You’ll be instrumental in shaping how we observe, measure, and improve our systems, helping define a culture of accountability, ownership, and operational excellence. We’re a team that values curiosity, collaboration, and a growth mindset. This is a great opportunity to influence both the tooling and the practices that support large-scale, reliable software delivery.

Another key deliverable for this role is to help us improve reliability within our critical applications. The reliability engineer will collaborate with application teams to determine what needs to happen to deliver high levels of uptime and performance for our customers.

Key Responsibilities:

Build and scale observability systems: Design and maintain infrastructure for collecting, aggregating, and analyzing telemetry data (metrics, logs, and traces).
Enable actionable insights: Develop dashboards, alerts, and visualizations that turn raw data into clear, meaningful information for engineers, SREs, and business stakeholders.
Collaborate across teams: Partner with engineering, operations, and SRE teams to define SLIs/SLOs and improve visibility into system performance and health.
Drive best practices: Advocate for and support consistent instrumentation, effective alerting, and strong observability practices across engineering teams.
Optimize systems and tools: Continuously assess performance, usage, and cost of observability tools, identifying opportunities for improvement and efficiency.
Automate: Build capabilities that embed SRE principles and best practices into what is deployed in the environment.
Improve: In collaboration with engineering teams, develop plans to improve the reliability of applications and infrastructure, and help those teams engineer the improvements.
Support incident response: Participate in and help improve the incident response process, reducing MTTR and contributing to post-incident reviews and root cause analysis.

Required Skills & Experience:

Technical Skills

Programming experience in languages like Go, Python, Java, or Node.js. Able to contribute tools and advise on application-level instrumentation improvements.
Observability tooling expertise with the following tools:
LGTM (Loki, Grafana, Tempo, Mimir)
Datadog
CloudWatch
Prometheus
PagerDuty
ClickStack
VictoriaMetrics
Groundcover
Libre
Zabbix
Cloud experience with AWS and services such as EC2, EKS, ECS, and VPC networking.
Containers & orchestration: Familiarity with Docker and Kubernetes.
Infrastructure as Code & automation: Experience with tools like Terraform, Ansible, Chef, or SCCM to manage observability infrastructure at scale.
Linux systems knowledge: Strong understanding of Linux, shell scripting, and the storage/networking stack.
Tracing: Deep understanding of tracing technology and OpenTelemetry.
SRE practices: SLIs, SLOs, error budgets, and failure domains.

Soft Skills

Strong analytical skills for interpreting data and identifying trends or anomalies.
Clear and effective communication—both written and verbal—for working with technical and non-technical stakeholders.
Able to influence different teams on monitoring standards and how to improve the reliability of their applications.

Learn more about Provate

Provate is a global digital solutions company serving Fortune 500 and fast-growing organizations around the world.