Site Reliability Engineer - Big Data
Reston, Virginia
Job Id:
155933
Job Category:
Job Location:
Reston, Virginia
Security Clearance:
Public Trust or Uncleared
Business Unit:
Zachary Piper
Division:
Zachary Piper Solutions
Position Owner:
Gillian Contillo
Zachary Piper Solutions is seeking a Site Reliability Engineer - Big Data to build and manage a Data Platform that enables the creation of large-scale, high-throughput data products and services delivering actionable operational and business intelligence. This position is hybrid, with two days a week onsite in Reston, VA.
**Candidate must not require any work authorization**
Key Responsibilities:
Architecting, deploying, and managing large-scale data platforms (Kafka, Spark, Hadoop, Druid) running on top of Kubernetes
Automating cluster provisioning (CI/CD), scaling, and monitoring using Ansible, Python, and Jenkins
Participating in technical designs for software solutions that combine open-source, commercial, and custom-developed components
Ensuring platform SLOs by collecting, visualizing, and alerting on relevant telemetry
Upgrading large-scale data platforms to improve system capabilities and security while ensuring minimal customer impact
Troubleshooting complex issues in large, distributed environments
Staying up to date with industry best practices and standards for data platforms, focusing on hybrid cloud environments
Supporting data platform customers
Participating in the on-call rotation, monitoring production systems, and responding to incidents
Requirements:
Candidate must not require any work authorization
Bachelor’s degree in computer science or a related technical field, or equivalent combination of education and experience
5+ years of experience managing big data platforms (Hadoop, Spark, Kafka, Druid)
Excellent understanding of Linux configuration and administration
Strong automation experience: not just developing automation, but knowing why we automate and what to automate
Strong understanding of infrastructure-as-code tools such as Ansible
Experience with Docker or Kubernetes in a production environment
Strong written and verbal communication skills – able to clearly and succinctly describe complex issues
Compensation:
$140,000-$150,000/year **depending on years of experience and degree**
Full Benefits: Medical, Dental, Vision, 401K, Paid Holidays, PTO, and Sick Leave if required by law
This job opens for applications on 12/4/2025. Applications for this job will be accepted for at least 30 days from the posting date.
Keywords: Site Reliability Engineer, SRE, Big Data, Data Platform, Hybrid Cloud, Operational Intelligence, Business Intelligence, High-throughput Data Products, Distributed Systems, Kafka, Spark, Hadoop, Druid, Kubernetes, Docker, Linux Administration, Cluster Provisioning, CI/CD, Ansible, Python, Jenkins, Infrastructure-as-Code, Telemetry, Monitoring, Automation, Upgrades & Security, Troubleshooting, Open-Source Integration, Data Platform Management, Containerization, Configuration Management, Visualization & Alerting, On-call Rotation, Production Systems Monitoring, DevOps, Linux, automation, design, automate, large-scale, ideation, implementation, deployment, customer onboarding, support, cross-team collaboration, Data Engineering, Infrastructure, Engineering, Security, Operation Teams.