DevOps Engineer – Onsite 3 days
Job ID: 112526
Location: San Jose, California [Hybrid]
Category: App/Dev
Employment Type: Contract
Date Added: 05/01/2026
Role Summary
This is a senior-level DevOps Engineer position responsible for supporting and optimizing cloud-based collaboration platforms. The role involves operating, scaling, and maintaining observability platforms, Kubernetes environments, and automated deployment pipelines to keep large-scale distributed systems reliable and efficient. The ideal candidate has extensive production experience, strong operational discipline, and a focus on automation and reliability.
Responsibilities
- Design, develop, and maintain observability platforms, including logging, metrics, and tracing solutions for web services.
- Manage, operate, and optimize multi-region Kubernetes clusters to support high availability and scalability.
- Own and enhance continuous integration and continuous delivery (CI/CD) pipelines utilizing Argo CD and Helm.
- Implement infrastructure as code using Terraform on Amazon Web Services (AWS).
- Operate monitoring, logging, and data-streaming ecosystems such as OpenSearch or ELK, Prometheus, Grafana, Splunk, and Kafka.
- Develop automation tools to proactively detect, troubleshoot, and resolve production issues.
- Enforce security standards through vulnerability management, platform hardening, and compliance checks.
- Collaborate with application, platform, and security teams to improve system reliability and performance.
- Participate in on-call rotations and lead incident response activities to ensure rapid resolution of issues.
- Contribute to system architecture design, operational best practices, and review processes for distributed systems.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical field.
- Minimum of eight years of experience in DevOps, Site Reliability Engineering, or platform engineering roles.
- Extensive experience operating large-scale Kubernetes environments, with proficiency in container orchestration and resource tuning.
- Hands-on expertise with Helm chart management, multi-cluster operations, and pod scheduling.
- Strong knowledge of observability stacks such as OpenSearch/Elasticsearch, Prometheus/Mimir, Grafana, Loki, Splunk, or Logstash.
- Proven experience with ingestion pipeline design, query optimization, and capacity planning for telemetry systems.
- Proficiency with infrastructure as code tools like Terraform or Ansible on AWS.
- Working knowledge of scripting and automation languages such as Python, Golang, or Bash.
- Experience supporting 24/7 production environments, including incident management, alert triage, and post-incident review processes.
- Ability to work in a fast-paced environment with strong problem-solving skills.
Publishing Pay Range: $41.16 – $43.68 hourly
This is a hybrid role with 3 days onsite in San Jose, California.
