Cloud Engineer
Job ID: 112574
Location: Plano, Texas [Hybrid]
Category: Infrastructure
Employment Type: Contract
Date Added: 05/11/2026
Role Summary
This role calls for a senior Cloud Engineer with expertise in building and managing scalable observability and infrastructure platforms for enterprise-level cloud microservices environments. The hybrid position demands hands-on experience with container orchestration, cloud infrastructure automation, and high-volume monitoring systems. The engineer will own end-to-end components, support production operations, and use AI tools for system troubleshooting and code generation.
Responsibilities
- Design, develop, and operate observability platforms enabling logging, metrics collection, and tracing for cloud-based microservices applications.
- Manage and optimize large-scale Kubernetes clusters across multiple regions, including Helm chart management, pod scheduling, and resource tuning.
- Own and maintain CI/CD pipelines built on Argo CD, Helm, and GitOps practices to ensure reliable deployment workflows.
- Implement Infrastructure as Code (IaC) solutions with Terraform to provision and manage AWS infrastructure at scale.
- Operate and maintain monitoring ecosystems including OpenSearch/Elasticsearch, Prometheus, Grafana, Splunk, and Kafka, ensuring high availability and performance.
- Develop automation that proactively detects, responds to, and remediates production issues.
- Ensure security and compliance by managing vulnerability patching and automating security best practices in container environments.
- Collaborate with cross-functional teams to improve system reliability, scalability, and performance, contributing to distributed system design.
- Participate in on-call rotations, incident response, and post-incident analysis to uphold SLA commitments.
- Utilize AI-assisted coding and troubleshooting tools to accelerate system development, automation, and incident resolution.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or related field.
- Minimum of 8 years of experience in DevOps, SRE, or platform engineering roles supporting production cloud environments.
- Proven incident response experience, including alert triage, root cause analysis, and SLA management in 24/7 operations.
- Expertise in Infrastructure as Code principles with proficiency in Terraform, Ansible, or similar automation tools for cloud provisioning.
- Strong scripting skills in Python, Golang, or Bash for automation, tooling, and CI/CD pipeline integration.
- Extensive experience operating and troubleshooting large-scale Kubernetes workloads, including Helm chart management and multi-cluster orchestration.
- Hands-on knowledge of observability stacks such as OpenSearch, Prometheus, Grafana, Loki, and Splunk, including query optimization and capacity planning.
- Familiarity with Kafka and Amazon MSK, including cluster operation, topic configuration, and schema management.
- Experience deploying, managing, and migrating Splunk Enterprise environments with Kubernetes-based log shipping architectures.
- Working knowledge of OpenTelemetry, distributed tracing, and application performance monitoring in cloud environments.
- Understanding of security frameworks, container hardening practices, and vulnerability remediation at scale, including standards such as FedRAMP, STIG, IL5, ISO 27001, and SOC 2.
- Experience using AI tools like LLMs, GitHub Copilot, or custom AI agents to enhance operational workflows and incident management.
- Effective communication skills and the ability to work independently in a hybrid work setting.
Published Pay Range: $65.00 – $67.00 hourly
This position offers a hybrid schedule, with time split between the office and remote work.
