GL - Site Reliability Engineer (SRE)

  • Remote
  • Contracted
  • Experienced

Job Summary

As we continue to build and scale our cloud-native systems and services, we are looking for several Site Reliability Engineers (SREs) to help keep our projects running smoothly. Collaborating with our product teams, you will help ensure that services and features operate at an established level of availability, scalability, performance, observability, and maintainability, and help define the SLIs and SLOs for new products.
This role reports to an Engineering Manager and shares on-call responsibilities as part of the team's on-call rotation.
Some of the technologies we use (familiarity is good; proficiency in all of them is not required):

  • Kubernetes
  • MySQL, DynamoDB, Redis, SQS, SNS
  • AWS, Terraform (IaC) 
  • Ambassador, Helm, Argo CD
  • REST, gRPC
  • Node.js, Kotlin, Java, Go
  • Datadog

You’ll be:

  • Acting as a conduit between product development and platform engineering teams to ensure services meet defined SLOs.
  • Helping identify Service Level Indicators (SLIs) and define Service Level Objectives (SLOs) to assess the stability and reliability of all our applications.
  • Continuously influencing our engineering practices to improve our MTTR for production incidents.
  • Working to bring efficiency and standardization to our incident command practices and norms.
  • Being accountable for metrics and monitoring of team services in production, providing data and transparency that help track SLOs for a product.
  • Enhancing existing services and applications to increase availability, reliability, and scalability in a microservices environment.
  • Building and improving engineering tooling, process, and standards to enable faster, more consistent, more reliable, and highly repeatable application delivery.

You’ll have:

  • At least 3 years of previous experience working as an SRE.
  • A history of working on a large-scale product in either Java or Node.js.
  • Experience implementing monitoring and alerting for services and establishing SLOs for services. (We use Datadog, but experience with other tools is fine.)
  • Strong understanding of working in a cloud-native ecosystem. (We use AWS, but we’ll consider other cloud experience.)
  • Experience deploying to and orchestrating containers (Docker, Kubernetes).
  • Strong understanding of CI/CD pipelines using Git, Artifactory, Argo CD, etc.
  • Experience automating and building tooling to interact with existing APIs to support the observability and general reliability of service deployment.
  • Previous experience working with Terraform is a plus!

About Our Client:

Our client is an Atlanta-based fintech company that makes a debit card for kids and companion
apps for the family. We proudly serve more than 4 million parents and kids, with in-app tools for
sending money, setting savings goals, monitoring balances, managing chores, automating
allowance, and investing.
Our client is on a mission to support parents and help every kid grow up to be financially healthy
and happy.
