Senior Site Reliability Engineer

Kunai is a fast-growing digital agency of 120+ people specializing in fintech. Our fully remote team is driven by innovation and experimentation. Over the past decade, we've shipped more than 150 products for clients that include Visa, the United Nations, Wells Fargo, Ernst & Young, and TOMS Shoes. Our founders built a previous agency (Monsoon) that was acquired by Capital One in 2015.

As a Senior Site Reliability Engineer and resident IT operations expert, you will leverage your deep knowledge of automation to create and maintain the scalable, reliable systems that are the foundation of our clients' apps. As a hands-on leader, you will maintain a portfolio of applications across our client projects. Your role will be challenging, fun, and interesting.
Job Type
Full-Time (Remote)
Oakland, CA (Remote)
Responsibilities
  • Own and manage the infrastructure (compute, data stores, cloud services, etc.) that powers a portfolio of client applications.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Support the application CI/CD pipeline for promoting applications into higher environments through validation and operational gating.
  • Practice sustainable incident response and blameless postmortems.
  • Take a holistic approach to problem-solving: during a production event, connect the dots across the technology stacks that make up the platform to minimize mean time to recovery (MTTR).
Requirements
  • Hands-on experience building and maintaining modern web and mobile applications, including web application firewalls (WAFs)
  • A background in identifying and debugging issues in live environments
  • Experience with serverless architecture
  • Experience with infrastructure-as-code (IaC) tools such as Terraform and AWS CloudFormation, and CI/CD tools such as GitHub Actions and CircleCI
  • Experience digging into data models and DB schemas to understand the impact of DB tasks and migrations in an upcoming release
  • Experience with AWS services such as Lambda, SQS, Amplify, DynamoDB, AppSync, API Gateway, SSM Parameter Store, S3, CloudFront, Route 53, ElastiCache, and VPC
  • Strong understanding of JavaScript (in particular Node.js) and the ability to help engineers debug live environments
  • Experience setting thresholds for alert levels, building and maintaining monitors, building CloudWatch dashboards, and building notifications via CloudWatch and other AWS services
  • Substantial scripting and high-level programming knowledge and experience, specifically with systems and related tooling
  • Understanding of major infrastructure components, such as load balancers, web servers, databases, and queueing systems
  • Good understanding of modern distributed systems, including their architecture, trade-offs, operability concerns, and solutions
  • An ability to multi-task and a willingness to adapt to changes quickly
  • End-to-end project ownership while being proactive about updates/blockers
  • Experience with monitoring tools such as CloudWatch, Datadog, PagerDuty, Sentry, and New Relic
Extra Credit
  • Experience working with offshore teams
  • Experience managing and executing disaster recovery testing in various environments
  • Experience with integration testing between internal systems and external vendors
Benefits

    We Remote

    Kunai is a fully remote company. If you have the skills, you can live and work anywhere you'd like.

    Flexible Hours & Vacation Time

    We expect a high level of personal responsibility and leave it to you to figure out what suits you and your team.

    Learning and Development

    Kunai understands that constant learning is the only way to stay on top of your craft.

    Health, Dental, Vision

    Kunai provides full benefits, and our plans offer a range of options.

    Maternity and Paternity Leave

    For our growing Kunai family!

    Career Coaching

    We want to help you reach your professional goals.