Site Reliability Engineer - US East Coast (100% remote)

Giant Swarm GmbH

15 days ago

Giant Swarm is a fast-growing open-source infrastructure management platform used by modern enterprises. Our vision is to empower developers around the world to ship great products.

We're a distributed, diverse, and growing team, spread across Europe. The company is based in Cologne, Germany, where we have a small office in a coworking space. However, fewer than 5% of us actually work there. All workflows are designed to function remotely - but of course, if you want to visit Cologne, you are more than welcome!

What we offer on top:

  • Choose the hardware you like the most!

  • Family first - we have more kids than employees!

  • Join our team at conferences all over the globe!

  • Internal Hackathons - we love to challenge ourselves!

  • 2 Off-Sites per year (check our photos on Instagram)!

What’s the most outstanding part about working for Giant Swarm?
"It's a long list, but for me, the most important thing is the people. It's great to be surrounded by so many smart people - there's a lot of work to do, but it doesn't feel like an uphill struggle because everyone pulls their weight so well."

(Simon Weald, SRE)

While we are remote-first, we appreciate quality time with our co-workers, so we meet in person twice a year to work and have fun together.

Work-life integration

  • Flexible working hours, and working from home or anywhere you prefer

  • Currently, our team members' kids outnumber our employees.

  • We care not only about the kids “within” the company, but about all children - for example, we offset the carbon emissions of all our flights.

  • As an international company, we want to create similar standards for everyone, regardless of location. So additional perks (for example, a location-aware, fixed amount paid each month to cover costs like co-working, phone contracts or gym memberships), paid parental leave and healthcare compensation are standard for everyone.

Your Job

  • You maintain, operate and upgrade our own and our customers’ Kubernetes clusters.

  • You will design, configure, build, and maintain our core infrastructure, from kernel parameters to the cloud provider templates.

  • You understand how servers and systems work and you tweak their behavior to your needs.

  • You will be responsible for our monitoring, logging and alerting.

  • You will help resolve incidents on our own and our customers’ clusters.

  • You participate in the on-call support schedule (roughly one 24-hour shift every 2 weeks).

  • You are a go-to person in case our developers need advice regarding infrastructure.

  • You will automate all the things.


  • You must have deep, hands-on knowledge of Kubernetes from both the end-user and the operation side.

  • You have broad experience with, and are able to debug, networking, security and Linux (kernel, namespaces, cgroups).

  • You have great debugging skills and you are not afraid to deep dive into thousands of lines of logs.

  • You have decent coding skills, preferably in Go. You have experience with maintaining infrastructure with code.

  • You know the good and bad parts of various automation tools (Terraform, Chef, Puppet, Ansible or SaltStack).

  • You are fluent with tools from the CNCF ecosystem running on top of Kubernetes (Prometheus, Grafana, ingress controllers, …): you know how to use them and how to configure them.

  • You have a decent knowledge of storage including software-defined storage.

  • You like reverse engineering and performance engineering.

  • You automate all the things by writing code. Using bash scripts for it makes you sad :)

  • We are currently mostly distributed around Europe (around UTC), but we have recently won our first US client and are looking for someone in the same time zone. So you should be located somewhere on the East Coast of the Americas (North, Central or South).

Why we think this job is worth applying for!

Impact, Impact, Impact! We are a remote-first organization with a growing team from 15+ European countries. Every new team member changes the team. This is great! People who know things we don’t are highly welcome.

“It's easier to ask forgiveness than it is to get permission” (Grace Hopper) - sure, it’s not 100% like this, but we have a strong culture of failure, which is part of our agile mindset. We don’t do things by the guidebook. You can try things out! Our default of 100% transparency will help you here.
We play a key role in our customers' digital transformation. We have partnered up with Amazon and Microsoft to provide our solution on their cloud platforms - more will follow.
We have been in this ecosystem from the get-go and as part of the CNCF family, we feel at home in the community. As a part of Giant Swarm, you will also join this extended family.
We serve some of Europe's leading organizations and are talking to many more.

What’s the most challenging part of your job?
"Finding time to concentrate on a specific task (especially if it's in-depth) - SREs context-switch a lot."
(Simon Weald, SRE)

Interested? Questions? Coffee? Contact Mirco or apply directly!