As a well-rounded systems reliability engineer with a diverse set of skills, you are one of the very best people to troubleshoot, monitor the platform, and stay on top of releases. You should be the type who appreciates variety in your day and challenges outside your comfort zone! A typical day might include activities like these:
- Taking charge of the build process and pipelines across the platform.
- Staying keenly aware of systems architecture and building in redundancy and backup for new systems and software as a matter of course.
- Assisting in troubleshooting complex customer issues across network devices, server hardware, virtual machines, in-house software, and open source software. Not only can you run tcpdump with filters on the command line, but you can also read its output there.
- Adding monitoring and alerting on systems across the platform to help you pin down those annoying intermittent issues you have seen in the logs.
Skills & Requirements
The right candidate will probably have a CS degree, solid scripting and automation skills, great troubleshooting skills across the OS and network, a good grasp of security concepts, and experience with routing platforms and protocols, and will enjoy working collaboratively.
Specific requirements include:
- Experience automating tasks through scripting. You should be well versed in Python, and probably a few other languages. We will ask for script samples.
- A strong drive to improve and automate your environment with minimal guidance.
- The ability to solve immediate problems and plan for future ones.
- Experience with Ansible, Salt, Chef, Puppet, Terraform, or CFEngine. Experience with Ansible and Terraform preferred.
- Experience with build pipelines, integration testing and Jenkins.
- Experience administering a wide variety of *nix platforms, including multiple Linux variants.
- Solid understanding of Layer 2 and Layer 3 protocols including IPv4/IPv6, 802.1Q, BGP, MPLS, etc., and of a variety of network architectures.
- Experience with Google Compute, AWS, or other cloud-based compute and database services.
- An understanding of the importance and implementation of backup and redundancy across many layers of databases, systems, and network configurations.
Some knowledge that would be a huge plus:
- Familiarity with administering and troubleshooting Juniper, Cisco, and Arista platforms.
- Experience with extremely large scale network management and monitoring.
- Experience with PostgreSQL, TimescaleDB, and Elasticsearch.