Lucas Group has partnered with a growing SaaS company on their search for a Remote Sr. Site Reliability Engineer.
Scope of Responsibilities:
- Work as part of a team that troubleshoots applications, middleware, infrastructure, networks, tools, and patching.
- Maintain and provide updates on ongoing operations and project tasks.
- Build enhancements within an existing software architecture and suggest improvements to the architecture.
- Assist in defining appropriate operational plans.
- Collaborate on architectural design reviews and changes.
- Own, define and improve metrics, KPIs, SLOs and visualizations for systems.
- Act as an escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).
- Drive quality accountability within the organization with well-defined processes, metrics, and goals for process quality. This includes leading effective postmortems and ensuring action items are followed up.
- Build and maintain robust, actionable alerting and monitoring systems and workflows. Influence across boundaries and at all levels of the organization.
- Implement, maintain and improve CI/CD processes and tools.
- Work closely with development teams to improve services, deployments and releases.
- Troubleshoot production issues and continue to document runbooks.
- Participate in an on-call rotation to address production issues.
Qualifications:
- A desire to automate everything. Whether that be infrastructure as code or tooling to eliminate toil, automation should be a core focus of your mindset, and eliminating repetitive tasks should be a constant goal in the role.
- A mindset of total ownership - you aren’t afraid to dig into things you’ve never worked on before, from the browser all the way to the persistence layer. You’ve got a solid foundation in debugging and can jump into any problem you’re asked to help with.
- An architectural mind. You understand the fundamentals of distributed computing and look for ways to make systems more resilient and self-healing and to eliminate the need for human intervention as much as possible.
- Very strong communication and interpersonal skills that allow you to work well in a team environment and deliver excellent customer service.
- The ability to convey the importance of site reliability in both business and technical terms to a wide variety of audiences, ranging from non-technical stakeholders to the most technical engineers, and to drive stakeholder buy-in for key metrics such as SLAs/SLOs for all supported systems.
- Ability to maintain SLAs through the implementation of proactive issue detection and reporting.
- Experience developing scripts or tools for automating administrative tasks.
- Prior successful experience as a systems performance or site/systems reliability engineer.
- Demonstrated experience working in large, complex systems environments.
Educational & Work Requirements:
- Degree: B.S. in Computer Science
- Work Experience: 5+ years in a Site Reliability role
Title: Sr. Site Reliability Engineer
Client Industry: SaaS company
Lucas Group ID: 1585969