SRE Technical Manager
Company: Leidos
Location: San Diego
Posted on: November 11, 2024
Job Description:
More About the Role:
Leidos currently has an opening
on the Service Management Integration and Transport (SMIT) Contract
for a Site Reliability Engineering (SRE) Technical Manager. This is
an exciting opportunity to use your experience and leadership
skills to successfully execute the mission of the Navy's largest IT
services program. Under the SMIT Contract, the Leidos team is
responsible for the core backbone for the Navy-Marine Corps
Intranet, including cybersecurity services, network operations,
network engineering, service desk, seat support services, and data
transport.

We are seeking a highly skilled and experienced SRE
Technical Manager to lead our Data Center Site Reliability
Engineering (SRE) team. In this role, you will manage a group of
talented engineers responsible for ensuring the reliability,
performance, and scalability of critical systems across 6-8 SRE
Pods. You will work closely with engineering, product, and
operations teams to implement best practices in automation,
incident management, and system monitoring. This role will focus on
both the strategic and operational aspects of site reliability,
ensuring that the team meets performance objectives while fostering
a culture of innovation and continuous improvement. The SRE
Technical Manager will collaborate with the Director of Site
Reliability Engineering and is responsible for supporting,
migrating, automating, and optimizing software development and
deployment processes and infrastructure as code, and for maturing
the Site Reliability Engineering program. The manager will mentor
and coach lower-level technical staff, performing collaborative code
reviews to strengthen SRE skills across the teams.
What You'll Get to Do:
- Manage and mentor 6-8 SRE teams (pods) and 60+ FTEs, providing
guidance, setting performance expectations, and fostering
professional development.
- Work collaboratively with SRE Resource Managers to staff and
maintain engineering resources for your SRE vertical teams'
reliability and scalability goals.
- Responsible for the P&L across the Data Center Services
vertical. Manage the SRE team's resources, including budget
planning, tool selection, and infrastructure investments to meet
reliability and scalability needs.
- Meet regularly with your team members and participate in
performance reviews, interviews, and development planning.
- Oversee the reliability, availability, and performance of
critical systems by leading the SRE teams within the data center
vertical in implementing monitoring, incident response, and
performance optimization strategies.
- Ensure the team adheres to best practices for system
reliability, automation, and operational efficiency.
- Drive continuous improvement initiatives by analyzing
performance metrics (e.g., SLOs, MTTR, MTBF) and identifying areas
for enhancement.
- Collaborate with operations, quality, cybersecurity and other
SRE engineering teams to define and enforce Service Level
Objectives (SLOs) and manage error budgets.
- Act as a liaison between the SRE team and other departments to
prioritize reliability and operational needs in the product
development process.
- Collaborate with senior leadership to define the SRE strategy,
set long-term reliability goals, and ensure alignment with business
objectives.
- Lead efforts to reduce operational toil through automation.
Work with the team to build or enhance automation tools that manage
infrastructure, monitor systems, and respond to incidents.
- Oversee the development and adoption of Infrastructure as Code
(IaC) tools, CI/CD pipelines, and other automation processes.
- Ensure that SRE practices align with organizational security
policies and compliance requirements.
- Collaborate with security teams to integrate
reliability-focused security practices into the design and
operation of systems.
- Ensure systems meet or exceed agreed-upon service levels by
proactively addressing potential issues and working with
stakeholders to align on reliability expectations.
- Work within an SRE team, collaborating with Developers,
Security, and Operations to continuously deliver products and
increase the value stream for the organization and customers.
- Embrace and champion Agile development processes and the adoption
of modern Site Reliability Engineering workflows and practices
while providing technical guidance to team members and coworkers on
best practices.
- Stay up to date on the latest Site Reliability Engineering
practices and technologies.
- Strive to provide internal and external customers with
excellent, world-class service.
- Resolve most conflicts between timeline, budget, and scope
independently, but use sound judgment to raise complex or
consequential issues to senior management.
You'll Bring These Qualifications:
- Requires BS degree (or equivalent) in Cybersecurity,
Information Security, IT, Network Engineering, Computer Science, or
related field or Master's with 6+ years of prior relevant
experience with 8-10 years of SRE or DevOps experience and at least
4 years in a leader or manager capacity.
- US Citizen with DoD Secret Clearance.
- Minimum of DoD 8570.01 IAT Level II Certification required
prior to onboarding and must maintain certification while
supporting the SMIT Contract.
- Must be able to support program execution in classified
environments and access SIPRNet from an NMCI location on short
notice (local travel).
- Exceptional written and oral communication skills including
producing technical analysis/reports, presentations and executive
level briefings with internal and external stakeholders.
- Ability to review and comprehend requirements and to design
capabilities that satisfy customer requirements.
- Ability to work in a highly collaborative, forward-thinking,
and innovation-driven environment.
- Proven experience managing teams responsible for large-scale,
distributed systems with high reliability and performance
demands.
- Strong track record of managing incidents, conducting
postmortems, and implementing reliability improvements.
- Experience implementing and managing Agile or DevOps processes,
with a focus on continuous improvement, efficiency, and team
productivity.
- Ability to lead teams through strategic initiatives such as
reliability maturity assessments, process automation, and tooling
selection.
- Solid understanding of SRE principles, including Service Level
Objectives (SLOs), Service Level Indicators (SLIs), and error
budgeting.
- Experience with commercial cloud infrastructure deployment
environments such as AWS and Azure.
- Strong knowledge of automation tools, CI/CD pipelines, and
Infrastructure as Code (IaC).
- Experience with Agile and DevSecOps/SRE concepts and best
practices.
- Hands-on experience with Atlassian products (Jira, Confluence,
Bitbucket, etc.).
- Experience creating Jira and/or Azure DevOps workflows,
projects, and custom configurations.
- Solid experience integrating with and maintaining various
third-party CI/CD tools such as Jenkins and GitLab.
- Experience with automated provisioning and configuration tools
such as Terraform, CloudFormation, Ansible, or similar
technologies.
- Basic Linux skills supporting Red Hat Enterprise Linux
(RHEL).
- Working knowledge of the Risk Management Framework (RMF) and DISA
STIGs.
These Qualifications Would be Nice to Have:
- Previous work experience providing support to the NGEN-NMCI
program is highly desired.
- Previous technical people leadership experience of 8 or more
FTEs.
- Experience with microservices architecture and distributed
systems.
- Familiarity with serverless and event-driven
architectures.
- Certification in cloud platforms (e.g., Azure Certified DevOps
Engineer).
- Experience in high-growth environments or managing teams during
significant scaling periods.
- ITILv4 and Agile SAFe certifications or applicable
experience.
Original Posting Date: 2024-11-08
While subject to change
based on business needs, Leidos reasonably anticipates that this
job requisition will remain open for at least 3 days with an
anticipated close date of no earlier than 3 days after the original
posting date as listed above.
Pay Range: $108,550.00 - $196,225.00
The Leidos pay range for this job level is a general
guideline only and not a guarantee of compensation or salary.
Additional factors considered in extending an offer include (but
are not limited to) responsibilities of the job, education,
experience, knowledge, skills, and abilities, as well as internal
equity, alignment with market data, applicable bargaining agreement
(if any), or other law.
#Remote