Site Reliability Engineer

Remote
Full Time
Experienced
Our client is seeking a Site Reliability Engineer (SRE) who will be responsible for ensuring the reliability, performance, and scalability of its software, websites, and applications. This role requires a combination of software engineering and systems administration skills to monitor, control, and automate systems. The ideal candidate will have a deep understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance. This position plays a critical role in maintaining the overall health and efficiency of the client's platform.

Key Responsibilities:

System Monitoring and Maintenance:
- Monitor the performance and reliability of Kubernetes clusters, software, websites, and applications.
- Automate routine maintenance tasks to ensure system stability and performance.

Incident Response and Troubleshooting:
- Respond to and resolve incidents in a timely manner, minimizing downtime and impact on users.
- Conduct root cause analysis to identify and address underlying issues.
- Develop and implement strategies to prevent future incidents and improve system resilience.

Automation and Infrastructure Management:
- Design, build, and maintain automated systems and processes to improve efficiency and reduce manual intervention.
- Manage cloud infrastructure, including provisioning, scaling, and optimizing resources.
- Collaborate with development teams to ensure seamless deployment and integration of new features and updates.

Performance Optimization:
- Analyze system performance and identify areas for improvement.
- Implement performance tuning and optimization techniques to enhance system efficiency.
- Collaborate with cross-functional teams to ensure optimal performance of all components.

Security and Compliance:
- Ensure compliance with security best practices and industry standards.
- Implement and maintain security measures to protect systems and data.
- Conduct regular security audits and vulnerability assessments.

Documentation and Reporting:
- Maintain accurate and up-to-date documentation of systems, processes, and procedures.
- Generate and analyze reports on system performance, incidents, and other key metrics.
- Provide regular updates to management and stakeholders on system health and performance.

Continuous Improvement:
- Identify opportunities for improving system reliability, performance, and scalability.
- Stay up to date with industry trends and best practices in site reliability engineering.
- Participate in training and development opportunities to enhance skills and knowledge.

Qualifications:
- Deep expertise in Kubernetes and containers.
- Strong understanding of cloud infrastructure, automation tools, and best practices for maintaining high availability and performance.
- Experience with monitoring and logging tools such as Loki and Grafana.
- Minimum of 3 years of experience in site reliability engineering, Kubernetes administration, or a related role.
- Excellent problem-solving skills and attention to detail.
- Strong communication and interpersonal skills, with the ability to work effectively with cross-functional teams.