Site Reliability Engineer

IT & Telecom

This assignment is no longer available.

Our client is seeking a Senior Site Reliability Engineer who excels at working on the operational side of DevOps. Attention to detail, proactivity, and problem-solving skills are key, as is the ability to communicate and collaborate effectively.

Location: Gothenburg, minimum 3 days on site

Language: Fluent English

Job description

Position: Senior Site Reliability Engineer (SRE) within Platform Operations and Support

• Be a service-minded team player with a quality-driven approach.

• Manage and dispatch incidents and service requests.

• Provide high-quality support, drive troubleshooting and root cause analyses (RCAs), and act as an advisor to development teams.

• Maintain platform availability, shorten time to market for new features, and improve performance.

• Play a crucial role in troubleshooting and quality assurance from an end-to-end perspective.

• Focus on understanding, monitoring, and improving the production system, actively preventing future incidents.

• Be a guiding light for continuous improvement and innovation.

Overview of responsibilities

System support & troubleshooting

• Guide and coordinate junior colleagues within the team.

• Assist with the initial technical analysis of production incidents.

• Support the development team in building alerting and monitoring capabilities.

• Conduct code reviews for reported cases, and handle fix development and delivery.

Infrastructure Automation and Configuration Management

• Develop and maintain automation tools, scripts, and configuration management systems.

• Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.

• Collaborate with development and operations teams to automate build, test, and deployment processes.

Reliability Engineering and Resilience

• Design and implement systems and processes to enhance infrastructure reliability and resilience.

• Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.

System Monitoring and Incident Response

• Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.

• Set up alerts, dashboards, and metrics for proactive detection and response to incidents.

• Investigate and diagnose root causes of incidents and work towards resolution in a timely manner.

Continuous Improvement and Collaboration

• Drive a culture of continuous improvement by identifying areas for automation and efficiency.

• Document procedures, incidents, and best practices for knowledge sharing and team efficiency.

• Stay updated on industry trends and emerging technologies to propose innovative solutions.

• Collaborate closely with cross-functional teams to ensure smooth operation of systems.

Required skills & experience

• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience) with 5+ years of DevOps/SRE work.

• Proficient in scripting/programming languages such as Python and Bash.

• Experience with cloud platforms (AWS preferred).

• Experience with DevOps practices, CI/CD, and monitoring tools.

• Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.

• Strong troubleshooting and problem-solving skills and keen attention to detail.

• Excellent communication and collaboration skills to work effectively in a cross-functional team environment.

• Strong experience in system administration, infrastructure management, or site reliability engineering.

AWS

Bash

Configuration

Production

Programming Languages

Python

Team Player

Troubleshooting

Emerging Technologies

Monitoring

DevOps

Automation

English

Ansible

Kubernetes

Puppet

CI/CD

Terraform

Code review

Location

Gothenburg, Västra Götaland County

Period

ASAP - Open
