Site Reliability Engineer

IT & Telecom
This assignment is no longer available.
Our client is seeking a Senior Site Reliability Engineer who excels on the operational side of DevOps. Attention to detail, proactivity, and problem-solving skills are key, as is the ability to communicate and collaborate effectively.
 
Location: Gothenburg, minimum 3 days on site
Language: Fluent English
 
Job description
Position: Senior Site Reliability Engineer (SRE) within Platform Operations and Support
• Be a service-minded team player with a quality-driven approach.
• Manage and dispatch incidents and service requests.
• Provide high-quality support, drive troubleshooting and root cause analyses (RCAs), and act as an advisor to Dev teams.
• Be responsible for maintaining platform availability, shortening time to market for new features, and improving performance.
• Play a crucial role in troubleshooting and quality assurance from an end-to-end perspective.
• Focus on understanding, monitoring, and improving the production system, actively preventing future incidents.
• Be a guiding star for continuous improvement and innovation.
 
Overview of responsibilities
System support & troubleshooting
• Guide and coordinate junior colleagues within the team.
• Assist in the initial technical analysis of production incidents.
• Support the development team in building capabilities for alerts and monitoring.
• Conduct code reviews for reported cases, the development of fixes, and their delivery.
 
Infrastructure Automation and Configuration Management
• Develop and maintain automation tools, scripts, and configuration management systems.
• Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.
• Collaborate with development and operations teams to automate build, test, and deployment processes.
 
Reliability Engineering and Resilience
• Design and implement systems and processes to enhance infrastructure reliability and resilience.
• Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.
 
System Monitoring and Incident Response
• Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.
• Set up alerts, dashboards, and metrics for proactive detection and response to incidents.
• Investigate and diagnose root causes of incidents and work towards resolution in a timely manner.
 
Continuous Improvement and Collaboration
• Drive a culture of continuous improvement by identifying areas for automation and efficiency.
• Document procedures, incidents, and best practices for knowledge sharing and team efficiency.
• Stay updated on industry trends and emerging technologies to propose innovative solutions.
• Collaborate closely with cross-functional teams to ensure smooth operation of systems.
 
Required skills & experience
• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience) with 5+ years of DevOps/SRE work.
• Proficient in scripting/programming languages such as Python and Bash.
• Experience with cloud platforms (AWS preferred).
• Experience with DevOps practices, CI/CD, and monitoring tools.
• Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.
• Strong troubleshooting and problem-solving skills and keen attention to detail.
• Excellent communication and collaboration skills to work effectively in a cross-functional team environment.
• Strong experience in system administration, infrastructure management, or site reliability engineering.
AWS
Bash
Configuration
Production
Programming Languages
Python
Team Player
Troubleshooting
Emerging Technologies
Monitoring
DevOps
Automation
English
Ansible
Kubernetes
Puppet
CI/CD
Terraform
Code review
Gothenburg, Västra Götaland County
Period
ASAP - Open-ended