Site Reliability Engineer - Madrid, Spain - Celonis

    Posted 2 weeks ago

    Description

    The Team

    As part of our scaling Actions Platform team, you'll have a huge impact on helping teams and engineers build and operate resilient, reliable and scalable systems. You'll have ownership over our product's health, ensuring end-to-end availability and peak performance.

    We are on the path to providing a first-class service, so we want our product to be healthy and reliable at all times.

    The Role

    Collaboration is a huge part of our Celonis culture. Within this role, you'll help teams catch issues before they affect customers, and tie reliability to business outcomes.

    By helping our Product teams understand the reliability of their services and how they can improve it, our teams and engineers will be able to build, deliver, and operate resilient, reliable systems.

    The work you'll do

    • Take ownership of complex issues related to performance, reliability, and scalability, from idea inception to production, including all required technical and organizational improvements.
    • Help our engineering teams gain full control over the stability and performance of their services.
    • Support and drive the investigation and resolution of incidents and issues in production.
    • Monitor and maintain object and data storage solutions.
    • Lead postmortems and root cause analysis to facilitate continuous improvement.
    • Design, write, and deliver software that enhances the availability, scalability, and efficiency of our products.
    • Proactively identify, plan, and execute improvement opportunities to minimize risks, address recurrent issues, automate manual processes, improve quality, and streamline our software deliveries.
    • Provide technical leadership on reliability to engineers, managers, and product managers.
    • Improve our monitoring, metrics, and KPIs, as well as define and implement missing SLOs.
    • Implement processes and automation to prevent problem recurrence.
    • Share acquired knowledge and document accordingly while implementing SRE best practices.
    • Guide a technical roadmap for reliability to enable the planning and building of reliable solutions using our infrastructure and developer productivity platform.

    The Qualifications You Need

    • Typically 5+ years of experience in software engineering roles.
    • Master's degree in Computer Science or equivalent experience and skill set.
    • Experience in developing and running large-scale production services with Docker and Kubernetes.
    • Experience working with in-memory data stores (e.g., Redis), RDBMS (e.g., Postgres), AMQP message brokers (e.g., RabbitMQ), and NoSQL stores (e.g., Elasticsearch).
    • Experience working with public cloud providers (AWS, Azure, or GCP) and modern cloud monitoring and observability tools (e.g., Datadog).
    • Solid knowledge of scripting languages (e.g., Bash, Python, Ruby).
    • Experience with Java, JavaScript, or the Spring framework would be a plus.
    • Proven problem-solving skills and the ability to troubleshoot complex technical issues.
    • Deep commitment to maintaining high system reliability and availability.
    • Experience in supporting or mentoring other developers in running services reliably in production.
    • Excellent communication and collaboration skills to work effectively with cross-functional teams.