Job Description
Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.
Collaborate with cross-functional teams to define and establish desired service levels.
Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
Conduct post-incident analyses to identify root causes and implement measures that prevent recurrence.
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.