Job Description
D·engage was founded and is led by MarTech SaaS veterans who have decades of experience growing multiple SaaS companies from the ground up. To achieve our target of growing to a $200 million company within the next 24 months, we are looking for a SaaS Service Resilience Manager to join our technology team: someone who is agile, results-driven, customer-obsessed, and loves learning!
This position provides a valuable opportunity for a software developer to enhance their expertise and contribute to impactful projects. Here are the responsibilities for this position:
Key Responsibilities:
• Resilience Planning and Strategy:
- Participate in developing and implementing a comprehensive service resilience strategy for all SaaS products.
- Participate in designing and maintaining disaster recovery and business continuity plans.
- Conduct regular risk assessments and impact analyses to identify vulnerabilities and mitigate risks.
• Ownership of Production Environment:
- Take ownership of, and responsibility for, the production environment, including cloud and on-premise infrastructure.
- Monitor production environments in collaboration with the VP of Development.
- Work with the VP of Security to ensure the security of the production environment.
• Team Building and Improvement:
- Build and lead a high-performing resilience team, continuously improving its quality.
- Train and improve the quality of technical support teams, including preparing training materials.
- Provide feedback to teams on problem detection and troubleshooting steps (logging, monitoring, health checks).
• Service Monitoring and Incident Management:
- Establish and manage robust monitoring systems to detect and respond to service disruptions promptly.
- Lead incident response efforts, including root cause analysis, resolution, and post-incident reviews.
- Develop and maintain incident response playbooks and procedures.
• Infrastructure and Performance Optimization:
- Collaborate with IT and engineering teams to design resilient infrastructure and applications.
- Implement redundancy, failover, and load balancing strategies to ensure high availability.
- Continuously monitor and optimize system performance, capacity, and scalability.
• Collaboration and Communication:
- Assist product and development teams in analysis when necessary.
- Analyze large-scale bugs and transfer them to the relevant teams.
- Troubleshoot server problems with teams when necessary.
- Provide regular updates on service resilience status, metrics, and improvements to stakeholders.
• Bug Fixing:
- Implement small-scale bug fixes (at least 3 years of coding experience required).
- Analyze large-scale bugs and coordinate with relevant teams for resolution.
• Compliance and Documentation:
- Ensure compliance with relevant industry standards and regulations.
- Maintain comprehensive documentation of resilience strategies, processes, and incident responses.
- Participate in audits and reviews as required.
Requirements:
- Bachelor's degree in Computer Science, Software Engineering, or a related field.
- Proficiency in the .NET Framework.
- Strong knowledge of server environments such as AWS, Azure, and independent on-site servers.
- Familiarity with version control tools like Git.
- Experience with highly complex L3 queries and solutions related to servers and their scalability.
- Interest and enthusiasm for technology processes.
- Collaborative skills and a predisposition for teamwork.
- Must be accountable and committed to the job.
- Willingness to learn and adaptability to new technologies.
- Fast learning ability and problem-solving skills.
- Effective communication skills and analytical thinking ability.
Benefits:
We provide:
- Competitive salary
- Meal allowance
- Health insurance
- Flexible working hours
D·engage is an equal opportunity employer committed to diversity and creating an inclusive workplace.