About the Team:
You will be part of a dynamic team that is at the heart of ensuring seamless operations across our platforms. Our team thrives on collaboration, innovation, and a commitment to delivering best-in-class services. We work closely with cross-functional teams including DevOps, Infrastructure, and Application Support to ensure the highest levels of service availability and performance.
About the Role:
We are looking for a detail-oriented and proactive Monitoring Engineer to join our team. This role involves setting up, managing, and maintaining monitoring systems that ensure the availability, performance, and reliability of our technology stack. As a Monitoring Engineer, you will play a crucial role in identifying and addressing potential system issues before they impact users.
Responsibilities:
Provide 24x7x365 on-call support (in rotation) to manage and execute the Incident Management process.
Respond quickly and effectively to service failure alerts and notifications from a range of systems.
Assess the impact and severity of service failures for both internal and external stakeholders.
Manage bank/PG downtime and other service disruptions against SLA targets.
Escalate downtime within the bank/PG as well as internally.
Accurately track progress and escalations on issues in internal ticketing systems.
Update merchants and internal stakeholders on the status of any service outage, either directly by phone and email or via the ticketing tool.
Notify merchants by email of any planned maintenance, whether internal or at the bank/PG.
Manage the outcomes of Reason for Outage (RFO/RCA) reports and Major Incident Reports (MIR), both internally and externally.
Hands-on experience with databases (SQL).
Hands-on experience with Python and shell scripting.
Develop software to automate repeatable operations tasks (toil).
Schedule and lead all continuous improvement activities, including incident reviews, change implementation reviews, and identification of candidate areas for toil automation.
Optimize the Software Development Life Cycle (SDLC) based on post-incident reviews to boost service reliability.
Document the knowledge gained to ensure a seamless flow of information between teams.
Requirements:
2-5 years of experience as an Application Support Engineer/Monitoring Engineer
Excellent communication skills
Patient and friendly attitude with excellent interpersonal skills
Ability to work on your own initiative and meet tight deadlines
Flexible and able to work rotational shifts within a 24/7/365 shift pattern
Advanced knowledge of SQL and Linux, plus Python or shell scripting or comparable coding expertise
Knowledge of payment flows and processes
SRE background with a focus on automation (required)
What we offer:
A positive, get-things-done workplace
A dynamic, constantly evolving space (change is par for the course, so it is important that you are comfortable with it)
An inclusive environment that ensures we listen to a diverse range of voices when making decisions.
Opportunity to learn cutting-edge concepts and innovate in an agile start-up environment operating at global scale
Access to 5000+ training courses, available anytime and anywhere, to support your growth and development (through corporate partnerships with top learning providers such as Harvard, Coursera, and Udacity)