INTERNAL
Enterprise IT Ops Automation & Alerting Ecosystem

Tags: SRE, IT Operations, Automation, ChatOps, Node.js, Python, System Monitoring, Webhooks
1. Problem
The IT Operations team was overwhelmed with "toil": repetitive, manual work such as server health monitoring, log checking, and weekly performance reporting. Because the approach was purely reactive, the team discovered critical system failures only after users complained, resulting in an unacceptably high Mean Time To Recovery (MTTR) and significant engineering burnout.
2. Solution
Engineered a centralized, event-driven IT Operations Automation Suite. I developed custom bots and automated data pipelines that proactively monitor infrastructure health, parse system anomalies, and dispatch real-time alerts directly to the engineering team's communication channels (e.g., Telegram/Slack), alongside fully automated weekly health reports.
3. Architecture
- Execution Environment: Node.js / Python (Daemon Services)
- Integration Layer: REST APIs, Webhooks, Telegram/Slack Bot API
- Task Scheduling: Advanced Cron orchestration
- Data Aggregation: Automated SQL query extraction and log parsing
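As a sketch of how these layers fit together, the core poll-evaluate-dispatch cycle can be reduced to a few functions. The function names, threshold values, and webhook payload shape below are illustrative assumptions, not the production code:

```python
import json
import urllib.request

# Scheduled via cron, e.g.:  */5 * * * * /usr/bin/python3 /opt/ops/healthcheck.py
# Illustrative thresholds; real values lived in per-host configuration.
THRESHOLDS = {"cpu_percent": 85.0, "disk_percent": 90.0}

def find_anomalies(metrics: dict) -> dict:
    """Return only the metrics that breach their configured threshold."""
    return {
        name: value
        for name, value in metrics.items()
        if name in THRESHOLDS and value >= THRESHOLDS[name]
    }

def dispatch_webhook(url: str, anomalies: dict) -> None:
    """POST the anomalies as JSON to an alerting webhook (Slack/Telegram relay)."""
    body = json.dumps({"text": f"Anomalies detected: {anomalies}"}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # fire-and-forget for brevity

if __name__ == "__main__":
    sample = {"cpu_percent": 91.2, "disk_percent": 40.0}
    anomalies = find_anomalies(sample)
    if anomalies:
        print(f"would dispatch: {anomalies}")
```

Cron handles the "when"; the script only needs to answer "is anything anomalous right now, and who do I tell".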
4. Key Engineering Decisions
- Event-Driven Alerting over Manual Polling: Instead of requiring engineers to constantly stare at Grafana dashboards, the system pushes critical alerts to engineers' mobile devices the moment an anomaly-detection threshold is breached.
- Decoupled Notification Gateway: Built the notification logic as an abstracted layer. If the company decides to migrate from Telegram to Slack or Microsoft Teams tomorrow, the core monitoring logic remains untouched; only the gateway connector needs updating.
- Automated Data Pipelines for Reporting: Replaced hours of manual Excel/SQL data gathering with automated scripts that extract, format, and deliver operational health reports to management every Monday at 08:00.
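The decoupled notification gateway described above can be sketched as a small abstraction layer. The class and method names here are hypothetical; the point is that monitoring code depends only on the interface, never on a specific chat platform:

```python
from abc import ABC, abstractmethod

class NotificationGateway(ABC):
    """Transport-agnostic interface; monitoring logic only ever sees send()."""

    @abstractmethod
    def send(self, message: str) -> None: ...

class TelegramGateway(NotificationGateway):
    def __init__(self, token: str, chat_id: str):
        # Token and chat ID come from the environment, never hardcoded.
        self.token, self.chat_id = token, chat_id

    def send(self, message: str) -> None:
        # Real implementation POSTs to the Telegram Bot API sendMessage endpoint.
        raise NotImplementedError("network call omitted in this sketch")

class SlackGateway(NotificationGateway):
    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    def send(self, message: str) -> None:
        # Real implementation POSTs {"text": message} to an incoming webhook.
        raise NotImplementedError("network call omitted in this sketch")

def alert(gateway: NotificationGateway, message: str) -> None:
    """Called by monitoring logic; swapping platforms touches only the gateway."""
    gateway.send(message)
```

Migrating from Telegram to Slack (or Teams) then means constructing a different gateway at startup; `alert()` and everything above it stays untouched.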
5. Challenges
- Alert Fatigue: Initially, the bots triggered too many "warning" notifications for minor CPU spikes, causing engineers to ignore them. I refined the thresholds and implemented "debounce" logic so that alerts only fire if an anomaly persists for X minutes.
- Secure Credential Management: Automation scripts running on production servers had to access database credentials and bot tokens securely, without those secrets ever being hardcoded into the scripts.
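The debounce rule ("only fire if an anomaly persists for X minutes") can be sketched as a small state machine. The class name, `hold_seconds` parameter, and injectable clock are illustrative choices for this sketch, not the production tuning:

```python
import time

class Debouncer:
    """Suppress alerts for transient spikes; fire only on sustained anomalies."""

    def __init__(self, hold_seconds: float, now=time.monotonic):
        self.hold_seconds = hold_seconds
        self._now = now            # injectable clock, useful for testing
        self._breach_start = None  # when the current breach began, if any

    def observe(self, is_anomalous: bool) -> bool:
        """Record one poll result; return True when an alert should fire."""
        if not is_anomalous:
            self._breach_start = None  # spike recovered: reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = self._now()  # breach just started
            return False
        # Fire once the breach has persisted long enough.
        # (A real version would also rate-limit repeat alerts.)
        return self._now() - self._breach_start >= self.hold_seconds
```

A brief CPU spike resets on the next healthy poll and never pages anyone; a breach that survives the hold window does.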
6. Result
- Drastically reduced MTTR by alerting the engineering team to server anomalies minutes before they caused user-facing downtime.
- Eliminated over 15 hours of manual reporting and monitoring tasks per week, allowing the Ops team to focus on strategic infrastructure improvements rather than firefighting.
7. Future Improvements
- Implement an "Auto-Remediation" workflow: when the bot detects a specific service failure, it would not just send an alert but would first attempt a safe restart of the service, escalating to a human engineer only if that fails.
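This planned restart-then-escalate flow could be sketched as follows. Everything here is a hypothetical design sketch of the future workflow, not existing code; `restart` stands in for whatever actually restarts the service (e.g. shelling out to `systemctl restart`), and `notify` is the existing alerting gateway:

```python
def remediate(service: str, restart, notify) -> str:
    """Try one safe automatic restart before paging a human.

    restart: callable taking the service name, raising on failure.
    notify:  callable delivering a message through the alerting channel.
    Returns "remediated" or "escalated" so callers can record the outcome.
    """
    try:
        restart(service)
    except Exception:
        # Restart did not recover the service: a human has to look at it.
        notify(f"{service} is down and auto-restart failed; paging on-call.")
        return "escalated"
    notify(f"{service} was down and has been restarted automatically.")
    return "remediated"
```

Keeping the restart mechanism behind a callable mirrors the decoupled-gateway decision above: the escalation policy stays the same whether the service runs under systemd, Docker, or something else.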