RedEyes Host Monitor: Complete Setup and First 24-Hour Checklist
Overview
RedEyes Host Monitor is a server and service monitoring tool designed to detect outages, performance degradation, and configuration issues quickly. This guide walks through a practical, step-by-step setup and a prioritized checklist for the first 24 hours to ensure coverage, alerting, and basic troubleshooting workflows are in place.
Before you begin
- Requirements: Admin access to the servers/services you’ll monitor, a RedEyes account with an appropriate plan, API keys if integrating with third-party tools, and access to your team’s notification channels (email, Slack, PagerDuty, etc.).
- Assumptions: Default network settings allow outbound monitoring traffic; you have at least one target host (Linux/Windows) and one service (web, DB, or app) to monitor.
Step 1 — Initial account and team configuration (0–30 minutes)
- Create or sign in to your RedEyes account.
- Add team members and set roles (Admin, Operator, Viewer).
- Configure primary notification channels:
  - Email: Admin and on-call address.
  - Slack/PagerDuty: Connect via integration settings and test a webhook.
- Set global escalation policies and contact schedules (on-call rotation).
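Testing a webhook before you rely on it is worth a minute of scripting. The sketch below builds a Slack-compatible test message with Python's standard library; the webhook URL is a placeholder, and the payload shape (a JSON body with a `text` field) is the generic incoming-webhook format, not anything RedEyes-specific.

```python
import json
import urllib.request

def send_test_alert(webhook_url: str, channel_name: str) -> urllib.request.Request:
    """Build a test-alert POST for a Slack-style incoming webhook."""
    payload = {"text": f"RedEyes test alert for channel '{channel_name}' - please confirm receipt."}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL -- substitute your real webhook, then deliver with urllib.request.urlopen(req)
req = send_test_alert("https://hooks.slack.com/services/XXX/YYY/ZZZ", "ops-alerts")
print(req.get_method())
```

Have a teammate confirm receipt in the channel; a webhook that silently 404s is a common reason "missing alerts" shows up later in troubleshooting.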
Step 2 — Add and categorize hosts (30–60 minutes)
- Create a host group for each environment: Production, Staging, Development.
- Add host entries with:
  - Hostname/IP
  - OS type (Linux/Windows)
  - Location/Datacenter tag
  - Criticality level (P0, P1, P2)
- Install any required RedEyes agent or enable agentless checks (SSH/WMI) as appropriate. Verify connectivity.
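For the "verify connectivity" step, a quick TCP reachability probe catches most firewall and routing problems before you debug the agent itself. This is a generic sketch (the hostnames are hypothetical examples); it checks the ports agentless access typically needs, such as 22 for SSH or 135 for WMI/RPC.

```python
import socket

def verify_connectivity(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Hypothetical inventory: (hostname, agentless-access port)
hosts = [("db01.example.internal", 22), ("win01.example.internal", 135)]
for host, port in hosts:
    status = "reachable" if verify_connectivity(host, port) else "UNREACHABLE"
    print(f"{host}:{port} {status}")
```

A host that fails here will never report cleanly no matter how the agent is configured, so fix reachability first.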
Step 3 — Define key checks and thresholds (60–120 minutes)
Create checks for every critical host and service. Prioritize:
- Ping/ICMP — basic reachability.
- CPU usage — alert at 85% sustained for 5 minutes.
- Memory usage — alert at 90% sustained for 5 minutes.
- Disk usage — warn at 75%, critical at 90%.
- HTTP(S) health — status code, response time threshold (e.g., >2s).
- TCP port checks — SSH (22), HTTP/HTTPS (80/443), DB ports.
- Service process checks — ensure crucial daemons are running.
- Custom application checks — SQL query health, API response validation.
Set check intervals (1–5 minutes for production-critical; 5–15 minutes for lower tiers).
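The HTTP(S) health check above combines two conditions: an acceptable status code and a response time under the threshold (e.g., 2s). If you want to validate your thresholds outside RedEyes, a minimal stdlib sketch of that logic looks like this — the function name and return shape are illustrative, not a RedEyes API:

```python
import time
import urllib.request
import urllib.error

def http_health_check(url: str, max_seconds: float = 2.0, timeout: float = 10.0):
    """Probe an HTTP(S) endpoint; return (ok, status_code, elapsed_seconds).

    ok is True only when the status is 2xx AND the response arrived
    within max_seconds -- mirroring a combined status/latency check.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # server responded, but with an error status
    except (urllib.error.URLError, OSError):
        return (False, None, time.monotonic() - start)
    elapsed = time.monotonic() - start
    ok = (200 <= status < 300) and elapsed <= max_seconds
    return (ok, status, elapsed)

# Example: http_health_check("https://app.example.com/healthz")
```

Running this by hand against an endpoint is also a sanity check that your threshold is realistic before the monitor starts paging on it.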
Step 4 — Configure alerting rules and escalation (120–150 minutes)
- Link checks to notification rules based on host criticality.
- Define alert severity mapping (Warning, Critical, Info).
- Set automatic escalation: if unacknowledged after X minutes, notify next contact.
- Configure blackout/maintenance windows for planned changes.
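The "notify next contact after X minutes" rule is easier to reason about written down. Here is a sketch of that escalation logic with a hypothetical three-level chain (the addresses and delays are examples, not defaults); each entry pairs a contact with the number of unacknowledged minutes before they are paged.

```python
# Hypothetical escalation chain: (contact, minutes unacknowledged before notifying)
ESCALATION_CHAIN = [
    ("primary-oncall@example.com", 0),     # paged immediately
    ("secondary-oncall@example.com", 15),  # if unacknowledged after 15 min
    ("team-lead@example.com", 30),         # if still unacknowledged after 30 min
]

def contacts_to_notify(alert_age_minutes: float, acknowledged: bool) -> list:
    """Return everyone who should have been paged for an alert of the given age."""
    if acknowledged:
        return []  # acknowledgement stops the escalation clock
    return [contact for contact, after in ESCALATION_CHAIN if alert_age_minutes >= after]

print(contacts_to_notify(20, acknowledged=False))
# → ['primary-oncall@example.com', 'secondary-oncall@example.com']
```

Walking through a table like this with the on-call team before go-live avoids surprises when the first real alert escalates at 2 a.m.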
Step 5 — Integrations and runbooks (150–210 minutes)
- Integrate with incident management (PagerDuty, OpsGenie) and chatops (Slack/MS Teams).
- Attach runbooks or playbooks to checks and alert types for quick remediation steps.
- Configure automated remediation where safe (restart service scripts).
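"Where safe" is the operative phrase for automated remediation. One common safety pattern is an explicit allowlist plus a dry-run default, sketched below; the service names and the use of `systemctl` are assumptions for a Linux host, and the allowlist contents are examples only.

```python
import subprocess

# Allowlist: restarting anything not listed here requires a human.
SAFE_TO_RESTART = {"nginx", "redeyes-agent"}  # example names, adjust for your fleet

def remediate_service(name: str, dry_run: bool = True) -> str:
    """Restart an allowlisted service via systemctl; refuse anything else."""
    if name not in SAFE_TO_RESTART:
        return f"refused: {name} is not on the auto-remediation allowlist"
    cmd = ["systemctl", "restart", name]
    if dry_run:
        return "would run: " + " ".join(cmd)  # safe default for testing the hook
    result = subprocess.run(cmd, capture_output=True, text=True)
    return "restarted" if result.returncode == 0 else f"failed: {result.stderr.strip()}"

print(remediate_service("nginx"))       # dry run by default
print(remediate_service("postgresql"))  # not allowlisted -> refused
```

Keeping the allowlist small and version-controlled alongside the runbooks makes it auditable when someone asks why a service restarted itself.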
First 24-Hour Checklist (Prioritized)
Use this checklist immediately after setup to validate monitoring effectiveness.
Hour 0–1: Validation basics
- Confirm all hosts show online in RedEyes dashboard.
- Send test alerts to each notification channel; confirm receipt.
- Verify agent connectivity and check execution logs.
Hour 1–4: Functional checks
- Trigger synthetic HTTP checks and validate response-time alerts.
- Simulate a service restart to ensure process checks detect state changes.
- Create a temporary disk fill (on non-production) to test disk thresholds and alerting.
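Before filling a disk, it helps to confirm the classification logic you expect the alert to follow (warn at 75%, critical at 90%, per Step 3). A small stdlib sketch of that mapping, with a helper that reads real usage via `shutil.disk_usage`:

```python
import shutil

def classify_disk(percent: float, warn: float = 75.0, critical: float = 90.0) -> str:
    """Map a used-space percentage onto the warn/critical thresholds from Step 3."""
    if percent >= critical:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"

def disk_status(path: str) -> str:
    """Classify actual disk usage at `path`."""
    usage = shutil.disk_usage(path)
    return classify_disk(usage.used / usage.total * 100)

print(disk_status("/"))  # current state of the root filesystem
```

While the temporary fill is running on the non-production host, this gives you an independent reading to compare against what RedEyes reports and when it fires.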
Hour 4–8: Escalation and workflows
- Confirm escalation rules by leaving a test alert unacknowledged and observing the escalation chain.
- Validate runbook accessibility from alert details.
- Ensure on-call schedule routes correctly across timezones.

Hour 8–16: Fine-tuning thresholds
- Review alert noise: adjust thresholds or add suppression for benign flaps.
- Tune check intervals based on observed stability and load.
- Add flapping detection or alert deduplication to reduce duplicates.
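If RedEyes exposes flapping detection, prefer the built-in; the sketch below just illustrates the underlying idea so you know what to tune — a check is "flapping" when it changes state too many times within a sliding window. The window size and change threshold here are arbitrary example values.

```python
from collections import deque

class FlapDetector:
    """Flag a check as flapping when it changes state too often in a sliding window."""

    def __init__(self, window: int = 20, max_changes: int = 6):
        self.history = deque(maxlen=window)  # most recent states, oldest dropped
        self.max_changes = max_changes

    def observe(self, state: str) -> bool:
        """Record the latest state; return True if the check is currently flapping."""
        self.history.append(state)
        recent = list(self.history)
        changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
        return changes >= self.max_changes

det = FlapDetector(window=10, max_changes=4)
for s in ["ok", "crit", "ok", "crit", "ok", "crit", "ok"]:
    flapping = det.observe(s)
print(flapping)  # → True: six state changes in the last seven samples
```

A flapping check should be suppressed into a single "unstable" notification rather than paging on every transition.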
Hour 16–24: Reliability and reporting
- Run a simulated incident drill (non-production) using an injected failure and follow escalation steps end-to-end.
- Review initial alert metrics: counts, false positives, and missed checks.
- Set up daily report emails summarizing uptime, alerts, and key metrics for the team.
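If your plan's built-in reports don't cut the data the way you want, the aggregation behind a daily summary is simple enough to prototype. The alert log format below is hypothetical (host/severity pairs pulled from whatever export or API your setup provides):

```python
from collections import Counter

# Hypothetical alert log from the first day: (host, severity) pairs
alerts = [
    ("web01", "warning"), ("web01", "critical"), ("db01", "warning"),
    ("web01", "warning"), ("db01", "critical"),
]

def daily_summary(alerts):
    """Summarize alert counts by severity and by host for a daily report email."""
    by_severity = Counter(sev for _, sev in alerts)
    by_host = Counter(host for host, _ in alerts)
    noisiest = by_host.most_common(1)[0] if by_host else None
    return {
        "total": len(alerts),
        "by_severity": dict(by_severity),
        "noisiest_host": noisiest,  # (host, count) -- first place to look for noise
    }

print(daily_summary(alerts))
```

The "noisiest host" figure feeds directly into the Hour 8–16 threshold-tuning pass: the host generating the most alerts is usually where suppression or threshold adjustments pay off first.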
Best practices and tips
- Start with conservative thresholds and tighten over time as you learn normal behavior.
- Use tagging consistently (service, environment, owner) to filter and route alerts.
- Prefer short check intervals for critical systems but balance against system and monitoring load.
- Use maintenance windows for deployments to avoid alert storms.
- Keep runbooks short, actionable, and version-controlled.
Troubleshooting common setup issues
- Agent not reporting: check firewall outbound, correct agent config, and system time synchronization.
- Missing alerts: verify notification integrations and test webhooks.
- Too many alerts: implement suppression, increase thresholds, or group related alerts.
Next steps after 24 hours
- Review and adjust based on real alerts.
- Expand monitoring to cover more services, edge locations, and synthetic transactions.
- Automate common remediations and refine incident postmortems into checklist improvements.
Appendix: Minimum checklist summary
- Team and notification channels configured.
- Hosts added and categorized by environment and criticality.
- Key checks and thresholds defined for critical hosts and services.
- Alerting rules, escalation, and maintenance windows in place.
- Integrations and runbooks attached to checks.
- First 24-hour validation and drills completed.