We classify three types of production issues:
outages - An outage is any user-impacting disruption of service.
bugs - Bugs are not always outages. A bug generally affects a single version of the codebase.
Examples of outages include:
meltano.com website is down.
hub.meltano.com website is down.
discovery.yml web endpoint is down.
pipx install meltano is failing for any reason, including upstream package dependency breakages, PyPI outages, etc.
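The website examples above can be checked with a small script. The following is a minimal sketch, not the team's actual monitoring: the `https://` scheme, the function names, and the rule that anything outside 2xx/3xx counts as down are all assumptions made for illustration. The discovery.yml endpoint is omitted because its URL is not given here.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# Sites named in the outage examples. The https:// scheme is an assumption.
ENDPOINTS = ["https://meltano.com", "https://hub.meltano.com"]


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def find_outages(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Return {url: status} for every endpoint that did not answer with 2xx/3xx."""
    outages = {}
    for url in urls:
        status = fetch(url)
        if not 200 <= status < 400:
            outages[url] = status
    return outages
```

Calling `find_outages(ENDPOINTS)` returns an empty dict when both sites respond normally. The `fetch` parameter is injectable so the check can be exercised without network access.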
Examples of critical bugs include:
priority::highest should be alerted ASAP, and should be resolved within 24 hours or sooner. With approval from a Staff Engineer or higher, the problem version may optionally be yanked from PyPI.
Always tag AJ, Taylor, and Florian when a critical bug is identified.
The #meltano-alerts Slack channel receives alerts for outages and high-priority bugs.
The #troubleshooting channel is the primary place we notify users of outages and critical bugs. Depending on severity and percentage of users impacted, we may also notify users in the #announcements channel.
If you are responding to an alert in
If you have identified a production outage or a critical bug and no alert is yet logged to
When outages are expected to impact users, please share the alert or create a new notification in the #troubleshooting channel. Users who would otherwise inquire in #troubleshooting should discover your notification and know that the Meltano team is addressing the issue.
Occasionally we observe outages due to upstream service failures.
If the issue requires action from us or is otherwise worthy of investigation, we should log an issue for tracking our work and then proceed with the alerting process.
If the issue does not require any action from us, such as a significant PyPI or GitLab service outage, we may not need to open an issue but we should nevertheless notify users as appropriate.
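The two upstream cases above amount to a single decision. As a rough sketch only, with the class and function names being hypothetical and not part of any Meltano tooling, the rule could be written as:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UpstreamFailure:
    service: str           # e.g. "PyPI" or "GitLab"
    requires_action: bool  # True if we must act or the failure merits investigation


def next_steps(failure: UpstreamFailure) -> List[str]:
    """Return the follow-up steps for an upstream failure, in order."""
    if failure.requires_action:
        return [
            "log an issue to track the work",
            "proceed with the alerting process",
        ]
    # No action needed from us: an issue may be unnecessary,
    # but users should still be told.
    return ["notify users as appropriate"]
```

For a significant PyPI outage that needs nothing from us, `next_steps(UpstreamFailure("PyPI", requires_action=False))` yields only the user notification step.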