Meltano maintains an on-call rotation schedule that involves every engineer on our team.
Each engineer participates in an on-call rotation that lasts two weeks — one week as the secondary on-call engineer, and the following week as the primary on-call engineer.
The rotation period runs from Monday to Sunday, starting/ending at noon Eastern time. Noon Eastern time is used because every Meltano engineer has regular working hours that overlap with that time, which can simplify any handoffs should an incident persist beyond the end of one’s primary on-call period.
We use OpsGenie to manage the on-call schedule, and alert on-call engineers. More details about our use of OpsGenie are provided below.
During their assigned week, the engineers are responsible for handling any high-severity incidents that arise outside of normal working hours.
Primary on-call engineer: The primary on-call engineer is responsible for responding to incidents first. They are expected to be ready to troubleshoot and solve problems as they come up.
Secondary on-call engineer: The secondary on-call engineer acts as a backup to the primary. If the primary on-call engineer is not available or requires additional assistance, the secondary on-call engineer will be contacted.
Newly hired engineers are added at the end of the on-call rotation queue to allow them sufficient time to understand our systems and processes before their turn. Their manager may elect to skip over their turn in the on-call rotation as-needed until they are sufficiently prepared/experienced to handle being on-call.
After serving a week as the primary on-call engineer, that engineer will be placed at the end of the on-call rotation queue.
Engineers may, with their manager’s approval, trade places in the on-call rotation queue, or otherwise cover for one another. It is advised that this not be done often, and only be used for exceptional circumstances. Every engineer is expected to participate in the on-call rotation.
We use OpsGenie to manage our on-call schedule and issue alerts to the on-call engineers. Alerts are issued automatically when our StatusPage detects an outage for any of the sites/APIs it monitors. Alerts may also be issued manually.
When an alert is sent, the primary on-call engineer will be notified first. If they do not acknowledge it within 12 hours, the secondary on-call engineer will be notified. The first thing the secondary on-call engineer should attempt to do is make contact with the primary on-call engineer, and coordinate with them. That said, once the secondary on-call engineer has been alerted, they should not wait for the primary on-call engineer to respond. They may begin investigating and addressing the incident immediately following their attempt to contact and coordinate with the primary on-call engineer.
On-call engineers should:
The on-call engineers are expected to triage incidents, and respond to alerts, but not necessarily address every issue that arises themselves. It is okay and expected that, should the need arise, the on-call engineers will request aid from other Engineering Team members. Other Engineering Team members are not expected to provide aid or check for work notifications outside of regular work hours, but are expected to provide reasonable aid when requested during regular work hours.
Engineers should not perform their regular assigned tasks while acting as the primary on-call engineer. The reasons for this are twofold:
Instead of working on regular assigned tasks, the primary on-call engineer may spend their time not spent dealing with incidents/alerts on whatever they’d like, provided it reasonably benefits Meltano. Being the primary on-call engineer is not a pseudo-vacation in which one may be totally unproductive while waiting for an incident/alert to respond to. Rather, this work can include addressing low-priority pet peeves, research/experimentation, professional development/training, trying out interesting hacks (à la Meltano Hack Day), and more. If you’re unsure if a given activity would be appropriate for your time as the primary on-call engineer, feel free to ask your manager about it.
Because engineers should not perform their regular assigned tasks while acting as the primary on-call engineer, during the week leading up to being the primary on-call engineer (i.e. the week in which one is the secondary on-call engineer), engineers should avoid picking up any tasks that might block others if not completed soon, and they should hand-off or complete any such tasks promptly. With the aforementioned caveats, the secondary on-call engineer is still expected to work on regular assigned tasks.
As of 2023-07-17, Meltano’s on-call policy is considered to be in effect for security incidents only.
Once Meltano Cloud has its GA release, Meltano’s on-call policy will be in effect for all incidents, including security incidents and outages.
This section of the handbook (i.e. “On-Call Rollout”) may be removed once Meltano Cloud has had its GA release.