Incident Report
...
...
- The initial problem was caught through Uptime Robot's recurring "up-ness checks"
- Log
Attached file: uptimerobot-all_monitors-logs.csv
The error message returned to clients was:
{ "fault": { "faultstring": "Execution of ServiceCallout UCSB-Standard-Logly-Event-Tracking failed. Reason: timeout occurred in UCSB-Standard-Logly-Event-Tracking", "detail": { "errorcode": "steps.servicecallout.ExecutionFailed" } } }
- From there, Apigee's API monitoring tooling was used to discover that the majority of the 500 responses were coming from the upcheck-Response fault policy (which was actually a bit of a red herring).
- Unfortunately, Apigee's monitoring and investigation reports didn't contain enough information to drill in any deeper.
- At that point, Apigee's trace tooling was loaded, and real sessions started to be monitored. This is what captured the actual 500 error event.
The actual issue was that a timeout occurred in UCSB-Standard-Logly-Event-Tracking.
- Looking into that particular step, we could see it was taking 65,314 nanoseconds (0.065 ms), which doesn't feel like a timeout.
- With that information in hand, Steven Maglio decided to disable the "logly" step.
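As a cross-check of the two observations above, the following is a minimal sketch (not part of the original investigation) that parses the fault payload quoted earlier and converts the traced step duration into milliseconds. The helper names `summarize_fault` and `ns_to_ms` are hypothetical and exist only for illustration; the payload and the 65,314 ns figure are copied from this report.

```python
import json

# Fault payload copied verbatim from the error message quoted in this report.
FAULT_BODY = (
    '{ "fault": { "faultstring": "Execution of ServiceCallout '
    'UCSB-Standard-Logly-Event-Tracking failed. Reason: timeout occurred in '
    'UCSB-Standard-Logly-Event-Tracking", "detail": { "errorcode": '
    '"steps.servicecallout.ExecutionFailed" } } }'
)


def summarize_fault(body):
    """Pull the failing policy name, error code, and a timeout flag out of a fault body."""
    fault = json.loads(body)["fault"]
    faultstring = fault["faultstring"]
    prefix = "Execution of ServiceCallout "
    policy = None
    if faultstring.startswith(prefix):
        # The policy name is the first token after the fixed prefix.
        policy = faultstring[len(prefix):].split(" ", 1)[0]
    return {
        "policy": policy,
        "errorcode": fault["detail"]["errorcode"],
        "is_timeout": "timeout occurred" in faultstring,
    }


def ns_to_ms(nanoseconds):
    """Convert nanoseconds to milliseconds (1 ms = 1,000,000 ns)."""
    return nanoseconds / 1_000_000


summary = summarize_fault(FAULT_BODY)
step_ms = ns_to_ms(65_314)  # duration reported by the trace tooling

print(summary)
print(f"{step_ms:.3f} ms")
```

The conversion confirms the figure in the text: 65,314 ns is roughly 0.065 ms, far shorter than anything one would normally expect to register as a timeout.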
- COINCIDING AT THIS SAME TIME
- CENIC Issue
- Through the Google Hangouts Incident Discussion channel, around 8:09 AM Kevin Schmidt reported there was a CENIC issue which had been creating the intermittent issues.
- Campus Firewall Maintenance
- That same morning the Campus NOC had scheduled Firewall maintenance to occur between 5:30 AM and 7:30 AM
- Student Affairs Remoting VPN Inaccessible
- From at least 4:30 AM until 5:30 AM, Steven Maglio could not connect to the Student Affairs VPN endpoint.
- Starting at 5:30 AM (when Campus Firewall maintenance was set to occur), he could not get passport.sa.ucsb.edu's DNS entry to resolve
- Until the Firewall upgrade started at 5:30 this was the error
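For anyone wanting to reproduce the DNS symptom described above, here is a minimal sketch of a resolution check. It is an illustration only, not the method used during the incident. The function name `hostname_resolves` and its injectable `resolver` parameter are assumptions introduced here so the check can be exercised without network access.

```python
import socket


def hostname_resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False if the resolver reports a failure."""
    try:
        # Port is None because only name resolution matters here.
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # Raised by socket.getaddrinfo when the name cannot be resolved.
        return False


# Example (requires network access):
# hostname_resolves("passport.sa.ucsb.edu")
```

A result of False would match the failure described for passport.sa.ucsb.edu, but it says nothing about why resolution failed.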
- External Client Information
- Instructional Development / ESCI has a scheduled job which pulls information for Instructor Evaluation. Their job errored out at 2:06 AM:
Attached file: instructional-development-processing-log-20200207.log
...