Incident Report

...

Page Properties


Incident Date(s):
Incident Start Time (PST): ~1:50 AM
Incident End Time (PST): ~5:15 AM
Incident Duration (in Hours): ~3.5 hours
Estimated number of users affected:
Type of Users Affected: Students, Campus SYSTEMS


...

  • The initial problem was caught through Uptime Robot's recurring "up-ness checks"
    • Log: uptimerobot-all_monitors-logs.csv (attached file)
  • The error message returned to clients was:

    Code Block
    {
        "fault": {
            "faultstring": "Execution of ServiceCallout UCSB-Standard-Logly-Event-Tracking failed. Reason: timeout occurred in UCSB-Standard-Logly-Event-Tracking",
            "detail": {
                "errorcode": "steps.servicecallout.ExecutionFailed"
            }
        }
    }


  • From there, Apigee's API monitoring tooling was used to discover that the majority of the 500 responses were coming from the upcheck-Response fault policy (which was actually a bit of a red herring).

    • Unfortunately, Apigee's monitoring and investigation reports did not contain enough information to drill in any deeper.
  • At that point, Apigee's trace tooling was loaded and real sessions were monitored. This is what captured the actual 500 error event.
    The actual issue was that a timeout occurred in UCSB-Standard-Logly-Event-Tracking.


    • Looking into that particular step, we could see it was taking 65,314 nanoseconds (0.065 ms), which doesn't look like a timeout.


    • With that information in hand, Steven Maglio decided to disable the "logly" step.
  • COINCIDING AT THIS SAME TIME
    • CENIC Issue 
      • Through the Google Hangouts Incident Discussion channel, around 8:09 AM Kevin Schmidt reported there was a CENIC issue which had been creating the intermittent issues.
    • Campus Firewall Maintenance
      • That same morning the Campus NOC had scheduled Firewall maintenance to occur between 5:30 AM and 7:30 AM
    • Student Affairs Remoting VPN Inaccessible
      • From at least 4:30 AM until 5:30 AM, Steven Maglio could not connect to the Student Affairs VPN endpoint.
        (screenshot)
      • Starting at 5:30 AM (when Campus Firewall maintenance was set to occur), he could not get passport.sa.ucsb.edu's DNS entry to resolve
      • Until the Firewall upgrade started at 5:30 AM, this was the error:
        (screenshot)
      • After 5:30 AM, it changed to this:
        (screenshot)
    • External Client Information
      • Instructional Development / ESCI - Has a scheduled job which pulls information for Instructor Evaluation. Their job errored out at 2:06 AM:
        Log: instructional-development-processing-log-20200207.log (attached file)
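
As a rough illustration of the triage steps in the list above, the sketch below pulls the failing policy name and error code out of the fault payload returned to clients, and converts the step duration seen in the trace from nanoseconds to milliseconds. The function names `summarize_fault` and `ns_to_ms` are hypothetical helpers, not part of Apigee or any tooling used during the incident, and the regular expression assumes the faultstring wording shown in the error message above rather than any documented format.

```python
import json
import re

# Fault payload copied from the error message returned to clients.
FAULT_BODY = """
{
    "fault": {
        "faultstring": "Execution of ServiceCallout UCSB-Standard-Logly-Event-Tracking failed. Reason: timeout occurred in UCSB-Standard-Logly-Event-Tracking",
        "detail": {
            "errorcode": "steps.servicecallout.ExecutionFailed"
        }
    }
}
"""

NS_PER_MS = 1_000_000  # nanoseconds per millisecond


def summarize_fault(body: str) -> dict:
    """Extract the error code, policy name, and timeout flag (hypothetical helper)."""
    fault = json.loads(body)["fault"]
    faultstring = fault["faultstring"]
    # Assumption: the policy name follows "Execution of ServiceCallout ".
    match = re.search(r"Execution of ServiceCallout (\S+) failed", faultstring)
    return {
        "errorcode": fault["detail"]["errorcode"],
        "policy": match.group(1) if match else None,
        "is_timeout": "timeout occurred" in faultstring,
    }


def ns_to_ms(ns: int) -> float:
    """Convert a duration in nanoseconds to milliseconds."""
    return ns / NS_PER_MS


summary = summarize_fault(FAULT_BODY)
step_ms = ns_to_ms(65_314)  # step duration observed in the trace

print(summary["policy"], summary["errorcode"], summary["is_timeout"])
print(f"{step_ms:.3f} ms")
```

The conversion shows why the traced step duration looked inconsistent with the reported timeout: 65,314 nanoseconds is well under a tenth of a millisecond.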

...