2020-02-07 Uptime Robot Reporting Rapid Failure Recovery Cycles

Incident Report

Details

Incident Date(s): 2020-02-07
Incident Start Time (PST): ~1:50 AM
Incident End Time (PST): ~5:15 AM
Incident Duration (in Hours): ~3.5 hours
Est. # Users Affected: 
Type of Users Affected: STUDENTS, CAMPUS SYSTEMS

Summary of the Incident

Intermittent connectivity from Apigee Cloud to the Campus Network affected the completion of calls going through Apigee. The final step of every call going through Apigee logs the call to an on-campus Splunk instance. That step was being reported as a timeout (though it may actually have been a connectivity issue), which caused the overall Apigee call to fail with a 500 error response. The Apigee step that logs the call was disabled to remediate the problem.
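
For context, the "logly" step is an Apigee Service Callout policy that posts details of the completed call to the on-campus Splunk instance over HTTP. The sketch below shows the general shape of such a policy; the URL, header, payload, and variable names are illustrative assumptions, not the exact production configuration.

    <!-- Illustrative sketch only; the real UCSB-Standard-Logly-Event-Tracking policy
         may differ in URL, headers, and payload -->
    <ServiceCallout name="UCSB-Standard-Logly-Event-Tracking">
        <Request variable="loglyRequest">
            <Set>
                <Headers>
                    <!-- Splunk requires an authentication token sent in an HTTP header -->
                    <Header name="Authorization">Splunk {private.splunk.token}</Header>
                </Headers>
                <Payload contentType="application/json" variablePrefix="@" variableSuffix="#">{"proxy":"@apiproxy.name#","status":"@message.status.code#"}</Payload>
                <Verb>POST</Verb>
            </Set>
        </Request>
        <!-- The <Response> element makes the callout blocking, so a timeout here
             fails the overall API call with a 500 -->
        <Response>loglyResponse</Response>
        <HTTPTargetConnection>
            <URL>https://splunk.example.ucsb.edu:8088/services/collector/event</URL>
        </HTTPTargetConnection>
    </ServiceCallout>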

Timeline (PST)

  • 1:51 AM - Uptime Robot - Reports first outage of an API and begins to send warning emails
  • 2:06 AM - Instructional Development - Scheduled Job reported error
  • ~4:00 AM - Steven Maglio sees warning emails and investigates
  • ~4:30 AM - Steven Maglio gets to a desktop computer to continue investigation
  • ~5:00 AM - Discovery that the "logly" (i.e. Splunk) step is the reason behind the 500 errors
  • ~5:10 AM - Disabling the "logly" step in the development environment is verified to solve the issue
  • ~5:10 AM - Disabling the "logly" step is deployed to production
  • 5:14 AM - Uptime Robot - Reports all APIs are back up and functioning normally
  • 8:09 AM - Kevin Schmidt reports that the underlying networking issue had something to do with CENIC, which Kevin was able to "route around" a few minutes earlier (~8 AM)


Diagnosis Process

  • The initial problem was caught through Uptime Robot's recurring "up-ness checks"
  • The error message returned to clients was:

    {
        "fault": {
            "faultstring": "Execution of ServiceCallout UCSB-Standard-Logly-Event-Tracking failed. Reason: timeout occurred in UCSB-Standard-Logly-Event-Tracking",
            "detail": {
                "errorcode": "steps.servicecallout.ExecutionFailed"
            }
        }
    }
  • From there, Apigee's API monitoring tooling was used to discover that the majority of the 500 responses were coming from the upcheck-Response fault policy (which turned out to be a bit of a red herring).

    • Unfortunately, Apigee's monitoring and investigation reports didn't contain enough information to drill in much deeper.
  • At that point, Apigee's trace tooling was loaded, and real sessions were monitored. This is what captured the actual 500 error event:
    the actual issue was that a timeout occurred in UCSB-Standard-Logly-Event-Tracking.


    • Looking into that particular step, we could see it was taking 65,314 nanoseconds (0.065 ms), which does not look like a timeout.


    • With that information in hand, Steven Maglio decided to disable the "logly" step.
  • COINCIDING AT THIS SAME TIME
    • CENIC Issue 
      • Through the Google Hangouts Incident Discussion channel, around 8:09 AM Kevin Schmidt reported there was a CENIC issue which had been creating the intermittent issues.
    • Campus Firewall Maintenance
      • That same morning the Campus NOC had scheduled Firewall maintenance to occur between 5:30 AM and 7:30 AM
    • Student Affairs Remoting VPN Inaccessible
      • From at least 4:30 AM until 5:30 AM, Steven Maglio could not connect to the Student Affairs VPN endpoint.
      • Starting at 5:30 AM (when Campus Firewall maintenance was set to occur), he could not get passport.sa.ucsb.edu's DNS entry to resolve
    • External Client Information
      • Instructional Development / ESCI - Has a scheduled job which pulls information for Instructor Evaluations. Their job errored out at 2:06 AM.

Resolution and Recovery

  • Once it was decided to disable the "logly" step, the shared flow, UCSB-Standard-PostProxy, was updated to disable the step UCSB-Standard Splunk Event Tracking (i.e. UCSB-Standard-Logly-Event-Tracking, or the "logly" step); a sketch of this change follows this list.
    • The usage of the async flag has been researched by Christian Montecino before. A follow-up with him should be set up to better understand what his research found.
  • This was then verified in the development environment, where the step was seen to be disabled.
  • The update was then deployed to the Test and Prod environments.
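
For reference, the sketch below shows one way the "logly" step could be disabled in the shared flow definition; this is an illustrative reconstruction, not the exact change that was deployed.

    <!-- Sketch of the UCSB-Standard-PostProxy shared flow definition after the change
         (illustrative only) -->
    <SharedFlow name="default">
        <!-- The "logly" step is commented out so a Splunk timeout can no longer
             fail the overall API call -->
        <!--
        <Step>
            <Name>UCSB-Standard-Logly-Event-Tracking</Name>
        </Step>
        -->
        <!-- Alternatively, the policy itself could be marked enabled="false" -->
    </SharedFlow>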

Corrective and Preventative Measures

  • Uptime Robot was useful in the early detection of the problem.
  • Christian Montecino has previously reached out to Apigee to determine if these post-call logging steps could be run in a way that would never impact the result of the actual API call being performed. He will most likely continue his conversation with Apigee to determine the best possible way to do this.
    • Notes
      • The main options for sending this information to Splunk are outlined in this Apigee Community Post.
      • Currently, we are using an Apigee Service Callout Policy (Option 1, Log over HTTP). The Service Callout Policy page notes that the async attribute has been deprecated and will always be "false".
      • "Option 2, Log over TCP", uses the Message Logging Policy to make the call asynchronously. This seems to be the only option that can currently be asynchronous. However, the issue with this policy is that you cannot set an HTTP header for the call to Splunk. Our Splunk setup requires an Authentication Token to be sent in the header.
      • "Option 3, Log via javascript", uses the Javascript Policy to make the call synchronously. The async attribute is marked as deprecated and will always be "false".
      • Option 4 on that page only applies to on-premise Apigee installations, and we are using the Cloud version.
    • We are going to go with another approach:
      • Update the Shared Flow to move the logging into a PostClientFlow.
        ARPRSOA-414
      • Change the Service Callout Policy to not use a <Response> element. When that element does not exist, the policy should run as a fire-and-forget call, which should prevent errors from stopping the flow. (A sketch of both changes follows this list.)
        ARPRSOA-414
  • In the future, when campus-wide networking work is scheduled, the Campus API Team may want to send out notifications that the work could disrupt service.
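
A sketch of the intended shape of those two changes is below. This is an assumption-heavy illustration pending ARPRSOA-414: removing the <Response> element should make the Service Callout fire-and-forget, and attaching the logging in a PostClientFlow would run it only after the response has been returned to the client. Apigee documents restrictions on which policy types may execute in a PostClientFlow, so the final configuration may differ.

    <!-- 1. Fire-and-forget Service Callout: no <Response> element, so the proxy does not
            block on (or fail because of) the call to Splunk (illustrative sketch) -->
    <ServiceCallout name="UCSB-Standard-Logly-Event-Tracking" continueOnError="true">
        <Request variable="loglyRequest">
            <Set>
                <Headers>
                    <Header name="Authorization">Splunk {private.splunk.token}</Header>
                </Headers>
                <Verb>POST</Verb>
            </Set>
        </Request>
        <HTTPTargetConnection>
            <URL>https://splunk.example.ucsb.edu:8088/services/collector/event</URL>
        </HTTPTargetConnection>
    </ServiceCallout>

    <!-- 2. Run the logging after the response has been sent to the client.
            The step name is hypothetical; it would reference a Flow Callout
            to the UCSB-Standard-PostProxy shared flow -->
    <ProxyEndpoint name="default">
        <PostClientFlow>
            <Response>
                <Step>
                    <Name>FC-UCSB-Standard-PostProxy</Name>
                </Step>
            </Response>
        </PostClientFlow>
    </ProxyEndpoint>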

Lessons Learned

  • As a team, we did not recognize that the Campus Wide Firewall update could have an effect on the API Gateway system. Looking back, we all conceptually understood that it would have an impact, but we did not make the logical connection that we needed to prepare for it and communicate it to our customers.

Acknowledgements

  • Kevin Schmidt