Degraded performance on Web Application access

Incident Report for CloudShare

Postmortem

This communication is to provide you with the Reason for Outage pertaining to the issue which affected Verizon MRS Services in Miami (MIA1) on September 23-24, 2019.

Please be advised that services were affected by a dual failure:

1. Processor card failure in network router (Primary & backup)

2. Gateway router was experiencing max CPU utilization and required a full reboot.

Troubleshooting history:

o The Verizon NOC identified what appears to be a failing route processor. A remote switchover was attempted but failed.

o We are pending a dispatch to restore access to and/or reboot the equipment, as we have lost access to it.

o This case has been escalated to higher management within Verizon.

o At this time there is a tech on site working with an engineer to restore access to the equipment.

o We have an Equinix tech on site now and he should be calling in shortly. The plan of action is to reboot the BR1 device to restore access to the chassis and routing table. Verizon NOC is joining a crisis bridge now to discuss alternative plans as well. As this unfolds we will continue to update you.

o Troubleshooting is still ongoing with the Equinix tech on site. In the meantime, the NOC is also seeking assistance from the vendor, Cisco.

o Isolation efforts are still in progress with the Equinix tech and our engineers.

o At this time, we are getting ready to have the tech on site attempt to reboot the box.

o The box has been rebooted and we now have access to it. We are continuing to troubleshoot to further isolate the issue. At this time we are seeing several BGP sessions up. We will continue to provide updates.

o We are still on a crisis bridge with multiple internal groups and are now working with Cisco TAC on the issue. There is no ETTR at this time, but we will continue to update you with progress.

o We are making progress with Cisco. After pulling one of the cards completely out, the equipment stopped rebooting and is starting to load. It will take up to 20 minutes for it to fully load. We will update the network ticket with the progress and will continue to update this thread as well.

o After the equipment reboot, some customer services came up, but after a while the service went down again. As the service did not come back up, the engineers are now working with the third-party providers to isolate the issue. This case is being treated as high priority. No ETR is available at the moment.

o Troubleshooting continues on the crisis bridge with multiple internal partners. At this time, we are seeing packets leaving BR2 but not making it to GW3. The next step is to shut down GW3 and force traffic to GW4 in an attempt to isolate the issue. We will continue to update you with progress.

Posted Sep 29, 2019 - 10:03 UTC

Resolved

This incident has been resolved.

Posted Sep 25, 2019 - 06:29 UTC

Update

Our provider, Verizon, has made some positive strides in resolving the issue. Please confirm whether you are able to access CloudShare as you normally would. We are continuing to monitor network traffic.

Posted Sep 25, 2019 - 01:59 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Sep 25, 2019 - 00:11 UTC

Update

We are continuing to work on a fix for this issue.

Posted Sep 25, 2019 - 00:01 UTC

Update

Verizon engineers are still working hard on this issue.

Posted Sep 24, 2019 - 23:30 UTC

Update

Verizon engineers are still working hard on this issue.

Posted Sep 24, 2019 - 23:28 UTC

Update

Verizon engineers are still working hard on this issue.

Posted Sep 24, 2019 - 23:26 UTC

Update

Verizon engineers are still working on this.
They have now opened a crisis bridge with multiple parties working towards resolution, including tier 3, management, and Cisco TAC.

Posted Sep 24, 2019 - 16:41 UTC

Identified

Our ISP (Verizon) has identified a large network issue and their engineers are working on a solution at the moment. We are keeping in constant touch with them regarding the status. We will update you as soon as possible. Thank you for your patience and understanding.

Posted Sep 24, 2019 - 13:38 UTC

Investigating

We are currently experiencing issues with some clients accessing our application at https://use.cloudshare.com. So far our investigation points to WAN routing problems at the ISP level, and we are currently working with the ISPs on a solution.
As an interim workaround that might help, try switching to a different network (for example, your cellular carrier), as it sometimes yields a different route.
We are sorry for the inconvenience and thank you for your patience!

Posted Sep 24, 2019 - 12:08 UTC

This incident affected: Web Interface, Remote Access Services (Remote Access Services - Miami, Remote Access Services - Amsterdam, Remote Access Services - Singapore), and Inbound/Outbound Internet (Inbound/Outbound Internet - Miami).