Difference between revisions of "OSB:20130708-01"

From Digibase Knowledge Base
Line 2: Line 2:
 
=''OPERATIONAL STATUS BULLETIN: {{PAGENAME}}''=
 
  
'''Issued:''' [[User:Kradorex Xeron|Kradorex Xeron]] ([[User talk:Kradorex Xeron|talk]]) 23:44, 8 July 2013 (EDT)
+
'''Issued:''' [[User:Kradorex Xeron|Kradorex Xeron]] ([[User talk:Kradorex Xeron|talk]]) 03:14, 28 September 2013 (EDT)
  
'''In Regards To:''' Extra-facility Power Outage
+
'''In Regards To:''' Core Router Failure
  
 
'''Facility:''' Unicomplex One (Hamilton, ON, Canada)
 
  
'''Affected:''' *.digibase.ca (all systems, all services)
+
'''Affected:''' *.digibase.ca (all systems, all services), especially cplexus.unimatrix01.digibase.ca
  
'''Ticket #:''' ''No internal ticket was issued''
+
'''Ticket #:''' CT-0000074
  
 
'''Expected Duration:''' ''Unknown''  
 
  
'''Status: ''' Event started at 19:13, ended at 21:17. Outage was estimated 2-3 hours.
+
'''Status: ''' Event started at 22:40 on 25 September 2013 and ended at 07:30 on 26 September 2013; failovers were completed in between. Equipment outage was estimated at 8.5 hours. Service outage was estimated at 1.5 hours.
  
 
==Situation Description==
 
Starting on 8 July 2013, there was a power outage that affected multiple blocks of the city, including our facility, as described by our electrical utility:
+
Starting at 22:40 on 25 September 2013, our central plexus router experienced a hardware failure. The failure affected the system's non-volatile storage, where the operating system and configuration are stored.
 
 
<blockquote>
 
'''The Power for this outage has been restored on'''<br />
 
'''Monday July 8, 2013 at 9:17 PM'''<br />
 
<br />
 
Original Power Outage Date: Monday July 8, 2013<br />
 
Time: 7:13 PM<br />
 
<br />
 
Horizon Utilities reports there is presently a limited power outage in the Downtown area of Hamilton affecting 1607 customers.<br />
 
<br />
 
The cause of the outage is: an underground distribution problem<br />
 
<br />
 
Horizon Utilities crews have been dispatched to make repairs. The estimated time for power restoration is 10:00 PM. Updates will be posted periodically.
 
</blockquote>
 
  
 
==Impact==
 
The incident made our network unavailable to the wider Internet. All primary, secondary and tertiary systems were powered down; systems on backup power also needed to be powered down.
+
This incident caused our network to become unavailable to the public.
  
 
==Updates==
 
  
===21:17, 8 July 2013===
+
===23:30, 25 September 2013===
Facility primary power was restored.
+
Manual failover to a network switch capable of rudimentary routing was completed. Services temporarily operational.
 
 
===21:18, 8 July 2013===
 
Work began to cold-start primary systems.
 
 
 
===21:30, 8 July 2013===
 
Main computer core (X9CC-ECS) cold-start power-on completed. The system ACP (Application Compute Processor) remained operational through the outage on its own backup power for data integrity, so that processor did not need cold-start procedures. No data loss occurred.
 
 
 
===21:31, 8 July 2013===
 
Central plexus cold-start power-on completed.
 
 
 
===21:32, 8 July 2013===
 
Mastercontrol powered and operational.
 
 
 
===21:40, 8 July 2013===
 
Main computer core (X9CC-ECS) bootup procedures completed, system operational
 
  
===21:50, 8 July 2013===
+
===04:30, 26 September 2013===
Unimatrix One declared online and operational
+
The plexus core router was put back in place.
  
===22:00, 8 July 2013===
+
===05:00, 26 September 2013===
Secondary systems powered.
+
Restoration was completed at approximately 05:00 on 26 September 2013.
  
===22:10, 8 July 2013===
+
===07:30, 26 September 2013===
Tertiary systems powered.
+
Services operational. Traffic verified flowing.
  
===22:10, 8 July 2013===
+
48-hour monitoring commences, ending at 07:30 on 28 September 2013.
Situation concluded.
 

Revision as of 02:14, 28 September 2013

OPERATIONAL STATUS BULLETIN: 20130708-01

Issued: Kradorex Xeron (talk) 03:14, 28 September 2013 (EDT)

In Regards To: Core Router Failure

Facility: Unicomplex One (Hamilton, ON, Canada)

Affected: *.digibase.ca (all systems, all services), especially cplexus.unimatrix01.digibase.ca

Ticket #: CT-0000074

Expected Duration: Unknown

Status: Event started at 22:40 on 25 September 2013 and ended at 07:30 on 26 September 2013; failovers were completed in between. Equipment outage was estimated at 8.5 hours. Service outage was estimated at 1.5 hours.

Situation Description

Starting at 22:40 on 25 September 2013, our central plexus router experienced a hardware failure. The failure affected the system's non-volatile storage, where the operating system and configuration are stored.

Impact

This incident caused our network to become unavailable to the public.

Updates

23:30, 25 September 2013

Manual failover to a network switch capable of rudimentary routing was completed. Services temporarily operational.

04:30, 26 September 2013

The plexus core router was put back in place.

05:00, 26 September 2013

Restoration was completed at approximately 05:00 on 26 September 2013.

07:30, 26 September 2013

Services operational. Traffic verified flowing.

48-hour monitoring commences, ending at 07:30 on 28 September 2013.
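
The timeline arithmetic can be checked with a short Python sketch. It assumes the 04:30, 05:00 and 07:30 entries fall on 26 September 2013, as the restoration timestamp and the monitoring end time imply. The event labels are illustrative names chosen for this sketch, not identifiers from our systems.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Timeline taken from this bulletin. The 26 September dates are an
# assumption (see the note above); the keys are hypothetical labels.
events = {
    "failure": datetime.strptime("2013-09-25 22:40", FMT),
    "failover": datetime.strptime("2013-09-25 23:30", FMT),
    "router_replaced": datetime.strptime("2013-09-26 04:30", FMT),
    "restored": datetime.strptime("2013-09-26 05:00", FMT),
    "verified": datetime.strptime("2013-09-26 07:30", FMT),
}


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two labelled events."""
    return (events[end] - events[start]).total_seconds() / 3600


# End of the 48-hour monitoring window, counted from traffic verification.
monitoring_end = events["verified"] + timedelta(hours=48)

if __name__ == "__main__":
    print(f"failure to failover: {hours_between('failure', 'failover'):.2f} h")
    print(f"failure to verified: {hours_between('failure', 'verified'):.2f} h")
    print(f"monitoring ends:     {monitoring_end.strftime(FMT)}")
```

The elapsed figures are computed from the logged timestamps only; the durations quoted in the Status line are stated there as estimates.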