Focuses on disaster recovery planning, training, and testing of federal IT systems.
Audit of State of Va. cloud outage
May 31, 2011 at 7:28 pm #131642
Title: Audit of Northrop Grumman's Performance Related to the DMX-3 Outage and Associated Infrastructure.
The information included in this audit covers many aspects of the reviewed environment, including computing, storage, data recovery, monitoring, and management. Implementing the recommended actions lays the groundwork for a more agile, proactive environment that can respond to critical incidents in a timely and efficient manner. The recommended actions within each topic should be reviewed, and an assessment of the scope of work needed to implement them should be created.
Human error during the memory board replacement process caused the extended outage. Both memory board zero (0) and memory board one (1) were reporting errors before the replacement: board one (1) was reporting hard errors and board zero (0) was reporting soft errors. As stated in EMC’s RCA, “the initial determination to replace memory board 0 first did not take into account the uncorrectable events that had posted on board 1” and “Based on extensive post-incident analysis, EMC has determined that replacing memory board 1 first would have prevented any issues during the replacement activity itself.”
During the interviews and in a review of the data, two issues kept presenting themselves. The first is that, in the professional opinion of Agilysys, the absence of a process for determining the conditions under which the data replication mechanism (SRDF) should be suspended allowed replication to continue running even though maintenance was being performed in an environment of unusual risk. As mentioned in different sections throughout this document, it is the professional opinion of Agilysys that the lack of proactive planning regarding when to suspend the SRDF replication mechanism, in the absence of remote “Point in Time” copies of data, was the cause of the data corruption experienced at the SWESC recovery site. If replication had been stopped prior to the hardware replacement, incremental restores of data for customers subscribing to the tier one replication service could have been completed from the SWESC location, reducing recovery times and streamlining recovery actions. Northrop Grumman indicates that a process is in place to assess whether the SRDF replication mechanism needs to be stopped in the event of a possible corruption event, but it does not appear that a full impact analysis has been completed to identify events that would require stopping SRDF, or that documented procedures have been provided to support staff. This exercise should be completed to avoid a repeat of the issues experienced during the August 25th event.
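To make the kind of documented procedure the audit asks for more concrete, here is a minimal sketch of a pre-maintenance check. It is not Northrop Grumman's or EMC's process, and it does not call any SRDF tooling; the names `Risk`, `MaintenancePlan`, and `should_suspend_replication` are hypothetical, and the rule itself (suspend when risk is elevated and no remote point-in-time copy exists) is an assumption drawn from the audit's reasoning.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    ROUTINE = 1
    ELEVATED = 2  # e.g. hardware is already reporting errors


@dataclass
class MaintenancePlan:
    description: str
    risk: Risk
    # True if the recovery site holds an independent snapshot/clone
    # that replication cannot overwrite.
    remote_point_in_time_copy: bool


def should_suspend_replication(plan: MaintenancePlan) -> bool:
    """Hypothetical rule: suspend replication before elevated-risk work
    when the remote site has no point-in-time copy to fall back on."""
    return plan.risk is Risk.ELEVATED and not plan.remote_point_in_time_copy


# Illustrative plan resembling the situation the audit describes.
board_swap = MaintenancePlan(
    description="replace memory board on a storage array reporting errors",
    risk=Risk.ELEVATED,
    remote_point_in_time_copy=False,
)
print(should_suspend_replication(board_swap))
```

The point of writing the rule down, in code or in a runbook, is that support staff do not have to make the call under pressure: continuous replication copies corruption to the recovery site just as faithfully as it copies good data.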
The corruption of the Global Catalog and other critical databases highlights the second issue: a lack of data protection in key environments. Although the mainframe environment uses “Point in Time” snapshot/clone copies to recover from data corruption/disruption events, this process is not used in the open systems environment. The question arises as to why the same level of criticality has not been assigned to other key applications that the enterprise relies upon. A business impact analysis review should be implemented in conjunction with VITA and state agencies to reassess the recovery time and recovery point objectives needed for key data. Recovery metrics should be based on business criticality, revenue loss, and how long a disruption of service can be absorbed by the business.
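As a rough illustration of how a recovery point objective (RPO) relates to point-in-time copies, the sketch below checks whether a snapshot schedule can satisfy a stated RPO. It assumes that rolling back to the most recent snapshot loses at most one snapshot interval of data and ignores the time a snapshot takes to complete. The application names and figures are invented for the example and do not come from the audit.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta) -> timedelta:
    """Assumption: a rollback to the latest snapshot loses, at worst,
    everything written since that snapshot, i.e. one full interval."""
    return snapshot_interval


def meets_rpo(snapshot_interval: timedelta, rpo: timedelta) -> bool:
    """True if the snapshot schedule can satisfy the recovery point objective."""
    return worst_case_data_loss(snapshot_interval) <= rpo


# Hypothetical applications: (snapshot interval, RPO agreed with the business).
applications = {
    "directory-service": (timedelta(hours=24), timedelta(hours=4)),
    "licensing-database": (timedelta(hours=1), timedelta(hours=4)),
}

for name, (interval, rpo) in applications.items():
    status = "meets RPO" if meets_rpo(interval, rpo) else "does NOT meet RPO"
    print(f"{name}: {status}")
```

A business impact analysis supplies the RPO column; the snapshot interval is then an engineering decision that either meets it or does not.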
The lack of active monitoring of the environment also raises a concern. It is the professional opinion of Agilysys that the configuration management database and OpenView architecture are not at the maturity level expected at this point in their lifecycle, and that the current reporting on detailed application and system dependencies available during events, which is needed to enable a stable environment, is inadequate. It was also noted that historical reporting for error trending is only available for forty-five (45) days. Furthermore, there is no official process governing when, or whether, to notify state agency application owners when a system outage is observed. During interviews, Northrop Grumman staff and sub-contractors stated that there are plans for projects to improve historical trending and reporting and to add application dependency information to the configuration management database, but none has yet been initiated. It is the professional opinion of Agilysys that these two issues should have been part of the initial design and requirements.
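The sketch below shows, under stated assumptions, why application dependency data matters for outage notification: given a map of which components depend on which, the owners to notify can be derived mechanically. The component names, owner addresses, and the functions `affected_components` and `owners_to_notify` are all hypothetical; nothing here reflects the actual configuration management database or OpenView setup.

```python
from collections import deque

# Hypothetical dependency data: component -> components that depend on it.
DEPENDENTS = {
    "storage-array-1": ["db-server-a", "mail-server"],
    "db-server-a": ["licensing-app"],
    "mail-server": [],
    "licensing-app": [],
}

# Hypothetical owner contacts for agency-facing components.
OWNERS = {
    "licensing-app": "agency-x-owner@example.org",
    "mail-server": "agency-y-owner@example.org",
}


def affected_components(failed, dependents):
    """Breadth-first walk of everything downstream of the failed component."""
    seen = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def owners_to_notify(failed, dependents, owners):
    """Sorted, de-duplicated owner contacts for every affected component."""
    affected = affected_components(failed, dependents)
    return sorted({owners[c] for c in affected if c in owners})


print(owners_to_notify("storage-array-1", DEPENDENTS, OWNERS))
```

Without dependency records of this kind, deciding whom to notify during an outage depends on what individual staff happen to remember.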
Download Audit (pdf file) http://www.governor.virginia.gov/News/docs/NorthropGrummanAudit.pdf
May 31, 2011 at 7:30 pm #131645