Entries by Incident Management Team (29)

Friday
Mar 11, 2011

NLA Service Incident - Report 

Issue:

An error occurred in the main application that distributes XML files to XML clients. The application was not working for approximately two and a half hours.

 

Root cause:

Still being investigated and has been escalated to development teams.

 

Action taken to prevent further incidents:

We have put internal monitors in place to alert the on-call engineer and 9* if this particular issue occurs again. A temporary fix is in place until we have a permanent solution.

Monday
Mar 23, 2009

NLA eClips Service Incident - Report

Problem:

 

The NLA’s London data centre lost connectivity to the Internet on Friday 20th March at 18:05. Connectivity was fully restored at 19:39. All NLA services were unavailable during this period.

 

Cause:

 

The NLA’s Internet Service Provider detected a problem with one of its upstream peers which affected some downstream customers. The NLA was not one of the customers affected by this problem. While the problem was being investigated, the ISP decided to fail over all customers (including the NLA) to an alternate peer at 18:00 on Friday. The failover process did not complete as expected and resulted in Internet connectivity failures for some customers, including the NLA.

 

Solution:

 

Customers of the NLA’s Internet Service Provider that were not affected by the original upstream peering issue were then failed back to the primary peer at 19:20. Normal connectivity was restored throughout the NLA environment by 19:39. The NLA has received assurances from its Internet Service Provider that a review of the actions taken on Friday has been completed and that the procedure for peering changes has been modified to avoid a similar event in future. Operational steps associated with changes in this procedure will be scheduled for testing during a maintenance window soon.

 

Wednesday
Feb 11, 2009

NLA eClips Service Incident - Report

Problem:

 

Some core content for publication date 11th Feb 2009 was missing from the eClips database, and from eClips feeds, by the NLA KPI deadlines. Certain pages were missing from three core titles:

Daily Mirror

The Daily Telegraph

The Guardian

 

 

Cause:

 

A core, automated NLA service malfunctioned on Tuesday 10th February 2009 at 19:24. NLA engineers were alerted to the malfunction at 19:52 and had the service running normally by 20:16. Unfortunately, the nature of the malfunction resulted in the corruption of pages that publishers delivered to the NLA between 19:24 and 20:16.

 

When the corruption became apparent, publishers were engaged for retransmission of the affected pages. The normal NLA escalation process was not followed when recovering pages from The Daily Telegraph, which resulted in these pages being made available much later than expected. All the missing pages have now been re-delivered to the NLA and are being processed.

 

Solution:

 

To mitigate the risk and effects of a similar event occurring in future, the automated monitoring strategy for the affected service will be modified to alert NLA engineers to impending failure, rather than only after failure. The publisher escalation process will be reviewed to reduce the possibility of deviation from process under similar circumstances. Finally, the architecture of the affected service will be enhanced to make it more robust.
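As an illustration of alerting on impending failure rather than on failure, the sketch below classifies a metric reading against two thresholds, so that a warning is raised before the failure level is reached. It is a minimal example only: the metric (queue usage), the threshold values, and the names `Threshold` and `classify` are assumptions for illustration and do not describe the NLA's actual monitoring.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float      # level at which engineers are alerted early
    critical: float  # level at which the service is treated as failed


def classify(value: float, threshold: Threshold) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warn:
        return "warning"
    return "ok"


# Hypothetical metric: percentage of a processing queue in use.
queue_usage = Threshold(warn=70.0, critical=95.0)

for reading in (40.0, 82.5, 97.0):
    print(reading, classify(reading, queue_usage))
```

With a warning level set below the failure level, engineers have time to act while the service is still running.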

 

Tuesday
Nov 04, 2008

NLA eClips Service Incident - Report

Problem:

 

Loading of The Daily Mail into the NLA database failed on Tuesday 4th November 2008. This meant that the distribution of feeds to NLA clients did not contain 1st edition Daily Mail content by the target time of 01:00. Loading and distribution of other titles were unaffected.

 

Cause:

 

A momentary connectivity failure between the server running the loading module and the storage device to which loading takes place caused a single thread of the loading module to loop erroneously.

 

Solution:

 

NLA engineers restarted the loading module, which resulted in the loading and distribution of all the 1st edition Daily Mail content by 01:15.

 

Analysis of the loading module's source code has identified areas where modifications can be made to prevent a similar incident in future. These modifications will be scheduled soon.
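The report does not say which modifications were identified. One common way to stop a thread from looping indefinitely after a brief storage outage is to bound the number of retries and raise an error once they are exhausted, so that the failure is visible to monitoring. The sketch below shows that general pattern only; `write_with_retry` and its parameters are hypothetical and are not taken from the loading module.

```python
import time


def write_with_retry(write, attempts=3, delay=0.5, sleep=time.sleep):
    """Call write() up to `attempts` times, then give up with an error.

    `write` is any zero-argument callable that raises OSError when the
    storage device cannot be reached (hypothetical stand-in for a load step).
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return write()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                # Wait a little longer before each further attempt.
                sleep(delay * attempt)
    # Fail loudly instead of looping forever.
    raise RuntimeError(
        f"storage write failed after {attempts} attempts"
    ) from last_error
```

A caller that catches the final `RuntimeError` can alert an engineer or requeue the title, instead of leaving a thread spinning unnoticed.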

Saturday
Oct 18, 2008

NLA eClips Service Incident - Report

 

Problem:

 

Certain eClips customers had intermittent access to NLA web and FTP services from 7:35am to 9:00am and from 9:19am to 9:33am on Saturday 18th October 2008.

 

Cause:

 

The owners of the NLA's London hosting facility were carrying out the first phase of a planned annual power-down exercise on Saturday 18th October. This involved disabling one of the two power feeds which supply the NLA infrastructure. The NLA's infrastructure can usually tolerate removal of one power feed as it has a dual-fed, clustered architecture. In this instance, the automated failover of one clustered network component did not complete successfully.

 

Solution:

 

The failover process for the affected network component required manual intervention by engineers, who ensured that it completed successfully. The engineers also made some configuration changes to the cluster which should reduce the risk of a similar event occurring in future.