Richard Cooper, the BBC’s Controller of Digital Distribution for BBC Future Media, has posted a blog explaining the background to the serious incident over the weekend which impacted BBC iPlayer, BBC iPlayer Radio, and audio and video playback on other parts of bbc.co.uk, which also forced the Corporation to use its emergency homepage for prolonged periods of time.
“We have a system comprising 58 application servers and 10 database servers that provides programme and clip metadata. This data powers various BBC iPlayer applications for the devices that we support (which is over 1200 and counting) as well as modules of programme information and clips on many sites across BBC Online. This system is split across two data centres in a ‘hot-hot’ configuration (both running at the same time), with the expectation that we can run at any time from either one of those data centres,” he advised.
“At 9.30 on Saturday morning (19th July 2014) the load on the database went through the roof, meaning that many requests for metadata to the application servers started to fail. The immediate impact of this depended on how each product uses that data. In many cases the metadata is cached at the product level, and can continue to serve content while attempting to revalidate. In some cases (mostly older applications), the metadata is used directly, and so those products started to fail,” he reported.
“At almost the same time we had a second problem. We use a caching layer in front of most of the products on BBC Online, and one of the pools failed. The products managed by that pool include BBC iPlayer and the BBC homepage, and the failure made all of those products inaccessible. That opened up a major incident at the same time on a second front,” he continued.
“Our first priority was to restore the caching layer. The failure was a complex one (we’re still doing the forensics on it), and it has repeated a number of times. It was this failure that resulted in us switching the homepage to its emergency mode (‘Due to technical problems, we are displaying a simplified version of the BBC Homepage). We used the emergency page a number of times during the weekend, eventually leaving it up until we were confident that we had completely stabilised the cache,” he noted.
“Restoring the metadata service was complex. Isolating the source of the additional load proved to be far from straightforward, and restoring the service itself is not as simple as rebooting it (turning it off and on again is the ultimate solution to most problems). Performance of the system remained sufficiently poor that in the end we decided to do some significant remedial work on Saturday afternoon, which ran on until the evening. During that period, BBC iPlayer was effectively not useable,”.
According to Cooper, after that work was complete, the BBC was in a “walking wounded” state that allowed close to normal operation for much of the site, though BBC iPlayer remained down on a number of devices. “We chose to run it in this mode throughout the rest of the weekend while planning a full restoration of the service. By the time we were ready to do that we were entering the peak period on Sunday evening, so rather than risk the service further, we chose instead to do it on Monday morning”
Cooper said the BBC recognised that during this incident, with BBC iPlayer unavailable for some periods, some users may not have been able to watch or listen to the programmes they wanted. “I’m afraid we can’t simply turn back the clock, and as such the availability for you to watch some programmes in the normal seven day catch-up window was reduced. Essentially programmes aired on Saturday 12th July and Sunday 13th July were not available this last weekend for some users. It’s small consolation but that was the weekend of the World Cup Final, Scottish Open, Women’s Open and other live sporting events which are less likely to be viewed on catch-up. I should also stress that programmes aired this weekend – when the problems occurred – are available now on BBC iPlayer,” he confirmed.
“BBC iPlayer is an incredibly popular product, last year alone we had three billion requests and instances like this are incredibly rare. We will now be completing the forensics to make sure that we’ve fully understood the root causes, and put in place the measures necessary to minimise the chances of such interruptions in the future. We’re sorry for the inconvenience,” he concluded.