TSB Bank Disaster

This seems like an absolute disaster:

The very short version is that a UK bank, TSB, which had been merged into and then many years later was spun out of Lloyds Bank, was bought by the Spanish bank Banco Sabadell in 2015. Lloyds had continued to run the TSB systems and was to transfer them over to Sabadell over the weekend. It's turned out to be an epic failure, and it's not clear if and when this can be straightened out.

It is bad enough that the bank's IT problems have been so severe and protracted that a major newspaper, The Guardian, created a live blog for them that has now been running for two days.

The more serious issue is that customers still can't access online accounts and, even more disconcertingly, are sometimes being allowed into other people's accounts, which says there are massive problems with data integrity. That's a nightmare to sort out.

Even worse, the fact that this situation has persisted strongly suggests that Lloyds went ahead with the migration without allowing for a rollback.

This seems to be a mistake, and not enemy action.

Posted on April 27, 2018 at 6:00 AM • 52 Comments

Comments

Bank IT Blogger • April 27, 2018 6:26 AM

In your article concerning the TSB disaster, you suggest Lloyds went ahead with the migration without allowing for a rollback. As an insider on the LBG side, I can categorically state that Sabadell were in control of the scope and method of migration. From the outset it was clear that Lloyds would support the actions driven by Sabadell; we had no influence whatsoever on their options to roll back. If it had been asked for, I am pretty sure it would have been wholeheartedly supported.

TSB Customer #123456 • April 27, 2018 6:39 AM

As a customer of TSB, the only outage I have noticed is the one that was advertised well beforehand. I appreciate I'm only an individual account holder and this is anecdotal. However, I've been able to access my account and pay bills, and I've had zero problems since the changeover went live on Monday. I do wonder if it's a small vocal minority creating all the noise, or if it is a genuine 'catastrophe'. Will we ever find out?

Damo • April 27, 2018 6:43 AM

"Mummy it wasn't my fault! A bigger Spanish boy came and made me do the migration! Do I still get the pocket money you promised for being good? The Spanish boy did the bad thing"

Petter • April 27, 2018 7:37 AM

Gosh.
Poor sods.
And I don’t mean TSB staff but their customers.

I wonder if they will try to pin this on the Russians...

Mike Lieman • April 27, 2018 7:55 AM

Last time I checked, rollback plans were a requirement for change management under ITIL.

Iggy • April 27, 2018 8:08 AM

And the hope of ending paper and coin currency fades yet more and again. Yeah, that's not my hope, but it is for some who want to, apparently, conduct a commercial transaction as if by magic and so they don't feel the import of it. This incident is just the latest on the list of why paper money must stay the chief legal tender.

ATN • April 27, 2018 8:30 AM

Worse or better than cryptocurrency "disasters"?
What can possibly go wrong when [programming language/methodology] JavaScript meets [programming language/methodology] COBOL?

Dustin • April 27, 2018 8:59 AM

While I understand your intention, calling it a "mistake" seems overly generous. A more appropriate word would be "incompetence".

RealFakeNews • April 27, 2018 9:19 AM

@Bank IT Blogger:

A question for you: why was the new system not tested before live accounts were moved to it? Surely any basic form of testing would have highlighted these massive failures in the system?

It's all very well passing blame to someone else, but why did TSB not insist on a period of testing? Run the systems in parallel? Migrate only when it is demonstrated that the new system actually works?

RealFakeNews • April 27, 2018 9:22 AM

(Sorry for spamming the discussion - I'd edit a single post if it was possible - maybe a Mod could merge my posts). [Moderator: Done.]

How this can be classed as a "mistake" is beyond words.

Gross incompetence; absolute idiocy; up.

Examples:

* People logging into their online banking to see someone else's account details
* People finding they have received money intended for another bank account
* People finding money they transferred is moved to the wrong bank account
* The wrong transaction values being moved (either received or sent)

Just how in the heck is that even possible???

There needs to be an NTSB investigation into what code went wrong, where, and how, and the results to be published.

These kinds of systems failures are not merely the result of a glitch or hardware fault - some idiot somewhere wrote something so screwed-up it failed beyond catastrophically.

Who actually wrote the Spanish system?

Why is TSB hiring IBM employees directly to fix the mess, if the problem was the Spanish side?


Just A Guy • April 27, 2018 9:32 AM

I'm not an Engineer but I work in the industry. As someone who is based in the USA it's unfathomable that a major bank would be allowed to be offline for 7 hours, let alone 7 days. If BoA, Wells Fargo, US Bank, Chase, etc. were offline hell would be unleashed by regulators, politicians, etc. That the UK authorities have been relatively quiet on this is amazing. This is a failure of epic proportions and I'm struggling to recall a similar meltdown of this scale.

The root cause won't be known for a while, but after reading posts, tweets, error messages reported by users, LinkedIn profiles & joining a couple of dots I can't imagine that using IBM - aka I'm By Myself - to fix this is going to help. This smells like a failure of architecture/design & having the wrong developer expertise.

Below is TSB's new cloudy stack (from a LinkedIn profile of a TIBCO employee who is assigned to Sabadell). I suspect this project lacked Engineers with heavy-duty Microservices expertise and, instead, they used old-school J2EE guys who looked at the cloud architecture through their old-school monolithic eyes. I also suspect they did not employ true SDETs for testing purposes and used traditional QA'ers. i.e. QTP & Selenium playback QA monkeys when for something of this scale they needed Amazon/Netflix/Google quality SDETs.

"Component of the BancSabadell Architecture team in the TSB project (TSB Bank in UK acquired by BancSabadell group) for the definition and implementation of a new banking platform based on the latest technologies and methodologies and oriented to a hybrid infrastructure between on-premises and public cloud

Technologies:

- PaaS (TIBCO SilverFabric)
- Micro services (Spring Cloud Netflix)
- SOA (TIBCO AMX Service Grid, TIBCO BusinessWorks, TIBCO API Exchange Gateway)
- Single Page Application (AngularJS)
- Asynchronous Messaging (TIBCO EMS)
- APM (Application Performance Monitoring)
- Distributed Search & Analytics (ElasticSearch)
- Containerization (Docker)."

It has also been posted elsewhere that Sabadell tried to port the existing code for their Spanish bank. Said code was garbage with hard-coded values for server IP addresses. They used Netflix OSS Microservices from GitHub where the copyright header was changed but not references to Netflix in their error messages. It is also claimed that upper management decided that load testing assuming 500 simultaneous users was sufficient.


I can't see how TSB can recover from this. They're toast.

London BankingITChap • April 27, 2018 9:43 AM

> Just how in the heck is that even possible???

Oh, it is so very possible... Code does darned crazy things. You must test every single combination of circumstances. And still it will surprise you.

> Why is TSB hiring IBM employees directly to fix the mess, if the problem was the Spanish side?

At this level of mess you cannot hide behind "it was not my fault". Everyone must pull together: just finding out what is really happening takes a lot of resources, and there will be tons of small hacky patches to stop the bleeding right now. Root cause analysis will happen later.

No one will fix the main issue: many banks right now have so many external consultants and big Indian body shops that they lose the knowledge to run big projects properly. No one in house remembered to put it in somebody's book of work to test it properly and prepare a rollback plan.

Dave • April 27, 2018 10:01 AM

> I can't see how TSB can recover from this. They're toast.

Actually there’s a lot of lethargy in the consumer financial markets in the UK. It’s said that people are more likely to divorce than change their bank account.

Nevertheless on this occasion I think TSB will hemorrhage customers when the dust finally settles. Maybe 20% within 6 months? Plus they’ll need to pay out considerable compensation, and most likely be whacked with a whopping great fine from the regulator.

As for the brand damage, that’s incalculable. Then again, us Brits are a fickle lot. As soon as the situation stabilises, I suspect most customers will stay put, rather than risk the fresh hell of the Current Account Switch Service.

Alejandro • April 27, 2018 10:29 AM

My local bank merged and decided to make things better by upgrading their online experience and thus completely trashed it.

They did it so well, almost all of its thousands of customers became unable to log in to their accounts for over two weeks. In effect, something like this is an unannounced closure of the bank preventing people from getting their funds.

I complained so much, to this day, I very seldom get challenged on login. A positive was the brick and mortar branch was just down the street. LOL, they sure knew who I was during this episode.

Why not run the old and new systems in parallel until the bugs are worked out?

Ben • April 27, 2018 10:39 AM

As another person with a lot of experience in this domain, including transition planning for critical systems on this scale, I'm struggling to guess what happened here as almost everything seems to have gone wrong - it seems to have failures in the core technical architecture, in the test process (especially under high concurrency?), and in the transition/rollback planning.

I understand there was pressure to get away from the Lloyds platform due to costs, but even with management pressure it's hard to see how it could have got through a rollout-readiness review to start the deployment, or why it then failed to invoke the backout plan when things started to go awry. Heads will roll within TSB, but I hope we eventually see some kind of retrospective as I'm sure there are lessons to be learned here.

However, I'm not sure how much it will affect the longer-term customer base - I have TSB accounts and a TSB credit card, and apart from occasional problems logging in to online banking (and general slowness when I can log in) everything has worked and I haven't been significantly affected, so I'm not sure how widespread the worst-case impact is. They also pay pretty good interest rates (3% on £6k of savings - used to be 5% on £8k - with no hoops to jump through), and I'm sure that plenty of people just use TSB as a savings account for that reason.

Jere • April 27, 2018 10:42 AM

It's a disaster alright, but not completely unheard of. Danske Bank ended up losing a big share of their customers when they bought Finnish Sampo and decided to migrate the banking systems. The problems persisted for weeks - massive downtime slowly turned into corrupt and bogus data.

What I'm really curious about is your urge to explain how this isn't enemy action when there was absolutely no reason in the first place to suspect that. Somebody is seeing spooks here.

albert • April 27, 2018 10:46 AM

@Dave, etc.,
Wells Fargo is still in business.

@,
"...The more serious issue is the fact that customers still can't access online accounts and even more disconcerting, are sometimes being allowed into other people's accounts, says there are massive problems with data integrity. That's a nightmare to sort out...."

I have no details, but 'disconcerting' sounds like corpo-speak to me. It's mind-boggling. Can some IT pro explain how this can happen?

@,
"...This seems to be a mistake, and not enemy action. ..."
"We have met the enemy, and he is us." - Pogo

. .. . .. --- ....

echo • April 27, 2018 10:48 AM

Feudalism reigns supreme in the UK across all major institutions. Competition within the state sector and regulation of the private sector has effectively collapsed under the sheer weight of sclerotic vested interest.

The British have a reputation of being the worst lovers in Europe. If this is a benchmark why does anything else surprise?

Denton Scratch • April 27, 2018 11:42 AM

I worked as a software contractor for the TSB before they merged with Lloyds. Actually I was under the direction of a consultancy company that was working for the TSB. It was a major project - we were building a completely new front-office system.

There were hardly any TSB permies on the team, and those that there were were deadbeats. Some of the best-paid techs in the company were a handful of COBOL programmers (contractors) who had knowledge of the legacy systems. I presumed that meant that none of the permies knew as much as these contractors.

I stopped shortly after the Lloyds merger was announced (for other reasons). As far as I am aware, merging the TSB and Lloyds IT operations took at least another ten years, and cost $LOTSOFMONEY.

@JustAGuy "This smells like a failure of architecture/design & having the wrong developer expertise."

This is not a mistake perpetrated by some employee or consultant; rather, it is a very serious management failure. If there was a mistake made by some staff member, then it is unquestionably management's fault that this was possible. Umm, which management? Not Lloyds - they've sold the business. There can be no question that the fault lies with Sabadell management.

I used to bank with the Abbey National, a former mutual society that was privatized and promptly acquired by a Spanish bank (Santander). The result was a sharp decline in customer service, and a sharp increase in prices (who would have thought it!) The TSB was never mutual, although if you squinted it looked a bit like one. Nevertheless, I wouldn't have wanted an account with a Spanish-owned version of the TSB, based on my experience with Santander.

Incidentally, this is all delayed fallout from the 2008 banking crash: Lloyds had to divest themselves of some of their operations for monopoly reasons, following their acquisition of HBOS (a 'rescue'), which gave them an absolute majority of high-street banking outlets. The HBOS acquisition was a huge mistake for Lloyds; I am sure they were forced into it by the UK government (which subsequently acquired a 43% stake in the company in exchange for bailing them out). Without the HBOS acquisition, Lloyds would have survived as the only UK high-street bank still standing on their own feet. I was an HBOS shareholder AND a Lloyds shareholder, and lost a lot of money in 2008. The operations that they decided to sell to Sabadell were the TSB branches.

I'm sorry for the TSB customers; they didn't have much opportunity to switch elsewhere, because nearly all the high-street banks failed in 2008. (The Coop didn't; that failed in 2013, as a result of desperately incompetent management). There is currently no example in the UK of a high-street bank with management that inspires confidence.

Who? • April 27, 2018 12:34 PM

@ Petter

I wonder if they will try to pin this on the Russians...

Sure, the Spanish ministry of defense did it in the past:

https://www.express.co.uk/news/world/879168/Spain-Catalonia-crisis-Russia-Mariano-Rajoy-Alfonso-Dastis-Carles-Puigdemont

Spanish government is seriously considering setting up a "Ministry of Truth" (à la 1984) that will decide whether a given news item is fake or not and punish any source that publishes anything labelled as "fake" (obviously not the relationship between Russia, Julian Assange and the crisis in Catalonia).

I fear for the health of democracy in that nice country in the south of Europe.

echo • April 27, 2018 1:20 PM

@AlanS

This is why I'm a little quiet. The magnitude of the stupidity is one of those 'where do you begin?' things. The establishment is long overdue its 'Cosby moment'.

@Denton Scratch

Good write up.

AlanS • April 27, 2018 2:17 PM

@Echo

I keep wondering when the British government will reach the pinnacle of stupidity, incompetence and cruelty. Every week they seem to manage to outdo the previous week. If I wasn't aghast, I'd be impressed. Will they top Windrush? Seems difficult but based on past performance I wouldn't bet against it.

Gondar • April 27, 2018 2:55 PM

@ Petter

Why, have Lavrov and Putin both made implausible mock-indignation denials already?

A press conference with some Syrian children perhaps?

Larry • April 27, 2018 4:24 PM

@RealFakeNews-
"There needs to be an NTSB investigation". National Transportation Safety Board?

Tatütata • April 27, 2018 5:31 PM

"There needs to be an NTSB investigation". National Transportation Safety Board?

For that, they'd have to find the black-box-which-is-actually-orange, but the data probably went the way of flight MH370, i.e., south...

AlanS • April 27, 2018 5:58 PM

@Petter

Spanish government is seriously considering setting up a "Ministry of Truth" (à la 1984)
Ironic given that the author of 1984 had his defining life experience in Barcelona. One must also remember that Spain is not Germany. Franco didn't die in a bunker.


AlanS • April 27, 2018 9:30 PM

@Ratio

Quite. Orwell narrowly survived his encounter with both in Catalonia.

Frances • April 28, 2018 12:29 AM

I offer another fine screw up in the Canadian government's new payroll system, called Phoenix. Phoenix was set up by the previous government of Stephen Harper and something like four years later, it still isn't working right.

RealFakeNews • April 28, 2018 3:41 AM

Uhh... that should have read "NTSB-style". I'm sure you knew that though. :P

Point is, too many software projects fail in this way, and no-one is seemingly held accountable, apparently no investigation is held into why it failed, and the software industry as a whole has no reference on what happened, why, or how to avoid such problems in the future.

The reason bodies like the NTSB exist is to determine why something happened, so everyone can be just a little wiser in the future.

I see software as being no different from structural engineering, or other safety-critical area. It's highly complex, yet there seems to be a total lack of interest when it goes wrong beyond fixing the immediate problem.

Just look at some of the posts here in this thread: "it was a management problem". That does not adequately explain the totally outrageous behavior of the software leading to major systems failure.

Just A Guy posted what appears to be the systems in use in this failure scenario. It reads like a list of buzz-words that someone who has no idea what they're talking about cooked up in some meeting because it sounds impressive.

AngularJS?? The best they can offer is some massive, bloated JAVASCRIPT?

I know there is this obsession with the cloud, and turning web browsers into THE universal program, but you'd think they'd learn by now.

I'm also concerned at them pulling software off GitHub. Imagine an attack scenario: someone knows ahead of time that banking software will use a piece of open-source software. What is preventing someone poisoning the software to leak information?

If they don't have the time to write it themselves, I would doubt they have time for a thorough code-review, either.

Call me naïve (or a genius?) but how does asynchronous software mean that transactions get "confused" so a transaction between A and B ends up being between A and C with the transfer amount of D? I just can't begin to understand how that type of failure can happen (I have worked with asynchronous transactions and never had such problems).
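For what it's worth, here is a minimal sketch of one generic way an "A to B becomes A to C" mix-up can arise in asynchronous code. It is not a claim about what happened at TSB. The class `TransferHandler` and its fields are hypothetical; the assumed bug is that per-request data is stored on an object shared by all in-flight requests.

```python
import asyncio


class TransferHandler:
    """Hypothetical handler that wrongly keeps per-request data on a shared object."""

    def __init__(self):
        self.payee = None
        self.amount = None

    async def transfer(self, payer, payee, amount, delay):
        # Bug: these fields are shared by every request using this handler.
        self.payee = payee
        self.amount = amount
        # Stand-in for a slow call (database, message queue, remote service).
        await asyncio.sleep(delay)
        # By now another request may have overwritten the shared fields.
        return (payer, self.payee, self.amount)


async def main():
    handler = TransferHandler()  # one instance serving both requests
    return await asyncio.gather(
        handler.transfer("A", "B", 100, 0.02),
        handler.transfer("X", "C", 250, 0.01),
    )


results = asyncio.run(main())
for payer, payee, amount in results:
    print(f"{payer} pays {payee} {amount}")
```

Under these assumptions the first request asked for A to pay B 100 but is completed as A paying C 250, because the second request overwrote the shared fields while the first was waiting. Keeping the values in local variables inside `transfer` removes the problem.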

Seeing as we appear to have some people with knowledge of the TSB problems, maybe they can shed some light on what happened? If not now, in the future.

Denton Scratch • April 28, 2018 4:32 AM

@RealFakeNews

'Just look at some of the posts here in this thread: "it was a management problem".'

Well, unless the company is a one-man show, it's *always* a management problem. If the staff are incompetent, management's job is to know about that and fix it. If there aren't enough staff, that's a management problem. Management is there to make sure the business operates smoothly. In particular, migrating from an old and hairy legacy system to a shiny, new Javascript/EJB buzzword-based system is almost by definition a management problem.

Of course there must have been a failure of QA; of course they should have had a way of rolling back the change. No doubt there were greybeard techs telling management that without QA and rollback, they shouldn't throw the switch. The greybeards were presumably overruled by management, who were looking nervously at the bill that Lloyds were presenting for another year's rental of the Lloyds core banking system (I can't find the number right now, but I think Lloyds were charging Sabadell like £70 million a year).

Techies are not hired to manage banks.

I'm not sure what your point was about AngularJS, but I think it is an appalling idea. If they are using AngularJS in their frontend, then it would be no use to me anyway. I distrust scripts. But most of that buzzword bingo stuff is to do with the frontend; it wouldn't be such a big deal if this was just a broken user-interface. But what Sabadell have tried to do is to migrate the TSB accounts to a brand-new backend, based on their own in-house backend. My guess is that this modified Sabadell backend is what has caused all these problems.

By the way, does anyone else share my distaste at the idea of having their banking transactions processed in AWS?

Douglas L Coulter • April 28, 2018 8:42 AM

I'll pretend to be an utter noob for a moment here.
To those suggesting the two systems be run in parallel for a while...presumably so that mistakes made by one can be detected and/or resolved via comparison with the other...

Is not the very nature of this what a database expert would call an atomic transaction? Which would you trust in a disagreement, and how do you add a layer of checking on this fast enough to prevent money sent to the wrong place from getting there (where possession is 99.9% of the law)? This is a classic example in database books: when transferring money from one account to another, even within the same customer, an error can generate a gain or loss for that customer (and the reverse for the bank).
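As a reminder of what that textbook example looks like, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names and the `transfer` helper are assumptions made for illustration; a real core banking system is nothing like this simple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates happen or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as well.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer(conn, "A", "B", 30)    # succeeds
second_ok = transfer(conn, "A", "B", 500)  # A cannot cover it, nothing changes
balances = dict(conn.execute("SELECT id, balance FROM accounts").fetchall())
print(first_ok, second_ok, balances)
```

The credit is deliberately issued before the debit so that the failed second transfer shows the rollback: B never keeps money that A did not give up.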

Seems to me some extra-magical thinking is involved in "just do this trivial thing", and it's pretty much evident every time the word "just" is used, though especially egregious here.

The Space Shuttle needed more than two computers voting...and a heck of a lot of extra logic to count the votes and make real decisions. The same would effectively have just shut banking down to such a slow rate as to be meaningless.

I agree some form of physical money should be kept around, though paper wouldn't be my first choice as fiat printers play games with that all through history (going back even before "coin clipping"). All world "reserve" currencies fail, just a matter of time. One could argue that a lot of the current world thrashing indicates the dollar is on the way out.

The real reasons "they" want to eliminate cash make an interesting list. I'm sure to have missed some here.

1* You can't have negative interest rates if there's cash.

2* You can be fairly anonymous with cash (tech is making that less likely with ocr and serial numbers).
2A* This implies tax evasion and other "crime" is as possible for the little guy as for the big outfit in our increasingly "just-us" system.

3* It's harder to surveil and control people without the metadata easily gathered in compact form with electronic funds transfers. (Interesting that surveil was not in my spell checker...). You wouldn't want to have to have humint to detect and nip a revolt in the bud while the negative publicity footprint can be kept low.

4* On the other hand, effectively trackless money transfers done by governments to promote terrorism and false flags, line pockets in the MIC and so on are far easier with pure electronic money.

5* Printing out of thin air and fractional reserve financialization would be impossible at the current scales if there had to be as much physical money as there is fake stuff - and every dollar "created" by these means makes your holdings worth fewer of them - twice the bucks with the same amount of goods == half the value per buck. Sooner or later that arithmetic comes home to roost - we all know the latest example.

Ratio • April 28, 2018 9:00 AM

Warning signs for TSB's IT meltdown were clear a year ago – insider:

When TSB split from Lloyds Banking Group (LBG), a move forced by the EU as a condition of its taxpayer bailout in 2008, a clone of the original group’s computer system was created and rented to TSB for £100m a year.

That banking system was a “bodge of many old systems for TSB, BOS, Halifax, Cheltenham and Gloucester and others” that had resulted from the “nightmare” integration of HBOS with Lloyds as a result of the banking crisis, according to one insider who had extensive access to and intimate knowledge of LBG and TSB’s internal systems over a prolonged period.

[...]

“The time period to develop the new system and migrate TSB over to it was just 18 months,” the insider said. “I thought this was ridiculous. TSB people were saying that Sabadell had done this many times in Spain. But tiny Spanish local banks are not sprawling LBG legacy systems.”

To make matters worse, the Sabadell development team did not have full control – and therefore a full understanding – of the system they were trying to migrate customer data and systems from because Lloyds Banking Group was still the supplier.

“This turned what was a super-hard systems job [into] a clusterfuck in the making,” the insider said.

By March 2017, the nightmare for customers that was going to unfold a year later appeared inevitable. “It was unbelievable – hardly even a prototype or proof of concept, yet it was supposed to be fully tested and working by May before the integration work started,” the insider continued. “Senior staff were furious about the state it was in. Even logging in was problematic.”

But since renting the old system was costing Sabadell £214m in 2017

tfb • April 28, 2018 2:49 PM

Just to add to what someone else has said about Lloyds: it's fairly silly to assume that they would not support a rollback, because they get paid for providing infrastructure to TSB and I'm sure TSB cover any regulatory fine due to not separating their infrastructure.

The person or people who made the decision not to roll back the change, at whatever point on Sunday (long enough before the start of the business day on Monday) that decision needed to be made, was almost beyond doubt within TSB.

RealFakeNews • April 28, 2018 9:50 PM

@Denton Scratch:

AngularJS: I have no time for any of these JS libraries. I also appreciate much of what was listed was front-end.

I was lost: are they using AWS or Netflix? Does Netflix use AWS? Why the hell are we even using the word "Netflix" in the context of online banking?!

Have software developers lost their minds? I understand the idea behind code re-use, but I think things have now got silly.

I also agree with your point that most issues can be "managerial", but at some point the "rubber has to hit the road" and the people actually doing the work need to do what they can to correct the mess, or prevent it in the first place.

People can blame managers as much as they like - the software devs ultimately created a system that flat-out didn't work properly.

Even if they did test for 500 concurrent users, this STILL fails to adequately explain the oddities in data processing.

RealFakeNews • April 28, 2018 10:09 PM

@Douglas L Coulter:

To test the systems in parallel, simultaneously, with live data, wouldn't necessarily be hard to do.

1) Copy 10000 customer records

2) Write some code that triggers the equivalent function in the other system

3) Compare
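The three steps above can be sketched roughly as follows. Everything here is hypothetical: `old_system_debit` and `new_system_debit` stand in for the equivalent function in each system, the records are made up, and the "new" system carries a deliberate bug so the comparison has something to find.

```python
from decimal import Decimal


def old_system_debit(balance, amount):
    """Stand-in for the legacy system's behaviour."""
    return balance - amount


def new_system_debit(balance, amount):
    """Stand-in for the new system, with a deliberate bug: pence are dropped."""
    return balance - Decimal(int(amount))


def compare_systems(records, amount):
    """Run the same debit through both systems and collect any disagreements."""
    mismatches = []
    for record in records:
        old = old_system_debit(record["balance"], amount)
        new = new_system_debit(record["balance"], amount)
        if old != new:
            mismatches.append((record["account"], old, new))
    return mismatches


# Step 1: a copied sample of customer records (three here instead of 10000).
sample = [{"account": f"{i:04d}", "balance": Decimal("100.00")} for i in range(3)]

# Steps 2 and 3: trigger the equivalent function in each system and compare.
print(compare_systems(sample, Decimal("10.00")))  # whole pounds: no mismatch
print(compare_systems(sample, Decimal("10.50")))  # pence expose the bug
```

Note that the whole-pound case passes cleanly, which is the catch with this kind of comparison: it only finds the bugs that the chosen records and operations happen to exercise.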

Given we are talking online banking, and regular personal accounts, the individual transaction rate per account will be very low (perhaps 1 every minute or so, for the duration of the session).

I'm not aware of many personal accounts having stock-market levels of transactions.

The fact a customer can't even log into the system without perhaps seeing someone else's account information is beyond comprehension.

This is not only easy to do; just about every system in existence does this! This is akin to a permanent failure of basic math, where 2+2=5 every time.

Let's not lose sight of the actual problem. The size of the system isn't exactly huge, either. Facebook and Google run far larger systems.

I really think people are in denial about what has happened here. That, or they can't believe it actually happened at all.

The worst part is, I'm sure this will happen again in the future, because the basic problems won't be addressed.

We need to cut the "Agile" crap from software development, and move back to actually designing the system on paper before writing a single line of code.

Z.Lozinski • April 29, 2018 4:59 AM

There is an aspect of this problem that I have not seen described in the press coverage. TSB is a UK bank, and its products and processes are standard UK retail banking products. (Think thousands of individual products). TSB was part of Lloyds TSB until the demerger, and carried on using an instance of Lloyds TSB's core banking system. Sabadell is a Spanish bank.

I had accounts at two different UK financial institutions that were taken over by a Spanish bank over the last 10+ years. One was Abbey National (a UK retail bank), the other Alliance & Leicester (a UK building society, or Savings & Loan for US readers). Both were taken over by Santander. One of the things that is noticeable is that Santander runs its UK business using Spanish banking processes, presumably as it wants to use the same core banking system globally. One of the effects is that banking services that are standard in a UK retail bank are no longer available from Santander, once they switched to Santander's core banking system. An example: in the UK you can pay household bills at a branch of your retail bank using a cheque and giro-slip (a pre-printed routing form that directs payment to the account of the electricity company). You can't do this at Santander. I also noticed some interesting behaviour with my mortgage (home loan) but that might be down to staff training.

Speculation: Has the mismatch between complex UK financial products and a core banking system designed to offer a set of simple (so low cost to serve) products been part of the problem?

And the security angle is when you try to map processes across systems designed with different assumptions. We will see more of this with mergers and acquisitions, and especially cross-border M&A.

Disclaimer: As Clive worked out many years ago, I work for IBM (who have been retained by the TSB) but I have no direct knowledge of this engagement.

RealFakeNews • April 29, 2018 5:10 AM

@Z.Lozinski:

Speculation...

While one could argue for it being a reason for minor glitches or problems, that doesn't explain basic failures such as people logging in and seeing the wrong account.

Something serious is broken at a very basic level.

albert • April 29, 2018 11:35 AM

@RealFakeNews, etc.

Re: NTSB. It's unique, IMO. A thorough and well-run gov't agency. Why? Because it is toothless. It cannot determine policy, or punish anyone. I'm not aware of any gov't agency that's capable of 'NTSB-like' investigation of software disasters*. Such investigations are the province of private corporations, and unlike the NTSB, they don't have websites where you can read the records of an investigation in fine detail. I would suspect that contracts for 'investigations' of software anomalies include draconian NDAs. "We fixed the problem but we can't tell you what it was." Like many other software-based systems, banking systems are bunches of black boxes, strung together with fingers crossed and votive offerings to the goddess of computerization (whom I suspect was Pandora**).

----------
*or anything else, for that matter.

. .. . .. --- ....

Z.LozinskiApril 29, 2018 2:23 PM

@RealFakeNews,

I agree that there is something broken in the implementation.

My point was that if you take a system designed on the assumption of simple banking processes and customer interactions and try to fit a more complex set of requirements onto it, you will be performing additional processing, and generating more load. Experience shows that this is when systems break. Web Application Servers, caches, load-balancers all perform poorly if the offered workload is significantly greater than they were dimensioned to process.

Equally, someone may just have mis-sized the system. Now, you would hope this would be spotted during performance and load testing. Maybe it wasn't.
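
To put rough numbers on the load point above, here is a minimal sketch using the textbook M/M/1 queueing formula, where mean response time is 1 / (service rate - arrival rate). The single-queue model, the function name and the 100 requests/second capacity are all assumptions for illustration; nothing here is taken from the TSB systems.

```python
import math


def mean_response_time(service_rate: float, arrival_rate: float) -> float:
    """Mean time in an M/M/1 queue (waiting plus service), in seconds.

    Rates are in requests per second. Returns infinity once the offered
    load reaches or exceeds capacity, because the queue then grows
    without bound.
    """
    if service_rate <= 0 or arrival_rate < 0:
        raise ValueError("service_rate must be positive, arrival_rate non-negative")
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)


if __name__ == "__main__":
    capacity = 100.0  # hypothetical: server dimensioned for 100 requests/second
    for offered in (50.0, 90.0, 99.0, 120.0):
        print(f"{offered:6.1f} req/s -> {mean_response_time(capacity, offered)} s")
```

The response time rises slowly at first and then very steeply as the offered load approaches what the system was dimensioned for, which is the behaviour described for servers, caches and load-balancers.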

ATNApril 30, 2018 4:02 AM

@ RealFakeNews:
> Something serious is broken at a very basic level.

Contrary to what JavaScript programmers and managers will tell you, you cannot take a random piece of software (free on the net) and make it work by testing it and fixing its bugs.
When you ask a COBOL programmer to add two numbers, he will ask you why; he needs to understand the context: what should the software do if it is made to credit a negative number, what if it credits $100 billion, what if it credits $0.00?
Only software made to be tested can be properly tested. Usually you will find the software itself on the net under a liberal license, but its testing tools are not released (though they may be available under another license).
JavaScript programmers will usually add two numbers without asking any questions, for a lot less money than COBOL programmers.
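
A minimal sketch of what "asking the questions" could look like in code, written in Python rather than COBOL. The `credit` function, the `MAX_SINGLE_CREDIT` ceiling and the two-decimal-place rule are hypothetical choices made for illustration; a real bank would take such rules from its own policy.

```python
from decimal import Decimal

# Hypothetical per-transaction ceiling, chosen only for this example.
MAX_SINGLE_CREDIT = Decimal("1000000.00")


def credit(balance: Decimal, amount: Decimal) -> Decimal:
    """Return the new balance after crediting `amount`, rejecting suspect inputs."""
    if not isinstance(amount, Decimal):
        raise TypeError("amount must be a Decimal, not a float or int")
    if not amount.is_finite():
        raise ValueError("amount must be a finite number")
    if amount <= 0:
        # Covers both the negative credit and the $0.00 credit.
        raise ValueError("a credit must be strictly positive")
    if amount > MAX_SINGLE_CREDIT:
        # Covers the $100 billion credit.
        raise ValueError("amount exceeds the single-credit ceiling")
    if amount != amount.quantize(Decimal("0.01")):
        raise ValueError("amount must have at most two decimal places")
    return balance + amount
```

The addition itself is one line; everything else is the context that has to be understood before that line is safe to run.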

GweihirApril 30, 2018 9:31 AM

This is not a problem you can test for, except with the actual migration. One issue is that only a small number of customer records may cause problems. A second is that you cannot simply copy data over to the target system and have it "live" there as well.

Hence what you do is, you make _very_ sure you can roll the change back. But that costs money and the ones making decisions about these things are clueless and under pressure to save money. I have personally been involved in convincing a bank that is a bit larger to not do a migration to a new e-banking system where they a) had no way to roll back and b) manual processing (their emergency plan) would have needed months to ramp up. The second because the people in charge were not able to run simple numbers. In addition, the target system was not even ready a few weeks before the migration and forget about "well tested" or the like. To add to that, the "test environment" was different enough from the productive system to make tests there pretty useless.

This was an existential risk for the bank but it needed a lot of convincing for them to see that. Of course, once they understood, the migration was off. The staggering thing about these things is how little money taking these extreme risks actually saves in the end. Often it is only a few millions or even far less.

The root-cause is a complete non-understanding of how things work by the decision-makers, often amplified by a culture of "shoot the messenger" that eliminates the people that understand how things work and are willing to speak up. On the plus side, TSB is now a nice "reference catastrophe". Even better if they die as a result, preferably fast (cannot fix the issue), but I will settle for slow (customers all leave).

TRXApril 30, 2018 4:20 PM

The doubleplus ungood part of this is, that's a textbook example of a system that could be moved over in stages. Export three or four customers, convert, import to the new system. Then thirty or forty. Then all the "Aa". Then the rest of the As, and each letter in turn.
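
A rough sketch of that staged approach, assuming the conversion, verification and rollback steps exist as separate callables. All names here (`staged_batches`, `migrate_in_stages`, `migrate`, `verify`, `rollback`) are hypothetical, and the batches grow by size rather than by letter of the alphabet to keep the example short.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def staged_batches(items: Sequence[T], first: int = 4, growth: int = 10) -> Iterator[Sequence[T]]:
    """Yield batches that start small and grow by a factor of `growth` each stage."""
    if first < 1 or growth < 1:
        raise ValueError("first and growth must both be at least 1")
    start, size = 0, first
    while start < len(items):
        yield items[start:start + size]
        start += size
        size *= growth


def migrate_in_stages(
    customers: Sequence[T],
    migrate: Callable[[Sequence[T]], None],
    verify: Callable[[Sequence[T]], bool],
    rollback: Callable[[Sequence[T]], None],
) -> List[T]:
    """Migrate batch by batch; on the first failed check, undo that batch and stop.

    Returns the customers whose migration was verified.
    """
    done: List[T] = []
    for batch in staged_batches(customers):
        migrate(batch)
        if not verify(batch):
            rollback(batch)  # only the failing batch needs to be undone
            break
        done.extend(batch)
    return done
```

The point of the shape is that a bad record shows up while only a handful of customers are affected, and the rollback at each stage is small enough to be practical.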

I sympathize with the customers, but I have no sympathy with the IT people at all. I used to work for a "medical management" company that did IT for anything from individual practices to small hospitals, and I personally did the data conversions on several of them.

It ain't rocket surgery. Just industrial-grade incompetence.

HMApril 30, 2018 11:16 PM

@TRX

Definitely seems like it would have been less pain if they'd moved people in stages.

Also, the reports imply there is a new backend as well as a new internet banking frontend that is having login issues. If so, it seems it would have been better to migrate those in stages too, i.e. the new internet frontend first. If the system used by people in the branches had not changed at the same time, then people having internet banking problems could get correct answers in a branch for now.

One other comment is that one of the news reports said staff had been running the new platform since November, but many of the reported problems seem to be with business banking. Did TSB test only on its staff? Presumably few if any of its employees have business accounts, because they are employees.

echoMay 1, 2018 12:51 PM

@ATN

There has been a multi-generation change from data processing, to hacking, to frameworks. There have also been a lot of shifts from finance, to hackers, and back to management over this time. Different people care about different things.

@TRX

I agree. Losing touch with the basics seems to affect a lot of organisations. My sense is the office politics and being insulated from the impact may be part of why. "It's just a job." A means to an end...

Photo of Bruce Schneier by Per Ervland.

Schneier on Security is a personal website. Opinions expressed are not necessarily those of IBM Resilient.