When our deploy filled the hospitals and ended up on national news.
You are about to read a story about an event that occurred 10 years ago. Our software deployment ended up on national TV and in news articles as a ‘healthcare breakdown.’ Patients had to stay longer in hospitals during Christmas. The hospitals were getting full – all because of us.
Introduction
Exactly 10 years ago, back in 2013-2014, I was part of a software engineering team both managing and developing a healthcare system used in southern Sweden. At its core, the system essentially created and managed the handshake between state and municipality when patients transitioned, for example, from staying at a hospital to returning to a care home.
I have many stories surrounding this system and our great team. This time, I’ll tell the story of how a bug and multi-layer outsourcing caused havoc in healthcare.
The Christmas deployment
We had been working on a set of new features and were doing one last deploy before the Christmas holidays. Our setup was the classic non-cloud stack of the time: TeamCity, MSSQL, BizTalk, Windows servers, and the usual stuff back then. As usual, we had tested everything in the different environments, and this time we even had two interns executing different testing scenarios before the release (I’ll probably write another article on why this was really bad practice). All looked fine.
The morning after: Chaos unfolds
Anyway, we went ahead and deployed to production. We followed a strict process, and everything looked good after the deploy: no errors from the database or the application server, and the integration platform, BizTalk, was responding as intended.
The nature of the system was that it was used heavily during the day and almost not at all in the evenings and at night. So we deployed, performed rudimentary testing, and went to bed, waiting for the many, many users to log on.
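To give a sense of what “rudimentary testing” meant here, think of a post-deploy smoke check roughly along these lines. This is a minimal sketch in Python rather than what we actually ran, and the host names, port, and health endpoint are made up for illustration.

```python
# A minimal post-deploy smoke check (illustrative sketch only; the host names,
# port, and health endpoint are hypothetical, not our actual checks).
import socket
import sys
import urllib.request

APP_HEALTH_URL = "http://app01.internal/health"  # hypothetical application endpoint
DB_HOST, DB_PORT = "db01.internal", 1433         # hypothetical MSSQL host, default port

def app_responds() -> bool:
    """The application server should answer the health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(APP_HEALTH_URL, timeout=10) as resp:
            return resp.status == 200
    except OSError as exc:
        print(f"Application check failed: {exc}")
        return False

def db_reachable() -> bool:
    """The database server should at least accept a TCP connection."""
    try:
        with socket.create_connection((DB_HOST, DB_PORT), timeout=10):
            return True
    except OSError as exc:
        print(f"Database check failed: {exc}")
        return False

if __name__ == "__main__":
    ok = app_responds() and db_reachable()
    print("Smoke check passed" if ok else "Smoke check FAILED")
    sys.exit(0 if ok else 1)
```

Checks like these all came back green. As the next morning showed, though, a green smoke test says nothing about the daytime workflows where the real users live.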
The morning after our deploy, reports started to come in. And they were not good. In fact, they were catastrophic. When dealing with systems like this, it is absolutely crucial that everything works. In the end, there are real people impacted. It’s not like e-commerce where issues might mean lost sales; here, it affected people’s health and well-being.
Emergency measures
We decided to shut down the system. It was probably the first time we did this, and it felt terrible. In healthcare, everything has a redundancy process, and in this case, it was fax. The nurses and other healthcare staff now had to work with old fax machines to get work done. I felt ashamed. We all did.
Diagnosing and addressing the issue
We worked around the clock and found the serious bug fairly quickly. After major testing, we decided to deploy an emergency fix to get the system up and running before the Christmas holiday.
Remember, if the handshake doesn’t work, the patient can’t go home. Celebrating Christmas in the hospital due to our bug? Not on our watch, or at least we thought so.
The unexpected roadblock
We initiated the deploy. Staging went fine, production not so much.
The server logs immediately showed our deploy failing. I’m not sure we had ever seen anything like it. The exact same package and procedure had succeeded in every other environment. We started to trace and debug everything we could get our hands on.
Outsourcing, outsourcing everywhere
There was something stopping us, though: we had trouble tracing the network traffic. At this time, the healthcare IT environment was outsourced across multiple layers. The database and servers were one supplier, BizTalk and integrations another, and the network a third. Back then, in 2013-2014, this was common across our different clients, and we made it work.
However, multi-layer outsourcing right before the holiday period turned out to be a small disaster for us. While we could see that the production servers were not to blame, we had trouble accessing the DB, BizTalk, and network logs. We started to work through the different outsourcing partners. Many people were on holiday and not responding to our requests. Eventually, we established contact with two system admins who offered to come to our office and work with us.
When the system administrators arrived at our office, the difference in expertise was clear. They lacked the deep understanding that we had developed over time. Eventually, it came to a point where we were at their side, guiding them command by command on what needed to be done. Looking back, I still don’t understand what our client gained from relying on these system administrators instead of allowing us full access to manage the situation.
Eventually, we had everything we needed. Gathering the data from all the involved parties took many, many days. It was extremely stressful.
The problem was, we couldn’t spot the cause. Not even with all these logs and all the outsourcing partners helping out. The stress building up was insane. And as icing on the cake, not only were local news outlets covering our botched deployment, by now it was also on the national TV news.
Back home, I had a 6-month-old baby waiting, and I had already been short on sleep for a few months. I couldn’t handle all the pressure and had to take a few days off in the middle of the chaos. At the time, I felt embarrassed. Now, I am proud that I said it as it was, and my manager fully understood and supported me.
One of my colleagues kept pushing all through Christmas Eve and even New Year’s Eve.
Remember, we had the emergency fix; we just couldn’t get it deployed. It was like having the key to unlock the door and make our problems go away, but the lock had suddenly changed.
We investigated everything. It was tempting to do a full re-install of the servers, but we did not have the permission or mandate for operations like that; administration was outsourced. We started to look at server settings, library versions, and much more.
Side note: In situations like these, if you start to change the production environment and don’t immediately do the same on staging, you have lost staging. Doing it by the book means you change staging first and then propagate to production. The environments would simply no longer be the same if we started to tinker. The whole point of staging is to be a reflection of production, but in a safe space with no real users. In tough situations, it’s always tempting to fix production first and mirror it to staging later. Decisions like these are typically very stressful for engineers, but often go unnoticed by leadership.
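To make the side note concrete, here is a minimal sketch of the kind of drift check I wish we had had: it diffs two snapshots of server settings and library versions dumped to JSON, one from staging and one from production. The snapshot files and their format are assumptions for illustration; nothing like this existed in our setup.

```python
# Sketch: compare two environment snapshots (e.g. server settings and library
# versions dumped to JSON) and report where production has drifted from staging.
# The snapshot files and their format are hypothetical.
import json

def load(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def diff(staging: dict, production: dict) -> list[str]:
    findings = []
    for key in sorted(set(staging) | set(production)):
        s, p = staging.get(key, "<missing>"), production.get(key, "<missing>")
        if s != p:
            findings.append(f"{key}: staging={s!r} production={p!r}")
    return findings

if __name__ == "__main__":
    drift = diff(load("staging_snapshot.json"), load("production_snapshot.json"))
    if drift:
        print("Environments have drifted:")
        print("\n".join(drift))
    else:
        print("Staging and production snapshots match.")
```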
The firewall breakthrough
After a while, we realized we had missed one layer in the infrastructure. We had never thought of it as an active component; it wasn’t even visible in the network diagrams supplied to us. While we had the network logs, we had no knowledge of the internal network firewall. Once we understood this, we started to investigate which outsourcing partner was operating the firewalls. During the investigation, we could see that a new packet inspection algorithm had been installed between our deploys! And only in the production infrastructure. Can you imagine? I still can’t believe this happened.
Essentially, it was like this: between our deploy with the bug and our second deploy with the emergency fix, the packet inspection had been applied to the production infrastructure. This was maximum bad luck for us, as the two were so close in time. The scenario was simply unimaginable.
The packet inspection mechanism flagged our package, cut it in half, and let it travel to its destination. I have never seen this behavior before or since. It was probably a misconfiguration.
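A package truncated in transit is at least cheap to detect once you suspect it. A sketch along these lines, assuming the build server publishes a SHA-256 hash next to the package (we had nothing of the sort, and the file names are hypothetical), would have flagged the mangled artifact immediately:

```python
# Sketch: verify that a deployment package arrived intact by comparing its
# SHA-256 hash against the one recorded on the build server. File names are
# hypothetical; we had no such check at the time.
import hashlib
import sys

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    package, expected_hash_file = "release.zip", "release.zip.sha256"
    expected = open(expected_hash_file, encoding="utf-8").read().split()[0]
    actual = sha256_of(package)
    if actual != expected:
        print(f"Package corrupted in transit: expected {expected}, got {actual}")
        sys.exit(1)
    print("Package integrity verified.")
```

It would not have named the firewall as the culprit, but it would have pointed us toward the network much earlier.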
Now began the hunt for an outsourced firewall technician who wasn’t on holiday. Eventually, we got hold of someone who could help us by adjusting the new packet inspection algorithm.
Once the packet inspection issue was resolved, our emergency fix was deployed successfully. The system was back up, the critical bug fixed, and healthcare workers could finally abandon their fax machines.
Conclusion: Lessons learned
This experience was rich in lessons learned. It highlighted the dangers of fragmented outsourcing and underscored the importance of having technicians who have a comprehensive understanding of the entire infrastructure.
It also strengthened my belief that separating responsibilities between different layers adds complexity without much added value. This includes the division between software engineers and deployments handled through extensive CI/CD solutions, now regarded as best practice. Because of this division, I’ve observed countless instances where engineers resort to sending thousands of Slack messages, querying system engineers for logs, traces, and more. I remain unconvinced of the advantages of this kind of separation.
For me personally, this experience has reshaped my perspective on work. I am now much better equipped to handle challenging situations, and I believe this is partly due to what I learned from this incident.
For example, when I later worked in e-commerce and saw product people stressed out because they were missing analytical data due to an error from one of my teams, I often reflected on this earlier incident. I would think, ‘At least they don’t have to revert to using fax machines. We will probably be just fine.’
I have omitted names and certain details for confidentiality, but I hope you still enjoyed this old story of mine.