Facebook Is Down

Facebook—along with Instagram and WhatsApp—went down globally today. Basically, someone deleted their BGP records, which made their DNS fall apart.

…at approximately 11:39 a.m. ET today (15:39 UTC), someone at Facebook caused an update to be made to the company’s Border Gateway Protocol (BGP) records. BGP is a mechanism by which Internet service providers of the world share information about which providers are responsible for routing Internet traffic to which specific groups of Internet addresses.

In simpler terms, sometime this morning Facebook took away the map telling the world’s computers how to find its various online properties. As a result, when one types Facebook.com into a web browser, the browser has no idea where to find Facebook.com, and so returns an error page.
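
To make that concrete, here is a minimal Python sketch of what a failed lookup looks like from the client side. It is only illustrative: a browser does essentially the same thing through its stub resolver, and the exact error depends on how your resolver reports the failure.

    import socket

    def resolve(hostname):
        """Ask the local stub resolver for addresses, as a browser would."""
        try:
            infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as err:
            # With the authoritative name servers unreachable, lookups
            # eventually come back as resolution errors like this one.
            print(f"could not resolve {hostname}: {err}")
            return []

    print(resolve("facebook.com"))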

In addition to stranding billions of users, the Facebook outage also has stranded its employees from communicating with one another using their internal Facebook tools. That’s because Facebook’s email and tools are all managed in house and via the same domains that are now stranded.

What I heard is that none of the employee keycards work, since they have to ping a now-unreachable server. So people can’t get into buildings and offices.

And every third-party site that relies on “log in with Facebook” is stuck as well.

The fix won’t be quick:

As a former network admin who worked on the internet at this level, I anticipate Facebook will be down for hours more. I suspect it will end up being Facebook’s longest and most severe failure to date before it’s fixed.

We all know the security risks of monocultures.

EDITED TO ADD (10/6): Good explanation of what happened. Shorter from Jonathan Zittrain: “Facebook basically locked its keys in the car.”

Posted on October 4, 2021 at 5:55 PM

Comments

Steve October 4, 2021 6:01 PM

If one were of a conspiratorial mind, one might be of a mind to suggest that this was an intentional “shot across the bow” by Facebook to remind the world just how “important” it is and the woe that would befall those who dare to tinker with it.

Etienne October 4, 2021 6:25 PM

I had a college professor who told us that if we were going to fail and make mistakes, it's better to do it now than later in life.

Facebook would have been better off to have had this problem in their youth. To have it happen now just makes them look stupid, and stupid in a way that you don’t want to meet them at a bar anymore, or let your daughter date them.

The gene pool is all we have, and Facebook is another example of why cousins shouldn’t marry.

Paul Demers October 4, 2021 6:34 PM

Maybe the someone who made the update watched the 60 Minutes story last evening on the Facebook "whistleblower" and decided they'd had enough of the corporate cover-ups as well.

The Real Q October 4, 2021 7:07 PM

The real question is, what entity pulled the plug? Perhaps an Alternative Intelligence? 😛

- October 4, 2021 7:16 PM

@Bruce Schneier:

“And every third-party site that relies on “log in with Facebook” is stuck as well.”

It is also, in effect, a DDoS of the DNS system.

@ALL:

Host names are not real; they are translated to IP addresses, and it is the IP addresses that are used to route traffic. The translation is done by a globe-spanning database called the "Domain Name System" (DNS), which is a hierarchical series of servers.

To keep the load down on the DNS servers and reduce traffic on the network, each host keeps a local cache of host-name-to-IP-address mappings, and it is this cache where most translations actually take place. The length of time a translation stays in the cache depends on several factors, but usually it's 20 minutes as a minimum.

So now imagine what happens when a host name cannot be resolved. Importantly, it's not in the host's local cache, so it has to send out a DNS request across the network.

If your host cannot resolve the host name to an IP address, then every time you click on a link to that host name, out goes a DNS request.

Whilst that is bad enough, consider all those web pages you can get to that also have those "Facebook buttons": your browser merely loading such a button can trigger a DNS request…

That’s way more DNS traffic than would be normal.

So the likes of Google and Cloudflare get affected, because they host the commonly used public DNS resolvers…
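
In toy form, the cache behavior described above looks something like this. The TTL and the upstream lookup are placeholders; real resolvers also do some negative caching (RFC 2308), but short negative TTLs and stub-resolver retries during the outage had much the same effect as "not cached at all".

    import time

    class ToyDnsCache:
        """Toy host-side cache: positive answers are kept for their TTL;
        in this model, failures are not cached, so every retry goes upstream."""

        def __init__(self, upstream, positive_ttl=1200):    # ~20 minutes
            self.upstream = upstream             # callable: name -> ip or None
            self.positive_ttl = positive_ttl
            self.cache = {}                      # name -> (ip, expires_at)
            self.upstream_queries = 0

        def resolve(self, name):
            entry = self.cache.get(name)
            if entry and entry[1] > time.time():
                return entry[0]                  # served locally, no network traffic
            self.upstream_queries += 1           # cache miss: a query goes out
            ip = self.upstream(name)
            if ip is not None:
                self.cache[name] = (ip, time.time() + self.positive_ttl)
            return ip

    # Simulate the outage: facebook.com never resolves, example.com does.
    cache = ToyDnsCache(lambda n: None if n == "facebook.com" else "93.184.216.34")
    for _ in range(100):                         # 100 clicks / embedded "Like" buttons
        cache.resolve("facebook.com")
        cache.resolve("example.com")
    print(cache.upstream_queries)                # 101: every failed lookup went upstream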

Count0 October 4, 2021 7:34 PM

For what it's worth, as someone who has installed and programmed a lot of access control systems (AKA badge readers): the panels are supposed to revert to internal memory to allow or deny access during network or server outages, cache the transactions, and upload them when the connection is restored. That's been the standard for literally 30 years. If their (FB) access control system is so poorly installed or managed that the panels can't make local decisions during outages, they need to fire their contractor and install a better system.
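
In toy form, that fail-to-local-memory behavior looks something like this. The class and method names are made up for illustration, not any vendor's API.

    import time

    class DoorPanel:
        """Toy badge-reader panel: decides locally from a cached allowlist and
        queues transactions for upload once the server is reachable again."""

        def __init__(self, cached_badges):
            self.cached_badges = set(cached_badges)   # synced while online
            self.pending_events = []                  # buffered during outages

        def badge_swipe(self, badge_id, server_online):
            allowed = badge_id in self.cached_badges  # local decision either way
            event = (time.time(), badge_id, allowed)
            if server_online:
                self.upload([event])
            else:
                self.pending_events.append(event)     # cache until the link returns
            return allowed

        def connection_restored(self):
            self.upload(self.pending_events)
            self.pending_events = []

        def upload(self, events):
            print(f"uploading {len(events)} event(s) to the access-control server")

    panel = DoorPanel(cached_badges={"alice-0042"})
    print(panel.badge_swipe("alice-0042", server_online=False))   # True: door opens offline
    panel.connection_restored()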

Rick Moen October 4, 2021 7:35 PM

This bit, from @cullend, made me as a DNS admin laugh hard:

"facebook.com is registered with "registrarsafe.com" as registrar. "registrarsafe.com" is unreachable because it's using Facebook's DNS servers and is probably a unit of Facebook. "registrarsafe.com" is registered with "registrarsafe.com"."

(I think their headquarters street address — the former Sun Microsystems
campus — needs to be changed from One Facebook Way to One Single Point of
Failure Way.)

V October 4, 2021 7:56 PM

The New York Times article Bruce links to has a ‘log on with Facebook’ button. I’ll just read the story from the papyrus version tomorrow morning.

Justa Techwannabe October 4, 2021 8:22 PM

Now I know why I couldn't log in this afternoon.
As of now, it seems OK.
Yes, I have a FB account. No, I don't give my info.

lurker October 4, 2021 8:53 PM

@SLF, All
I’m probably missing something, but isn’t this yet another reason for client side caching of DNS? I used to use pdnsd on another OS, set your own ttl, cache stored on disk, but it hasn’t had any TLC for too long so I’m reluctant to compile it for this OS.

SpaceLifeForm October 4, 2021 8:53 PM

@ Bruce

BGP should be a tag here.

Because the main problem was BGP, not DNS.

The DNS failures were a fallout problem.

Weather October 4, 2021 8:57 PM

BGP sends out update packets every 30 minutes; if a path metric changes, like a slower link (Cat6, fiber, dial-up) or more hops, it changes the table to a new route, and it passes the info on to approximately 5 routers every 30 minutes. By nmap scan they seem to run Red Hat, UDP 500.
Last time I checked, Cisco and Juniper routers were around $20-40k.

Even Facebook can do something right now and again 😉

Alex October 4, 2021 9:06 PM

Quite a shame that it was only down for a few hours. If it was down for a week, people might learn to actually be social again. Maybe they might even talk to one another…with their voices…or even better, face-to-face!

Who the hell uses the same servers for internal apps (ie: security, access control, HVAC, etc.) as external/public-facing? Especially at THIS level? I’m working for a small business and we keep all of our public-facing stuff in a completely separate facility.

…and who the hell uses Login with Facebook? Don’t they have enough access to your personal details? Login with Facebook has been a huge blessing to law enforcement and other snoopy types.

@Count0: I’m not surprised. Tech/upstart type companies and millennials think everything traditional = bad!! and they’ll roll their own with Java in the cloud. Of course, everything must have an iPhone app, especially when a regular webpage will do just fine.

Apollyon October 4, 2021 9:26 PM

This incident is proof that many people can be stranded and shut off from work, family and friends. It is clear that Facebook needs more oversight to ensure that their users won’t have to suffer such a widespread outage due to error or in-house sabotage. That WhatsApp and Instagram also went down adds insult to injury, from within.

SpaceLifeForm October 4, 2021 9:55 PM

@ lurker

The Great Hack Back

Overriding the DNS TTL via your local caching DNS resolver (e.g. pdnsd; been there, done that), or an actual recursive caching DNS resolver (e.g. Unbound; been there, done that), would not have helped in this situation at all.

The core problem was the withdrawal of the BGP routing information.

Even if you cached the ip address of the server, your upstream routers were blind as to how to route the packets. The routers no longer knew where to send the packets.

This was ultimately a BGP problem, not a DNS problem.
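
To make the point concrete: even with a cached answer in hand, the connection still dies, because no router will carry the packets. A tiny sketch, where 192.0.2.1 is a reserved documentation address standing in for an IP whose routes have been withdrawn:

    import socket

    cached_ip = "192.0.2.1"   # stand-in for an address whose BGP routes are gone

    try:
        # The DNS step is skipped entirely; we "know" the IP already.
        socket.create_connection((cached_ip, 443), timeout=3)
    except OSError as err:
        # With no route, packets are dropped and the connection times out
        # (or gets an ICMP unreachable), regardless of what DNS says.
        print(f"connection failed: {err}")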

Facebook has now royally screwed up. They were so concerned with getting back online that they failed to do proper IR.

They are pwned. It’s a good thing (TM).

- October 4, 2021 11:14 PM

@SpaceLifeForm:

“The core problem was the withdrawal of the BGP routing information.”

So they wiped themselves off of the map quite literally by failing to say ‘Hi to the world’.

“Facebook has now royally screwed up.”

That, I think, counts as the understatement of the week, and it's only Monday.

“They were so concerned to get back online, they failed to do proper IR.”

Ahh… Yes, potentially 'aftershocks' to come, but it might also well hide the 'Real Why' of what happened. After all, was it an insider or an outsider? Intentional or accidental?

Whilst it all has a funny side, I suspect being a fly on the wall at Facecrooks internal incident review would be entertaining.

Perhaps an OSINT pass over LinkedIn might reveal whose finger on which button caused the oopsie-downsie that knocked the parrot off of the perch.

Unless of course, if it was actually an outsider doing it intentionally… In which case that is one heck of a lot of bragging rights, if you can stay ahead of the snatch squads and rendition teams.

SpaceLifeForm October 4, 2021 11:32 PM

LOL

hxtps://engineering.fb.com/2021/10/04/networking-traffic/outage/

Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.

[Translation: This is how we are spinning this, because we have no clue]

SpaceLifeForm October 5, 2021 12:48 AM

Classic

Everybody wants…

hxtps://www.youtube.com/watch?v=aGCdLKXNF3w

[You may think this is way off topic. It is not. Dots]

SpaceLifeForm October 5, 2021 1:28 AM

@ lurker, -, ALL

This incident was a Seismic Metadata event. Big picture, tiny picture.

Let’s hope those that care about the future are paying attention.

SpaceLifeForm October 5, 2021 2:00 AM

@ lurker, -, ALL

I'm not seeing much discussion about how Facebook shot themselves in both feet.

Techs have to go to their core datacenter to restart.

Why?

Because there was no one inside. Covid-19.

And because they were stupid.

They put all of their eggs in the same basket.

They ran their own DNS COMPLETELY on their own network.

They created their own DNS and BGP Co-dependency.

They could not fix the BGP problem remotely, because DNS was dead.

They could not fix the DNS problem remotely, because BGP was dead.

The only way to fix it was to access the routers via IP address, and the only way to do that was to be on the LAN.

The call came from inside the house.

SpaceLifeForm October 5, 2021 2:34 AM

True words that were never spoken

hxtps://twitter.com/BasementTrix/status/1445185892323414025

It also pays to have a USB-A-to-Mini-USB cable (with a Micro-USB adapter), an Ethernet crossover cable, and a null-modem cable, just in case.

It has to do with being physically next to your router. And how the FB techs had to get to the datacenter to get things working again.

Ever use a floppy and a null modem cable to recover your router?

lurker October 5, 2021 4:23 AM

@SLF
Ever use a floppy and a null modem cable to recover your router?

Nope, but I once had to make an Ethernet crossover cable, because the people onsite who should have had one didn't.

1&1=Umm October 5, 2021 4:40 AM

@SpaceLifeForm

Ever use a floppy and a null modem cable to recover your router?

Back in the 90's you did not have much choice, especially with 486-based *nix systems and SPARC-based systems. I can remember sitting there with a box of 80-130 floppy disks building / rebuilding systems from bare-bones hardware. It was one way to lose a day of your life, each floppy in a set taking 90 seconds to be read in, then fingers crossed you typed the incantations correctly. Especially for 'hardening': one slip there and around you went again…

Then, unless you had a 'real network connection', you had to connect via modem to download the latest security patches… So you needed tools and a telephone patch set as well…

Even in the mid 2000's, Sun boxes often had to be brought back up with a crossover cable and terminal. But rather than a floppy, by then you had to plug in a CD-ROM drive, unless you wanted to carry your own "network device" around with you… I remember building such a network device out of a high-end IBM laptop that also did duty as a terminal… So both a serial crossover and a Cat5 crossover cable.

There are still times when this is the only way, and a good sysadmin will have a 'ring binder' or 'spring file box' with all the documentation and cables and disks in it sitting on the shelf. Even a little 4-port network hub, oh, and these days a USB hub and all those USB dongles and adaptor cables, and even, shock horror, a WiFi AP/router. Oh, and instead of a modem, a 4G LTE phone or mobile broadband dongle…

So yeah, sometimes still ‘same 5h1t different technology’.

1&1~=Umm October 5, 2021 4:45 AM

@ Dave

And nothing of value was lost.

Because the reality of FB is, nothing is of value to the actual users, nor for that matter the shareholders.

Peter A. October 5, 2021 4:56 AM

  1. I wonder who gets to change the industry he’s working for…
  2. Funny thing, I’ve just had a serious conversation with our soon-to-be-adults about FB, Insta etc. and THIS happens! Panic ensues amongst the younger members of the family (I was out at work and hadn’t a chance to notice), but fortunately the event was mentioned in the news on FM radio. Ufff… the world is fine, it is but a glitch. Dad was right, right?
  3. To err is human, to screw up royally is… whatever. Aaaah, the old days as the univ’s BoFH… I wanted to have ‘order’ in /etc/passwd by sorting it by UID.

sort -n -t: -k3,3 passwd

Can you see what the problem is here? DON'T DO IT AT HOME! And I did not have a backup of the passwd file… >10k entries gone…

THINK! QUICK!!! OK, shut down NOW and boot to single-user mode before anything gets to overwrite the disk blocks that once were the passwd file. It was an old Ultrix with funny fixed partition schemes. I dd'ed the root partition to a file on /usr (fortunately I had enough space), then grepped for passwd-format lines with a largish regexp. Took 20 minutes. Then I sorted the file (properly this time!), then removed duplicates and rubbish by hand. Half an hour later I was back up and running. It was a smallish system even by then-contemporary standards.
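
That recovery trick generalizes nicely: carve the raw image for anything shaped like a passwd entry. Below is a rough Python sketch of the same scan, assuming a dd image at ./root.img; the regexp is deliberately loose, and as Peter A. says, the output still needs duplicates and rubbish weeded out by hand.

    import re

    # Seven colon-separated fields with a numeric UID and GID: a loose /etc/passwd shape.
    PASSWD_LINE = re.compile(rb"^[A-Za-z_][A-Za-z0-9_.-]*:[^:\n]*:\d+:\d+:[^:\n]*:[^:\n]*:[^:\n]*$")

    candidates = set()
    tail = b""
    with open("root.img", "rb") as img:                 # raw partition image from dd
        for chunk in iter(lambda: img.read(1 << 20), b""):
            lines = (tail + chunk).split(b"\n")
            tail = lines.pop()                          # may be cut mid-line; carry it over
            for line in lines:
                if PASSWD_LINE.match(line):
                    candidates.add(line)
    if PASSWD_LINE.match(tail):                         # check the final fragment too
        candidates.add(tail)

    with open("passwd.recovered", "wb") as out:
        out.write(b"\n".join(sorted(candidates)) + b"\n")

    print(f"recovered {len(candidates)} candidate entries")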

MartinZ October 5, 2021 8:00 AM

I believe this will explain everything… from FB Engineering in May ’21

“To achieve the goals we’d set, we had to go beyond using BGP as a mere routing protocol. The resulting design creates a baseline connectivity configuration on top of our existing scalable network topology. We employ a uniform AS numbering scheme that is reused across different data center fabrics, simplifying ASN management across data centers.”

“To support the growing scale and evolving routing requirements, our switch-level BGP agent needs periodic updates to add new features, optimization, and bug fixes. To optimize this process (i.e., to ensure fast, frequent changes to the network infrastructure to support good route processing performance), we implemented an in-house BGP agent.”

https://engineering.fb.com/2021/05/13/data-center-engineering/bgp/

1&1~=Umm October 5, 2021 8:06 AM

@ Petre Peter

There are no impeccable records anymore.

There never were, nor can there ever really be.

@ ALL

What did Facebook do wrong?

Well the list is long, and some of it is caused by trying to do the impossible (be 100% secure).

As a result they lost their way in complexity that their security architects did not understand sufficiently.

There is an old joke about,

‘The way to remove a single point of failure, is to form a committee and have each bring their own point of failure with them’.

An imperfect solution to getting towards 100% availability / reliability / security / etc. is to avoid "chains" and use "nets". Chains fail at the weakest link no matter how strong the other links. Nets, on the other hand, generally only fail partially and in little ways.

When it comes to communications those who need high availability / reliability / security / etc have a basic way to cover it, and it is called PACE for,

P, Primary
A, Alternate
C, Contingency
E, Emergency

The problem though is that PACE can go horribly wrong due to the likes of hidden dependencies, or worse, an "out of plan" fallback during training / testing, like mobile phones, that creates a "dependency" in people's minds [1].

That is if every time something goes wrong with the plan during training / testing you pick up the mobile phone, what are you going to do when the mobile phone network goes down as well?

Remote or distance working is incredibly fragile; it's something that people have been ignoring during COVID lockdowns. It is entirely dependent on two things,

1, Communications, which fail
2, Planning in detail

The first is going to happen in an emergency for a whole host of unavoidable reasons.

Which is why the second is really the only thing that will get your objectives met.

Things change all the time; many are beyond your control, but not your mitigation, provided you are aware of them. It's why the likes of the military run very realistic war games, from pushing small pieces of paper around a table top through to full-blown boots-on-the-ground deployments.

If you do not

1, Observe,
2, Analyze,
3, Plan,
4, Train,
5, Test,
6, Reformulate,

As a repeated cycle, you are actually 'Planning to fail'.

[1] http://www.satelliteevolutiongroup.com/GMC/articles/Pace-March21.pdf

Freezing_in_Brazil October 5, 2021 9:37 AM

@ SpaceLifeForm

Techs have to go to their core datacenter to restart.

They scrambled jet planes left and right to get teams together. Separation of duties gone too far?

Anonymous October 5, 2021 10:53 AM

To those joking about "nothing of value was lost": keep in mind that aside from Facebook proper, WhatsApp being down is a serious infrastructure failure in many countries across the world. Many businesses use it as their primary means of communication, and in some places even emergency services rely on it.

Bear October 5, 2021 11:34 AM

I’m local to the Bay Area, and from a social circle that includes a few Facebook employees I’ve heard that with keycards not working and unable to get through to administer their machines remotely, they were forced to use an angle grinder to cut open a door and gain access to the machines that needed to be reconfigured.

This is an example of a type of mistake that happens fairly often in safety and security planning: the plan relies on systems that will be impaired in the very event the plan is meant to handle. The same thing has happened in other ways elsewhere.

Fukushima for example had the emergency pumps that were supposed to protect it from flooding, located below sea level…. like the pumps in New York when Sandy hit and the pumps in New Orleans when Katrina hit. Emergency evacuation plans in New Orleans and Galveston relied on buses that were kept in a low-lying lot that was among the first places flooded. Emergency communications in a lot of places relies on power that is likely to be out in an emergency. Or on internet connectivity that won’t be connected when the microwave towers lose their power or their wired connections. In more places than I want to count hospitals have failed because their emergency generators were either in the basements where they got flooded by the same storm that cuts off power, or exposed on the roof where wear and exposure accumulated until they couldn’t start.

In cases like Facebook, it’s comical; they have to go get an angle grinder and everybody has a laugh. It’s not like Facebook is doing anything important. But in a lot of places this failure to understand what ’emergency planning’ means about the infrastructure that the plan has available to rely on gets deadly serious.

1&1~=Umm October 5, 2021 12:13 PM

@ Bear

Fukushima for example had the emergency pumps that were supposed to protect it from flooding, located below sea level…. like the pumps in New York when Sandy hit and the pumps in New Orleans when Katrina hit.

Pumps are almost always a bad example to pick.

Because although they can pump water up quite high, they cannot suck water up more than a few meters (the vacuum height of water is just over 30 ft, depending on where you are). So if the normal tide height range is, say, six meters, then the pump is going to get wet from time to time.

That is, you cannot put a pump very far above your bottom sump level.

It's why many pumps are designed to be submersible, or are cased into watertight compartments.
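
For reference, that "30-odd feet" figure falls straight out of atmospheric pressure; a back-of-the-envelope check, ignoring vapour pressure and friction losses (which reduce it further in practice):

    \[
    h_{\max} = \frac{P_{\text{atm}}}{\rho g}
             = \frac{101\,325\ \text{Pa}}{1000\ \text{kg/m}^3 \times 9.81\ \text{m/s}^2}
             \approx 10.3\ \text{m} \approx 34\ \text{ft}
    \]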

MarkH October 5, 2021 12:23 PM

@Peter A:

A great “data panic” story!

We can do in a few seconds, what may need an enormous labor to undo — if it can indeed be reversed.

I’ve been there. It’s the magic of InfoTech.

MK October 5, 2021 1:19 PM

Fukushima’s problem wasn’t with the pumps. It was with the generators, which were placed at ground level to protect them from earthquakes (but not tsunamis).

In the 1989 California quake, one radio/TV station had their generators on the roof, which was OK, except that the pumps to move fuel from the basement to the roof were not on the emergency power bus.

Sometimes the first test of a system is when it is actually needed.

Aaron October 5, 2021 1:21 PM

@Count0
My thoughts exactly!

High tech gadgets are fun and fancy but low tech gadgets work and stay working for a reason; dumb tech can, when it’s most important, get the job done reliably and without external reliance.

If your badging/security system is so integrated that the loss of an external internet connection takes it offline, you've got serious issues, not only from a physical security stance but, as witnessed, from a fail-safe standpoint, in which the thing designed to protect/secure your infrastructure has now become a detriment to your operations. I'm wondering if their physical security personnel even have physical (metal) keys, or if their doors have them as a fail-safe.

On further thought, IF an external network connection downed their badging system this means there is potentially a path from outside FB networks into their badging system. Is there security for their security?

Jon October 5, 2021 3:08 PM

@ Aaron

I’m wondering if their physical security personnel even have physical (metal) keys or if their doors have them as a fail-safe.

Of course, that then introduces another weakness – the lock can be picked, or someone can get a key they shouldn’t have, and avoid the electronic system.

Of course, there’s ways around that too, but they get expensive rather quickly. And adding yet more complexity adds more attack surface. Hmmm… J.

1&1~=Umm October 5, 2021 4:37 PM

@ Aaron

… dumb tech can, when it’s most important, get the job done reliably and without external reliance.

And you’ve missed the real reason “dumb tech” works.

1, It is basically simple.
2, Its simplicity allows most to comprehend its strengths and weaknesses fully.

That second point in effect means most can design a system around them that is not full of "gotchas" hidden by complexity that few if any can understand.

There is a sort of "rule of thumb" about writing "safety system" code, which is,

"All functions fit on a single sheet of printout".

It generally keeps complexity manageable.

Other rules of thumb are,

“No more than three levels of code”

“No code causes exits except by correct design”

"All APIs are human readable"

“ALL interfaces properly deal with errors and exceptions and can unwind them back as far as needed”.

“All errors and exceptions should be regarded as ‘hostile input’ by default”.

These also tend to keep not just complexity down but reliability up.

MarkH October 5, 2021 6:10 PM

@Umm,

re: Dumb Tech, an embedded systems programmer colleague used to hold up quotidian objects (a pencil, for example), and say, “you see this? It just works. Do you know why? It doesn’t contain a single line of code.”

The other running joke was that if anybody came asking for help, before they had a chance to explain we’d ask “does this problem involve software?” If the answer was yes, we’d then ask, “is any of it Microsoft software?”

You can probably guess our response when both answers were affirmative.

re: guidelines — that’s a succinct list of most worthy policies. Much anguish could be spared, if more engineers adopted them.

The single-sheet rule reminds me of Winston Churchill’s rebellion against the wordiness of his bureaucratic underlings. He would ask for reports “on one side of a sheet of paper”; sometimes he would specify a half-sheet.

Clive Robinson October 6, 2021 3:06 AM

@ MarkH,

used to hold up quotidian objects

Quotidian is not a word I tend to use, as it is both a noun and an adjective, and thus on its own can be ambiguous (something "English English" is famous for).

In times past you might use the word to talk of an illness. So “fever” and “quotidian” were synonymous, also “malaria” is a specific “quotidian”.

But more recent usage is to make “quotidian” synonymous with the likes of “episodic”, “every day” and even “banal”. So it can be an “orange orange”.

Which means, “malaria” can be replaced with “episodic fever”, which in turn can and has been a,

“quotidian quotidian”

But… like “buffalo buffalo buffalo” you can replace “double malaria” with “twice daily fever” so replace that with,

"quotidian quotidian quotidian"

And yes, you could go further… but let's not, as I'm sitting waiting to see a consultant currently about what has become for me a quotidian issue of the medical variety that can give rise to a serious quotidian. Which, although normally minor in comparison to other non-quotidian issues I have, could nevertheless quite easily kill me (and very nearly did, not so long ago, bacterial sepsis being what it is).

Sut Vachz October 6, 2021 4:51 AM

@ SpaceLifeForm

Everybody wants…

So … you mean … nobody in the video is using FB ? Especially that guy in the green ‘Healey. He is a passenger. As twilight descends as he drives through the desert, he is rambling and listening on the car radio to the music of the spheres appropriate to him

https: //www.youtube.com/watch?v=3tOeITAtyxA

Winter October 6, 2021 5:07 AM

Facebook admits the human error is higher up the command chain than the poor guy who gave the wrong command.

Facebook rendered spineless by buggy audit code that missed catastrophic network config error
Explains mega-outage with boilerplate response: We try hard, we’re sorry we failed, we’ll try to do better

So when the bad change hit Facebook’s backbone, and all the data centers disconnected, all of Facebook’s small bit barns declared themselves crocked and withdrew their BGP advertisements. So even though Facebook’s DNS servers were up, they couldn’t be reached by the outside world. Plus, the back-end systems were inaccessible due to the dead backbone, anyway. Failure upon failure.

ht tps://www.theregister.com/2021/10/06/facebook_outage_explained_in_detail/

Security Sam October 6, 2021 8:50 AM

It seems that Mark has missed the mark
And Zuckerberg hit the proverbial iceberg
Since only the ten percent is above board
The sudden crash threw them overboard.
And all the Monday morning quarterbacks
Attacked the man as if they were sharks.

jdgalt1 October 6, 2021 11:44 AM

This outage was the subject of much discussion and glee on Gab.

Two other Facebook related stories happened in the past week, and either or both may or may not be connected to the outage, the employee lockout, or both.

(1) The so-called whistleblower, interviewed on CBS’ “60 Minutes” and in front of Congress, who wants Facebook to block more material and expel more users.

(2) Someone posted to a hacker forum an offer to sell the personal data of all 1.5 billion Facebook users.

Did the whistleblower take down Facebook to dramatize her story? Did the hacker do it to prove his inside access? Or could the outage be the work of some dissident faction within Facebook that wants to hurt the company?

Pass the popcorn.

lurker October 6, 2021 1:41 PM

@jdgalt1
Hanlon’s Razor might apply here: FB simply demonstrating their inability to understand what they are doing…

SpaceLifeForm October 6, 2021 6:51 PM

@ MartinZ

Good link.

But it does not explain the incestuous co-dependency that FB created between BGP and DNS.

DNS servers do not have to be routers. While they certainly can be, and many SOHO routers actually do combine the two, they should not be making dynamic routing decisions based upon a test failure. That is not normal DNS server functionality. They should let events fail over as designed.

What FB is doing is flat-out poor design. And to excuse this screwup by pointing at an audit tool with a bug is nonsense.

My BOLD.

hxtps://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
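
In outline, that is a health check driving route advertisement. Below is a hypothetical sketch of such a control loop; announce_prefix and withdraw_prefix are stand-ins for whatever Facebook's in-house BGP agent actually exposes, and the health check is grossly simplified.

    import socket
    import time

    DNS_ANYCAST_PREFIX = "198.51.100.0/24"    # placeholder for an anycast DNS prefix
    DATACENTER_PROBES = [("10.0.0.1", 53), ("10.0.1.1", 53)]   # placeholder targets

    def datacenters_reachable():
        """Crude health check: can this edge node still reach the data centers?"""
        for host, port in DATACENTER_PROBES:
            try:
                socket.create_connection((host, port), timeout=2).close()
                return True
            except OSError:
                continue
        return False

    def control_loop(announce_prefix, withdraw_prefix):
        advertised = True
        while True:
            healthy = datacenters_reachable()
            if healthy and not advertised:
                announce_prefix(DNS_ANYCAST_PREFIX)   # come back once the backbone returns
                advertised = True
            elif not healthy and advertised:
                # This is the step that took all of the DNS servers off the internet:
                # when the backbone died, every edge node concluded it was unhealthy.
                withdraw_prefix(DNS_ANYCAST_PREFIX)
                advertised = False
            time.sleep(30)

The failure mode is then obvious: a check meant to catch one sick edge node fires everywhere at once when the shared backbone is the thing that is sick.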

Jesse Thompson October 6, 2021 7:03 PM

I saw that Facebook had an outage on a headline somewhere yesterday before I saw this blog post. Those represent 2 out of the 2 reasons I even realized that such an event had occurred.

In contrast, if a power transformer blows on a rural pole 70 miles north of me, or in some places on the other side of the Cascade range, our telephone status message updates to warn about the problem, and I get an alert with a map showing me the geographic boundaries of the affected area, on average half an hour before the actual power company can even update their phone message or website that anything is amiss.

Why? Because I wrote a script one afternoon capable of noticing that four or more of our mostly residential internet clients fell offline within 60 seconds of one another, on more than two transmitters simultaneously, while four or more others remained online. So it draws up a Voronoi diagram of all impacted customers and fires me a KML.
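
A hypothetical reconstruction of that detection rule (the names and thresholds here are mine, not the actual script): flag a cluster when four or more clients across more than two transmitters drop within the same 60-second window while enough others stay online.

    from collections import namedtuple

    OfflineEvent = namedtuple("OfflineEvent", "timestamp client_id transmitter")

    def detect_outage(events, still_online_count, window=60,
                      min_clients=4, min_transmitters=3):
        """Return the cluster of drops that looks like an area outage, else None."""
        events = sorted(events, key=lambda e: e.timestamp)
        for i, first in enumerate(events):
            cluster = [e for e in events[i:] if e.timestamp - first.timestamp <= window]
            transmitters = {e.transmitter for e in cluster}
            if (len(cluster) >= min_clients
                    and len(transmitters) >= min_transmitters
                    and still_online_count >= min_clients):
                return cluster    # feed these client locations to the Voronoi/KML step
        return None

    drops = [OfflineEvent(1000 + i, f"cust{i}", f"tx{i % 3}") for i in range(5)]
    print(detect_outage(drops, still_online_count=40) is not None)   # True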

Facebook being down makes about as much difference to me as qq.com being down might to you. That doesn't make me a saint, because both my organization and I would notice if Google were down, but I do not believe that would slow down our business operations by very much either. We do our best not to rely on cloud services for anything we can't afford to live without, despite that always being a much more expensive path to travel. :/

SpaceLifeForm October 6, 2021 8:15 PM

@ MartinZ

Two other issues were revealed.

First, there should have been someone with experience onsite at the critical datacenter. Even if they did not know exactly what to do to recover, they should have been experienced enough for someone to explain it to them over the phone.

But, more importantly, there should have been an automatic rollback after 30 minutes unless confirmed by an experienced person.

I suspect that the design is so messed up that timeout-based automatic rollback was never considered. Because it is a HARD PROBLEM when the data is distributed. BGP is not ACID.

But, it is doable. One just needs to consider the scenarios.

30 minute automatic rollback? Or 5.5 hours of global downtime?

This did not age well. My BOLD. Sometimes, slow is safer. BGP is NOT an App.

hxtps://www.usenix.org/conference/nsdi21/presentation/abhashkumar

In this paper, we present Facebook’s BGP-based data center routing design and how it marries data center’s stringent requirements with BGP’s functionality. We present the design’s significant artifacts, including the BGP Autonomous System Number (ASN) allocation, route summarization, and our sophisticated BGP policy set. We demonstrate how this design provides us with flexible control over routing and keeps the network reliable. We also describe our in-house BGP software implementation, and its testing and deployment pipelines. These allow us to treat BGP like any other software component, enabling fast incremental updates. Finally, we share our operational experience in running BGP and specifically shed light on critical incidents over two years across our data center fleet. We describe how those influenced our current and ongoing routing design and operation.
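
The timed-rollback idea above is essentially the commit-confirmed pattern that some router operating systems already offer for a single box. A toy sketch of its shape; apply_config, rollback_config, and operator_confirmed are placeholders, not any real deployment pipeline.

    import threading

    def commit_confirmed(apply_config, rollback_config, operator_confirmed,
                         timeout_minutes=30):
        """Apply a change, then revert automatically unless a human confirms in time."""
        apply_config()                        # push the new (possibly bad) config

        timer = threading.Timer(timeout_minutes * 60, rollback_config)
        timer.start()                         # the dead-man's switch starts ticking

        if operator_confirmed(timeout_minutes):
            timer.cancel()                    # change confirmed; keep it
            return True
        # No confirmation arrived: the timer fires and the old config comes back
        # on its own, instead of hours of global downtime.
        return False

Doing the same for distributed BGP state is, as noted above, a much harder problem, but the safeguard has the same shape.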

SpaceLifeForm October 6, 2021 10:10 PM

@ ALL

Sumptin, sumptin FB. Note timing

hxtps://taskandpurpose.com/news/uss-kidd-facebook-account-hacked-age-of-empires/

For the last several days, someone has been having a lot of fun playing the classic 1997 strategy game “Age of Empires.” Normally, that wouldn’t be news (the game is freaking fantastic) but in this case someone has been livestreaming their game sessions on the official Facebook account for the USS Kidd, and the U.S. Navy still hasn’t regained control of their account.

“The official Facebook page for USS Kidd (DDG 100) was hacked,” said Cmdr. Nicole Schwegman, a Navy spokesperson. “We are currently working with Facebook technical support to resolve the issue.”

[Good luck]

1&1~=Umm October 7, 2021 12:21 AM

@ MarkH

If the answer was yes, we’d then ask, “is any of it Microsoft software?”

I’ve been looking but failing to find the original source of,

“If the answer is Microsoft, you are asking the wrong question”

It appears to go back at least into the MS Windows 3.1 era and MS-DOS 3.3, which pre-dates MS-DOS 4. That release was somewhat of a 'lead punt' that 'sank without trace' very shortly after launch, which caused an upsurge in the use of the truism / saw.

More recently perhaps not surprisingly you now see the same truism but with Microsoft replaced with Zoom and just recently Facebook…

But… the search was not entirely wasted: it did turn up "The Plan", a lament that was once biblical in its engineering usage,

http://lj.rossia.org/users/lolepezy/133621.html

SpaceLifeForm October 7, 2021 1:32 AM

There is a crappy beer that has a Three Letter Acronym.

I do not believe that it is normally available in Silicon Valley.

But, it possibly can be delivered if circumstance warrants.

Richard Burris October 7, 2021 5:13 PM

Really though, the only unfortunate thing about this is that the downtime is temporary.

SpaceLifeForm October 7, 2021 5:58 PM

“Facebook basically locked its keys in the car.”

With the engine running.

But, they failed to notice that the windows were down and someone could reach inside.

SpaceLifeForm October 8, 2021 8:54 PM

FB having problems again.

Should not surprise.

hxtps://twitter.com/Facebook/status/1446585486605160448

SpaceLifeForm October 9, 2021 2:46 AM

Event timing

Is it really random?

If you observe what appears to be the same event signature at different times, were they different events?

I do appreciate that Twitter does allow Facebook to have an account. This way, when Facebook has problems, they can use Twitter to communicate.

hxtps://twitter.com/sudambandara/status/1446560653880164352/photo/1

Clive Robinson October 9, 2021 3:46 AM

@ SpaceLifeForm,

FB having problems again.

That’s why they call it “Yo-Yo Mode”. Especially when Yo-Yo has more than one meaning 😉

I guess as with every “quake” the obvious questions are,

1, Will there be aftershocks?
2, How many will there be?

To which the answers are in this case, “Yes” and “The clock’s running”…
