Episode #3: the story of the petertron4000
News, tips, and behind-the-scenes technical mumbo jumbo

We're back! This is kinda fun, right? We tell you the stuff we've been working on, you respond if you need help, yada yada. Good thing we had the totally original thought of doing a newsletter; we really like talking to you.

Okay, here's what's up with us lately:

Product Updates

There's a lot happening on our Fresh Produce site.

The biggest update: we're building a GitHub Launch UI! Right now it supports NodeJS, Phoenix, Laravel, Django, Python, Golang, and repos with an existing Dockerfile, and we're adding more as fast as we (reasonably) can. If you want to use it right now, head here.

Some other updates for you:

Inside Fly-Ball

If you caught our billing update last time, you should know: we're still focused on money stuff. This month, that means figuring out how to use the resources we already have to give you a better Fly.io experience. For example:

Partner News

Feature Story

For this month's newsletter, something a little different, with apologies to anyone who reads the Fly.io Infra Log (you should! it's great! and only sometimes terrifying!), because this ran there first. Fly.io is today pleased to bring you the story of the petertron4000.

One of the features we offer to applications hosted on Fly.io is a dense set of Prometheus performance metrics, along with Grafana dashboards to render them. Not enough of our customers have dug into these. They're really powerful: you can set alerts on them, forward them elsewhere, and, like the customer we're about to talk about, use them to hold our feet to the fire.

One of the things our users can monitor with these metrics is TLS handshake latency. That is: when browsers connect to our Anycast edge and negotiate TLS, we track how long that handshake takes (the negotiation, the certificate exchange, the key exchange, and so on). Why do we expose TLS latency metrics? We don't know. It was easy to do. Our metrics are very powerful, and we take little credit for that, because Prometheus, Vector, and Grafana are doing all the lifting.

Anyways, a consequence of exposing TLS latency metrics is that customers can, if they’re watchful, notice spikes in TLS latency. And so someone did. And they noticed something weird: a pattern of slow TLS handshakes. And they set an alarm on it. And it kept going off. And they reported it to us.
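
If you want to do the same kind of thing, here's a rough sketch of that alarm in Python. The Prometheus endpoint URL and the metric name are placeholders we made up for illustration, not Fly.io's actual names; check your own dashboards for the real labels.

```python
# Minimal sketch: poll a Prometheus endpoint for p99 TLS handshake latency and
# complain when it spikes. PROM_URL and the metric name are hypothetical.
import requests

PROM_URL = "https://prometheus.example.internal/api/v1/query"  # placeholder
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(tls_handshake_duration_seconds_bucket[5m])) by (le))"
)
THRESHOLD_SECONDS = 0.5  # "half a second" is the ballpark in this story

def p99_handshake_latency() -> float:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    latency = p99_handshake_latency()
    if latency > THRESHOLD_SECONDS:
        print(f"p99 TLS handshake latency is {latency:.3f}s -- go yell at your host")
    else:
        print(f"p99 TLS handshake latency is {latency:.3f}s -- all quiet")
```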

Slow TLS handshakes across our Anycast network are something we need to take seriously. So this set off several weeks of engineering effort to isolate the problem. Two things complicated the analysis.

First: looking at all of our own metrics, we could spot slower-than-usual latencies. But they weren't extreme, not like what the customer was seeing.

Next, running packet-level analytics, we weren't able to spot any intra-handshake latency coming "from us"; there was never an instance where a browser sent us a TLS message, we sat around for a couple hundred milliseconds twiddling our thumbs, and only then sent back a response. You send us a TCP SYN? Pow! We get you a SYN+ACK. You send us a TLS ClientHello? Blammo! A ServerHello.

What we did notice were specific clients that appeared to have large penalties on “their side” inside the TLS handshake; a delay of up to half a second between the TCP ACK for our ServerHello and the ChangeCipherSpec message that would move the handshake forward. In other words: we would send them our response, and those clients would (maybe) sit around twiddling their thumbs figuring out what to do next.
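
To make that concrete, here's a rough, hypothetical sketch of that per-flow measurement. It's not our tooling; it just walks a pcap with the dpkt library (our choice for the sketch) and reports how long each client took to answer our handshake flight.

```python
# Rough approximation of the measurement described above: for each TCP flow,
# note when the server's first TLS handshake record (the ServerHello flight)
# went out, then when the client's ChangeCipherSpec came back, and print the gap.
# Assumes Ethernet/IPv4 captures and TLS on port 443; real traffic needs more
# care (IPv6, retransmits, reassembly across segments, TLS 1.3 quirks).
import sys
import dpkt

SERVER_PORT = 443
TLS_HANDSHAKE, TLS_CHANGE_CIPHER_SPEC = 0x16, 0x14

def main(pcap_path):
    server_hello_at, reported = {}, set()
    with open(pcap_path, "rb") as f:
        for ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if not isinstance(ip, dpkt.ip.IP) or not isinstance(ip.data, dpkt.tcp.TCP):
                continue
            tcp = ip.data
            if not tcp.data:
                continue
            record_type = tcp.data[0]
            if tcp.sport == SERVER_PORT and record_type == TLS_HANDSHAKE:
                # server -> client: remember when our handshake flight left,
                # keyed by (client_ip, client_port, server_ip, server_port)
                key = (ip.dst, tcp.dport, ip.src, tcp.sport)
                server_hello_at.setdefault(key, ts)
            elif tcp.dport == SERVER_PORT and record_type == TLS_CHANGE_CIPHER_SPEC:
                key = (ip.src, tcp.sport, ip.dst, tcp.dport)
                if key in server_hello_at and key not in reported:
                    reported.add(key)
                    print(f"{key}: client took {ts - server_hello_at[key]:.3f}s to answer")

if __name__ == "__main__":
    main(sys.argv[1])
```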

Reverse DNS for all these events traced back to AWS.

This was interesting evidence of something, but it didn't settle the question of whether there were also events causing our own Anycast proxies to lag.

To better catch and distinguish these cases, we needed a more fine-grained breakdown of the handshake: packet-level analytics of every slow TLS handshake happening across our fleet. Problem: we do a lot of TLS handshakes. By the time metrics notice a handshake is slow, it's too late to start capturing packets, so you need to be capturing all the time.

A global tcpdump would produce a huge pcap file on each of our edges within seconds, and even if we could do that, it would be painful to pick the connections with slow TLS handshakes out of those files.

Instead, Peter ended up:

  1. making fly-proxy log all slow handshakes; and
  2. writing a tool called petertron4000 (real infra-log heads may remember its progenitor, petertron3000) that temporarily saves all packets in a fixed-size ring buffer, ingests the slow-handshake log events from systemd-journald, and writes durable logs of the flows corresponding to those slow handshakes. A toy sketch of that idea follows.
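
Here's the shape of that idea in a few lines of Python. This is emphatically not petertron4000's source; it's a toy sketch with made-up names, and it skips the journald plumbing and the actual packet capture entirely.

```python
# Toy sketch: keep every packet in a fixed-size in-memory ring buffer, and when
# a "slow handshake" event arrives (in the real tool, from systemd-journald;
# here, just a function call with a flow 4-tuple), dump that flow's packets to
# a durable log. All names here are made up for illustration.
import time
from collections import deque

RING_CAPACITY = 500_000  # packets; old ones fall off the back automatically
ring = deque(maxlen=RING_CAPACITY)

def on_packet(flow, raw_bytes):
    """Called for every captured packet; flow is (src_ip, sport, dst_ip, dport)."""
    ring.append((time.time(), flow, raw_bytes))

def reversed_key(flow):
    src_ip, sport, dst_ip, dport = flow
    return (dst_ip, dport, src_ip, sport)

def on_slow_handshake(flow, out_path):
    """Called when the proxy logs a slow handshake for `flow`; persist its packets."""
    matching = [(ts, raw) for ts, f, raw in ring
                if f == flow or f == reversed_key(flow)]
    with open(out_path, "ab") as out:
        for ts, raw in matching:
            out.write(f"{ts:.6f} {len(raw)}\n".encode() + raw)
```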

With petertron4000 in hand, Peter was able to spot-check a random selection of edges. The outcome:

Peter was able to reproduce the problem directly on EC2: running 8 concurrent curl requests on a t2.medium reliably triggers it. Accompanying the TLS lag: high CPU usage on the instance. Further corroborating it: we can get the same thing to happen curl-ing Google.
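
If you want to poke at this yourself, here's a hedged little Python sketch, not the actual curl-based repro: it times just the TLS handshake from a pile of concurrent client threads, and the target hostname is a placeholder. On a busy, undersized instance, those timings stretch out even when the server answers promptly.

```python
# Time TLS handshakes from several concurrent client threads.
import socket
import ssl
import time
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, WORKERS, ROUNDS = "example.com", 443, 8, 20  # HOST is a placeholder

def timed_handshake(_):
    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        start = time.perf_counter()
        with ctx.wrap_socket(sock, server_hostname=HOST):  # handshake happens here
            return time.perf_counter() - start

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        timings = sorted(pool.map(timed_handshake, range(WORKERS * ROUNDS)))
        print(f"median {timings[len(timings)//2]*1000:.1f}ms, "
              f"worst {timings[-1]*1000:.1f}ms over {len(timings)} handshakes")
```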

This was an investigation set in motion by an ugly-looking graph in the metrics we expose, not by any "organically observed" performance issue; a customer saw a spike and wanted to know what it was. We're pretty sure we know now: during those spikes, they were getting requests from overloaded EC2 machines, which slow-played the TLS handshake and made our metrics look gross. :shaking-fist-emoji:

Having said all that: dragnetting slow TLS handshakes uncovered a bunch of routing issues, and we're on a quest this summer to shake all of those out. Peter's new quest: how do you roll the petertron4000 (call it the petertron4500, or the petertron4000DX2) so that it can run permanently and generate alerts? Fun engineering problem! We have a bunch of ideas, and whatever happens will land on the infra-log. Tune in!

Also, check out Fly Metrics. They're great.

Stay classy, devs.

— The Fly.io Team

