
We're back! This is kinda fun, right? We tell you the stuff we've been working on, you respond if you need help, yada yada. Good thing we had the totally original thought of doing a newsletter; we really like talking to you.
Okay, here's what's up with us lately:
Product Updates
There's a lot happening on our Fresh Produce site.
The biggest update is: we're building a GitHub Launch UI! Right now, it supports NodeJS, Phoenix, Laravel, Django, Python, Golang, and repos with an existing Dockerfile, and we're working on adding more as fast as we (reasonably) can. If you want to use it right now, head here.
Some other updates for you:
- We have a new "Unified Billing" feature, which lets you get one invoice for all your apps across your main and linked organizations. We're hoping this helps streamline things, so you're not juggling multiple invoices.
- You can now suspend and destroy Machines from the dashboard. It's been possible to start and stop them there for a while; now you have even more powers.
- We send you emails when deploys fail; now, we're also including the reason the deploy failed, so you can spend less time figuring that out and move on faster.
Inside Fly-Ball
If you caught our billing update last time, you should know: we're still focused on money stuff. This month, that means figuring out how to use the resources we already have in order to give you a better Fly.io experience. For example:
- We've been in this game for a while, so we've experienced the slow-drip anxiety of public cloud metered billing. Since we're allowed to experiment here (as long as Kurt doesn't wake up and find out), we're trying Accident Forgiveness, which is exactly what it sounds like. If you're a paid Support Customer, and you see something unexpected on your bill, you just…tell us. And we'll take it off. Your deployment mistakes don't really cost us anything, so we don't want you to freak out about them.
- We just lowered the prices on NVIDIA L40s GPUs to $1.25 per hour. Folks are running through the A10s in our lineup, so we want to spread the love and let the L40s get in on the action. (Also: we asked our team "what are people even gonna do with GPUs?" and they got excited and made a whole bunch of videos about it. Turns out the answer is: you can add your own models to Ollama, run it locally in 30 seconds, stream responses from an LLM, and do a zillion other things that we won't make you read about here, but you can see on our YouTube channel if you're in the mood.)
- We're rolling out Machine Reservations, a new way to save 40% on compute costs when you buy blocks of compute time in advance (more details in the original announcement here). You can mix and match these blocks for Performance or Shared Machines, tailoring your savings to your actual usage patterns. Also: reservations are backdated to the start of the month. So if you decide mid-month that reservations make sense for your workload, you'll still get the full benefit. Ready to try it out? Head to the Machine Reservations section on your billing dashboard to get started.
Partner News
- Managed MySQL is coming! We're talking secure, fully managed, production-grade MySQL, built on Percona and managed by us. The waitlist is here if you wanna get on it.
- You can now use Depot to build images for your apps on Fly.io. It's pretty freaking fast, but we also dig Depot because it saves your build cache up to a certain size, then starts evicting the least-used bits, so builds stay cached for much longer. You can get started by running fly deploy --depot.
Feature Story
For this month's newsletter, something a little different, with apologies to anyone who reads the Fly.io Infra Log (you should! it's great! and only sometimes terrifying!), because this ran there first. Fly.io is today pleased to bring you the story of the petertron4000.
One of the features we offer to applications hosted on Fly.io is a dense set of Prometheus performance metrics, along with Grafana dashboards to render them. Not enough of our customers have dug into these; they're really powerful: you can set alerts with them, forward them elsewhere, and, like the customer we're about to talk about, use them to hold our feet to the fire.
One of the things our users can monitor with these metrics is TLS handshake latency. That is: when browsers make connections to our Anycast edge and negotiate TLS, we track the time it takes to do that handshake (the negotiation, the certificate, the key exchange, etc.). Why do we expose TLS latency metrics? We don't know. It was easy to do. Our metrics are very powerful, and we take little credit for that, because Prometheus, Vector, and Grafana are doing all the lifting.
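If you've never set an alert on metrics like these, the mechanics are simple: poll a Prometheus-compatible query API and complain when a latency quantile spikes. Here's a minimal sketch of that idea; the endpoint, metric name, and threshold are placeholders we made up for illustration, not the actual series we expose, and a Grafana alert rule gets you the same thing with less code.

```python
# Minimal sketch: watch a (hypothetical) TLS-handshake-latency histogram and
# complain when the p99 spikes. Endpoint, metric name, and threshold are
# assumptions -- substitute whatever your own metrics setup exposes.
import os
import time

import requests

PROM_URL = os.environ.get("PROM_URL", "http://localhost:9090")  # assumed endpoint
QUERY = (  # p99 handshake time over the last 5 minutes (hypothetical metric)
    "histogram_quantile(0.99, "
    "sum(rate(tls_handshake_duration_seconds_bucket[5m])) by (le))"
)
THRESHOLD_S = 0.5

while True:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY})
    result = resp.json()["data"]["result"]
    if result:
        p99 = float(result[0]["value"][1])
        if p99 > THRESHOLD_S:
            print(f"p99 TLS handshake is {p99:.3f}s -- that's a spike")
    time.sleep(60)
```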
Anyways, a consequence of exposing TLS latency metrics is that customers can, if they’re watchful, notice spikes in TLS latency. And so someone did. And they noticed something weird: a pattern of slow TLS handshakes. And they set an alarm on it. And it kept going off. And they reported it to us.
Slow TLS handshakes across our Anycast network are something we need to take seriously. So this set off several weeks of engineering effort to isolate the problem. Two things complicated the analysis.
First: looking at all of our own metrics, we could spot slower-than-usual latencies. But they weren't extreme, not like what the customer was seeing.
Next, running packet-level analytics, we weren't able to spot any intra-handshake latency coming "from us"; there was never an instance where a browser sent us a TLS message and we sat around for a couple hundred milliseconds twiddling our thumbs before sending a response. You send us a TCP SYN? Pow! We get you a SYN+ACK. You send us a TLS ClientHello? Blammo! A ServerHello.
What we did notice were specific clients that appeared to have large penalties on “their side” inside the TLS handshake: a delay of up to half a second between the TCP ACK for our ServerHello and the ChangeCipherSpec message that would move the handshake forward. In other words: we would send them our response, and those clients would (maybe) sit around twiddling their thumbs figuring out what to do next.
Reverse DNS for all these events traced back to AWS.
This was interesting evidence of something, but it didn't settle the question of whether there were events causing our own Anycast proxies to lag.
To better catch and distinguish these cases, what we needed was a more fine-grained breakdown of the handshake: packet-level analytics of every slow TLS handshake happening across our fleet. Problem: we do a lot of TLS handshakes. By the time metrics notice a handshake is slow, it's too late to start capturing packets, so you need to be capturing all the time.
A global tcpdump would produce a huge pcap file on each of our edges within seconds, and even if we could do that, it would be painful to filter out connections with slow TLS handshakes from those files.
Instead, Peter ended up:
- making fly-proxy log all slow handshakes; and
- writing a tool called petertron4000 (real infra-log heads may remember its progenitor, petertron3000) to temporarily save all packets in a fixed-size ring buffer, ingest the slow-handshake log events from systemd-journald, and
- write durable logs of the flows corresponding to those slow handshakes.
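For flavor, here's a minimal sketch of that ring-buffer idea in Python with scapy. To be clear: this is not petertron4000. The event format, the stdin feed standing in for systemd-journald, and the output filenames are all assumptions we're making for illustration.

```python
# Sketch of the ring-buffer approach, not the real tool: keep recent packets
# in memory, and when a slow-handshake event arrives, write out just that flow.
import json
import sys
import threading
from collections import deque

from scapy.all import IP, TCP, sniff, wrpcap

RING_SIZE = 200_000            # most-recent packets kept in memory
ring = deque(maxlen=RING_SIZE)


def capture():
    # Append every TLS-port packet to the ring; store=False keeps scapy from
    # hoarding its own copy of everything it sees.
    sniff(filter="tcp port 443", prn=ring.append, store=False)


def dump_flow(event):
    # Assumed event shape: {"client_ip": "...", "client_port": 12345},
    # i.e. whatever the proxy logs for a slow handshake.
    ip, port = event["client_ip"], event["client_port"]
    flow = [
        pkt for pkt in list(ring)
        if IP in pkt and TCP in pkt
        and ((pkt[IP].src == ip and pkt[TCP].sport == port)
             or (pkt[IP].dst == ip and pkt[TCP].dport == port))
    ]
    if flow:
        wrpcap(f"slow-handshake-{ip}-{port}.pcap", flow)


if __name__ == "__main__":
    threading.Thread(target=capture, daemon=True).start()
    # Stand-in for ingesting slow-handshake events from systemd-journald:
    # here we just read JSON lines from stdin.
    for line in sys.stdin:
        dump_flow(json.loads(line))
```

The ring buffer is the whole trick: you're always capturing, but you only ever keep a bounded window of packets, and you only persist the ones that turn out to matter.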
With petertron4000 in hand, Peter was able to spot-check a random selection of edges. The outcome:
- Several latency issues we were able to trace to bad routing upstream.
- Recurrences of the issue described above, where clients are lagging, on “their side”, in the middle of the handshake, overwhelmingly sourced from AWS.
Peter was able to reproduce the problem directly on EC2: running 8 concurrent curl requests on a t2.medium reliably triggers it. Accompanying the TLS lag: high CPU usage on the instance. Further corroborating it: we can get the same thing to happen curl-ing Google.
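If you want to poke at this yourself, here's a rough little repro harness in the same spirit: fire a batch of concurrent curl requests and report how long each TLS handshake took, using curl's time_connect and time_appconnect timers. The URL and concurrency here are assumptions, not the exact repro; the original was 8 curls on a busy t2.medium.

```python
# Rough repro harness: run concurrent curls and print per-request TLS
# handshake time (time_appconnect - time_connect). URL and concurrency are
# assumptions; run it on an overloaded box to see the lag.
import subprocess
from concurrent.futures import ThreadPoolExecutor

URL = "https://www.google.com/"
CONCURRENCY = 8

CURL = [
    "curl", "-s", "-o", "/dev/null",
    # time_connect: TCP connect done; time_appconnect: TLS handshake done
    "-w", "%{time_connect} %{time_appconnect}\n",
    URL,
]


def one_request(_):
    out = subprocess.run(CURL, capture_output=True, text=True).stdout
    tcp_done, tls_done = (float(x) for x in out.split())
    return tls_done - tcp_done  # seconds spent inside the TLS handshake


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for handshake in pool.map(one_request, range(CONCURRENCY * 4)):
        print(f"TLS handshake took {handshake * 1000:.0f} ms")
```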
This was an investigation set in motion by an ugly-looking graph in the metrics we expose, and not by any “organically observed” performance issues; a customer saw a spike and wanted to know what it was. We're pretty sure we know now: during the times of those spikes, they were getting requests from overloaded EC2 machines, which slow-played their TLS handshakes and made our metrics look gross. :shaking-fist-emoji:.
Having said all that: dragnetting slow TLS handshakes uncovered a bunch of routing issues, and we're on a quest this summer to shake all of those out. Peter's new quest: how do you roll the petertron4000 (call it the petertron4500, or the petertron4000DX2) so that it can run permanently and generate alerts? Fun engineering problem! We have a bunch of ideas, and whatever happens will land on the infra-log. Tune in!
Also, check out Fly Metrics. They're great.
Stay classy, devs.
— The Fly.io Team