Scaling Rails WebSockets in Kubernetes with AnyCable
Learn how to scale real-time Rails applications backed by AnyCable using Kubernetes and Linkerd.
One of the things I love the most about Rails is how easy it makes implementing many features with very little setup on my part. One example is ActionCable for WebSockets. Before ActionCable, whenever I needed to build real-time features I had to resort to separate tools, or perhaps a dedicated hosted service if scale was an issue (for example, we have been using Sendbird for years at Brella).
ActionCable made everything so much easier in terms of feature implementation, but with a catch: it doesn't perform all that well with many concurrent clients. As soon as you have a few thousand clients active at the same time, latency goes beyond what can reasonably be called "real time".
This is where AnyCable comes in. It's a fantastic project by the awesome team at Evil Martians (FOSS, with optional paid support and features, so check those out) that brilliantly solves the scalability issues that make ActionCable less appealing for serious WebSocket setups.
The key component is a Go process that takes care of actually handling the WebSocket connections; since Go handles concurrency much better than Ruby, performance with many concurrently connected clients is much better. CPU and memory usage are also very low compared to what ActionCable requires with the same number of clients.
The Go process only handles the WebSocket connections and isn't aware of any business logic, which stays in your Rails app. The link between the two is an additional process for your Rails app, the RPC process, which runs your channel code: authentication, subscriptions, and whatever else your app does. The two communicate over gRPC, so you must take this into account when designing the system for load balancing, as we'll see later.
In this post we'll see:
how to switch from regular ActionCable to AnyCable
how to deploy the required components to Kubernetes
how to solve an issue with load balancing
how to perform some benchmarks.
Let's dive in.
Setting up AnyCable in Rails
The basic setup for AnyCable simply requires you to add the anycable-rails gem and run the bundle exec rails g anycable:setup
command to create/update some config files. One of these files is config/cable.yml,
which is the default config file for ActionCable. In here you need to change the broadcasting adapter to any_cable, i.e.:
production:
  adapter: any_cable
Then you need to specify the URL for your Redis instance (required for pub/sub) in config/anycable.yml:
production:
  redis_url: redis://....:6379
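If you prefer environment variables over YAML, AnyCable also picks up ANYCABLE_-prefixed variables (it uses the anyway_config gem for settings); a minimal sketch, with the same placeholder Redis URL as above:
# equivalent to setting redis_url in config/anycable.yml
export ANYCABLE_REDIS_URL=redis://....:6379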
Next, you need to edit the config file for the target environment and change the cable URL:
# config/environments/production.rb
config.action_cable.url = "wss://domain.com/ws-path"
It's up to you whether you want to use a separate hostname for the web sockets or just a dedicated path in your app's main domain. For the chat feature we are building for Brella I have opted for a path in the same domain, as we'll see later when talking about ingress.
With this, you can now start the Rails part of AnyCable (the RPC process) with the following command:
RAILS_ENV=production bundle exec anycable
Of course, you also need to run the Go process to complete the setup. First, you need to install it (e.g., brew install anycable-go
on macOS; see also this), then run anycable-go --host=localhost --port=8080
to start it. AnyCable-Go will connect to the RPC process on port 50051 by default, but you can customize this with the --rpc_host ip:port
argument.
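For a quick local smoke test you can run the two processes side by side in separate terminals; a minimal sketch using only the flags mentioned above (host, port and RPC address are just example values):
# terminal 1: the Rails-side RPC process
RAILS_ENV=production bundle exec anycable --rpc-host=localhost:50051

# terminal 2: the Go WebSocket server, pointed at the RPC process
anycable-go --host=localhost --port=8080 --rpc_host=localhost:50051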
That's basically it for a quick test setup with AnyCable. WebSockets will now be handled by AnyCable instead of regular ActionCable, as long as the clients use the new cable URL.
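On the client side nothing needs to change, since AnyCable speaks the same protocol as ActionCable; a minimal sketch with the @rails/actioncable package (the URL is the example from above, and MyChannel is just an illustrative channel name):
import { createConsumer } from "@rails/actioncable"

// point the consumer at the AnyCable endpoint instead of the default /cable
const consumer = createConsumer("wss://domain.com/ws-path")

consumer.subscriptions.create("MyChannel", {
  received(data) {
    console.log(data)
  },
})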
In most cases you won't need to change much else in order to swap ActionCable for AnyCable. However, while compatibility is great, there are still a few differences. For example, you cannot use regular instance variables in a channel class, because of the way connection objects are handled by AnyCable. So instead of something like this:
class MyChannel < ApplicationCable::Channel
  def subscribed
    @somevar = ...
  end

  def send_message
    do_something_with @somevar
  end
end
You'll have to use state_attr_accessor:
class MyChannel < ApplicationCable::Channel
  state_attr_accessor :somevar

  def subscribed
    self.somevar = ...
  end

  def send_message
    do_something_with somevar
  end
end
Just a small thing to remember. For the channels in the Brella backend I didn't need to change anything else, but take a look at this page for more info on some of the differences you may encounter.
Deployment in Kubernetes
1. The Go process
The easiest way to deploy AnyCable-Go is with the official Helm chart. Take a look at the README for details on which configuration options you can set. In my case I am installing the chart with these settings:
helm repo add anycable https://helm.anycable.io/
helm upgrade --install \
--create-namespace \
--namespace myapp \
--set anycable-go.replicas=3 \
--set anycable-go.env.anycablePath=/ \
--set anycable-go.env.anycableRedisUrl=redis://....:6379 \
--set anycable-go.env.anycableRpcHost=myapp-rpc:50051 \
--set anycable-go.env.anycableLogLevel=debug \
--set anycable-go.serviceMonitor.enabled=true \
--set anycable-go.ingress.enable=false \
--set anycable-go.env.anycableHeaders='authorization\,origin\,cookie' \
anycable-go anycable/anycable-go
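Once the release is installed, a quick sanity check is to look at the pods and tail the logs of one of them to confirm it connected to Redis and the RPC service (the pod name placeholder below is whatever your release produced):
# list the pods in the namespace and pick one of the anycable-go pods
kubectl get pods -n myapp

# tail its logs to confirm it started and connected correctly
kubectl logs -n myapp <anycable-go-pod-name> --tail=50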
2. The RPC process
The RPC can be a regular Kubernetes deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-rpc
  labels:
    app.kubernetes.io/name: myapp-rpc
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 50%
      maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: myapp-rpc
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp-rpc
    spec:
      containers:
        - name: myapp
          image: ...
          command:
            - bundle
            - exec
            - anycable
            - --rpc-host=0.0.0.0:50051
            - --http-health-port=54321
          securityContext:
            allowPrivilegeEscalation: false
          ports:
            - name: http
              containerPort: 50051
              protocol: TCP
          readinessProbe:
            httpGet:
              path: "/health"
              port: 54321
              scheme: HTTP
            initialDelaySeconds: 25
            timeoutSeconds: 2
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 10
          env:
            - name: ANYCABLE_RPC_POOL_SIZE
              value: "40"
            - name: MAX_THREADS
              value: "50"
This is the relevant part taken from the manifest I use for our Helm chart. The important bits are:
the command: here we run the anycable RPC process bound to 0.0.0.0 so it can be reached from outside the pod; we also specify the port for the health check endpoint;
the port, which is by default 50051
the readiness probe, which uses the health check to ensure the pod is available to process requests only when ready to do so
a couple of environment variables to configure AnyCable's own thread pool size as well as the database pool size (typically database.yml sets the db pool size to the value of the MAX_THREADS env variable, especially with Puma-based apps; adjust the variable name if needed, and see the sketch below). Make sure the db pool size is greater than the thread pool size.
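As mentioned in the last point, here is a minimal database.yml sketch tying the connection pool to the same variable; adjust it to your own adapter and defaults:
# config/database.yml (excerpt)
production:
  pool: <%= ENV.fetch("MAX_THREADS") { 5 } %>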
Next we need a service:
apiVersion: v1
kind: Service
metadata:
  name: myapp-rpc
  labels:
    app.kubernetes.io/name: myapp-svc-rpc
spec:
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP
      name: rpc
  selector:
    app.kubernetes.io/name: myapp-rpc
This is what AnyCable Go will be connecting to.
3. Ingress
You can, if you want, enable a separate ingress for AnyCable Go via its Helm chart settings, but I prefer to keep WebSockets under a path on the app's main domain:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  labels:
    app.kubernetes.io/name: myapp
spec:
  tls:
    - hosts:
        - domain.com
      secretName: domain-tls
  rules:
    - host: domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: "myapp-web"
                port:
                  number: 3000
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: "myapp-anycable-go"
                port:
                  number: 8080
This will ensure that all the requests going to domain.com/ws actually go to the AnyCable Go process.
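A quick way to smoke-test the routing is a manual WebSocket handshake with curl; a 101 Switching Protocols response means the request reached AnyCable-Go. This is only a sketch: the headers are the standard handshake ones and the key is the RFC 6455 sample nonce.
curl -i -N --http1.1 https://domain.com/ws \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ=="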
gRPC Load Balancing
If you try this setup it should work, but if you look closely you will notice that only one of the RPC pods actually handles requests, while the others sit idle.
This is because gRPC requires per-request load balancing, while Kubernetes' default Service types perform connection-based load balancing. One workaround is a headless service, but that only works if the gRPC client can perform DNS-based load balancing by itself. In any case, I prefer using a service mesh: not only does it fix the load balancing issue with gRPC, it also improves observability and security for your deployments, so it's convenient to have anyway.
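For reference, the headless service mentioned above is just a regular Service with clusterIP set to None, so DNS resolves to the individual pod IPs instead of a single virtual IP; a minimal sketch based on the RPC service defined earlier (the name is hypothetical):
apiVersion: v1
kind: Service
metadata:
  name: myapp-rpc-headless
spec:
  clusterIP: None            # headless: DNS returns the pod IPs directly
  ports:
    - port: 50051
      targetPort: 50051
      protocol: TCP
      name: rpc
  selector:
    app.kubernetes.io/name: myapp-rpc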
For the service mesh I use Linkerd, which is very fast and lightweight. Linkerd automatically injects a proxy container into the RPC pods that intercepts all requests to the pod. This proxy understands gRPC and balances individual requests across the pods, so the load balancing problem goes away.
Setting up Linkerd for this is very easy. First, you need to install the CLI:
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh
and verify that you can install it in your cluster:
linkerd check --pre
Assuming that all the checks are green, you can proceed with the installation:
linkerd install | kubectl apply -f -
And ensure again that everything went smoothly:
linkerd check
Then you need to add an annotation to the spec template for the RPC pods:
annotations:
  linkerd.io/inject: enabled
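To be precise, the annotation goes on the pod template inside the Deployment, not on the Deployment metadata itself; in the manifest from earlier it would sit here:
  template:
    metadata:
      labels:
        app.kubernetes.io/name: myapp-rpc
      annotations:
        linkerd.io/inject: enabled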
Also, you need to reinstall AnyCable Go with this additional setting:
--set "pod.annotations.linkerd\.io\/inject=enabled"
That's it! This will make sure that when the pods are created, they are injected with the Linkerd proxy containers and the grpc load balancing will work as expected.
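You can verify the injection worked by checking that the RPC pods now run two containers (your app plus the linkerd-proxy) and by running Linkerd's data plane checks against the namespace:
# the READY column should show 2/2 for the meshed pods
kubectl get pods -n myapp

# verify the injected proxies are healthy
linkerd check --proxy -n myapp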
You don't need to, but I recommend you install the awesome viz extension for observability of the "meshed" services:
linkerd viz install | kubectl apply -f -
linkerd check
Then, to open the dashboard:
linkerd viz dashboard
to see lots of useful metrics in real time.
Bonus: private GKE clusters
At Brella we use private GKE clusters for our staging and production environments. "Private" means that neither the control plane nor the nodes are reachable directly from the Internet, which is awesome for security. In order to access these clusters, we use a bastion host with a proxy for the Kubernetes API; the bastion host is also private, so we can only access it with authenticated connections via the Google Identity Aware Proxy. This is a bit out of scope for this article, but I wanted to mention that if you, too, use private GKE clusters, you'll run into an issue with Linkerd not working properly. I won't go into details here, but I recommend you read this page on how to set up a firewall rule that fixes this issue.
Benchmarking web sockets
Depending on your use case, you may want to run some benchmarks in order to understand whether your needs warrant the switch from ActionCable to AnyCable. I took inspiration from this page by the authors of AnyCable (I used smaller instances for my tests though), and installed websocket-bench.
To use the benchmark as in the examples, I created a test channel at app/channels/benchmark_channel.rb:
class BenchmarkChannel < ApplicationCable::Channel
  STREAMS = (1..10).to_a

  def subscribed
    Rails.logger.info "a client subscribed"
    stream_from "all#{STREAMS.sample if ENV['SAMPLED']}"
  end

  def echo(data)
    transmit data
  end

  def broadcast(data)
    ActionCable.server.broadcast "all#{STREAMS.sample if ENV['SAMPLED']}", data
    data["action"] = "broadcastResult"
    transmit data
  end
end
Then I ran the benchmark while monitoring logs and Linkerd's dashboard for both the Go and RPC pods:
websocket-bench broadcast $WS_URL --concurrent 8 --sample-size 100 --step-size 1000 --payload-padding 200 --total-steps 10 --server-type=actioncable
This tests with batches of clients until it reaches 10K connections, all sending some messages. From my initial experimentation I found that, with a few replicas for both the Go process and the RPC process, I saw a median RTT of 200ms with 1K clients and 700ms with 10K clients, and this was on fairly slow E2 instances, meaning that performance should be better on the C2 instances we use in production.
Wrapping up
I wrote this post quickly since a few people asked me about it. All in all, I am very impressed with the progress AnyCable has made (I had used it in the past, but those were early days; it's much better today). It requires some setup compared to the zero setup of regular ActionCable, but it solves the WebSocket performance and scalability problem while letting you keep everything in house.
I love it; it's currently one of my favorite projects. Having said that, I recommend you always consider whether your WebSocket needs actually require more than what regular ActionCable can offer, given the additional setup involved with AnyCable.
In our case, we are an event management platform where networking is the killer feature: being able to exchange messages with other attendees before meeting them at an event is a must-have for us. We also wanted to drop the dependency on a third-party service, while still being able to scale WebSockets when big events cause significant spikes in chat usage. Your mileage may vary, so always evaluate whether ActionCable is good enough for your use case.