May 10, 2023 • 21 min read

Optimizing our Encryption Engine

The Evervault Encryption Engine (E3) is an internal service that handles all crypto operations for our Relay & Cages products. A typical task for E3 is to iterate through a JSON object and encrypt/decrypt relevant fields. As E3 handles secret keys, we run it in an Enclave. By design, Enclaves have limited observability, which can make issues difficult to diagnose in production.

Recently we onboarded a new customer, let's call them Acme, who was using Response Encryption with Outbound Relay. Acme started running batch jobs every 15 minutes, sending lots of traffic through Outbound Relay. Acme required that all of the JSON fields in each response be encrypted by E3.

We have synthetic clients which are designed to simulate the experience of a regular user. For Outbound Relay, they send a request which requires 100 encryptions. If the latency of these synthetic requests goes over a set limit, an alarm goes off to warn us of degraded performance. Once these Acme batch jobs started running, the synthetic alarms began firing.

Quick fix

Our first priority was to ensure our other customers wouldn't be affected for long. We doubled the number of instances of Outbound Relay and E3 in the region Acme was sending traffic through. Thankfully, this brought the performance of the synthetics back to a reasonable level, but they were still alarming more often than they had previously.

Setting up a load testing client

Once we'd calmed down a bit, it was time to try to reproduce the behavior which caused the synthetic alarms to fire. We analyzed Relay's logs and discovered that during the time period of the synthetic errors, the latency of Relay's requests to E3 was spiking. To investigate this, we needed a load testing tool that would allow us to analyze how different traffic patterns affected E3's performance. We decided to go with Artillery.

We figured that rather than load testing via requests through Relay, making requests directly to E3 would be more effective for the following reasons:

  • By removing the reliance on Relay, we would be able to set up a load testing environment containing only the failing service, which we believed was E3.

  • We wanted to use Artillery's convenient latency statistics to measure E3's performance. Making requests through Relay would skew these latencies, as they would include the additional network hops and the operations done by Relay, so our load testing client needed to send requests directly to E3. We couldn't use any of Artillery's built-in engines, such as the HTTP engine, because of the way Relay communicates with E3:

      ◦ It uses an internal serialization crate, written in Rust, to generate MessagePack-serialized RPC messages.

      ◦ All connections to E3 must be made over mTLS.

We decided to write our own Artillery engine for load testing E3 directly. My search queries of "how to make a custom artillery engine" and "how to use artillery" probably have me on the NSA watchlist, but we eventually found the resources we needed. The Artillery team was also very helpful, swiftly answering all of our queries. We decided to build off the existing artillery-engine-tcp Node.js repo, and used the advice in this issue discussing custom engines. Although this custom TCP engine hadn't been touched in 5 years, it was still compatible with Artillery! Props to Artillery for maintaining such strong backward compatibility.

To configure mTLS, we simply swapped out the net package for tls and gave the engine access to an mTLS client cert. We used napi to generate Node bindings for our Rust serialization crate. This allowed the engine to generate requests asking E3 to perform a specified number of encryptions. These bindings return Node Buffers containing the serialized messages, which we send to E3 over the mTLS socket.
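
For illustration, here's a minimal sketch of what such a binding could look like, assuming the napi/napi-derive and rmp-serde crates; the function name and message shape are hypothetical rather than our internal crate's actual API:

use napi::bindgen_prelude::Buffer;
use napi_derive::napi;
use serde::Serialize;

// Hypothetical RPC message shape, for illustration only.
#[derive(Serialize)]
struct EncryptRequest {
    id: u32,
    num_encrypts: u32,
}

// Exposed to Node.js; returns a Buffer containing the MessagePack-serialized message.
#[napi]
pub fn build_encrypt_request(id: u32, num_encrypts: u32) -> Buffer {
    let msg = EncryptRequest { id, num_encrypts };
    let bytes = rmp_serde::to_vec(&msg).expect("serialization should not fail");
    Buffer::from(bytes)
}

The engine then calls this from JavaScript and writes the returned Buffer straight to the mTLS socket.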

Once we had our custom engine set up, we unlocked the power of Artillery. We could now simulate complex traffic scenarios by simply making changes to a config file.

First load test

We took a look at the traffic pattern during the time periods when the synthetics were alarming, and found that a load test with the following behaviour would replicate it: 2 requests per second for 90 seconds, where:

  • 80% of requests require 30 encryptions
  • 18% of requests require 150 encryptions
  • 2% of requests require 2,000 encryptions

Here's the corresponding Artillery config file to run this exact scenario, using our custom E3 engine:

config:
  target: "<ip-of-E3-instance>"
  e3:
    port: 3443
    key: <path-to-mTLS-client-key>
    cert: <path-to-mTLS-client-cert>
    rpc: <path-to-serialization-bindings>

  phases:
    - arrivalRate: 2
      duration: 90

  engines:
    e3: {}

scenarios:
  - name: "Send 30 encryptions"
    engine: e3
    weight: 80
    flow:
      - count: 1
        loop:
        - send:
            numEncrypts: 30

  - name: "Send 150 encryptions"
    engine: e3
    weight: 18
    flow:
      - count: 1
        loop:
        - send:
            numEncrypts: 150

  - name: "Send 2000 encryptions"
    engine: e3
    weight: 2
    flow:
      - count: 1
        loop:
        - send:
            numEncrypts: 2000

We ran this against an isolated E3 instance, and it passed with flying colours! Which was a bad thing… Going into the load test, we expected the latency of the small requests to increase over time due to the influx of large requests, but it didn't.

We needed to figure out what the difference was between our load test and what was happening in Relay. We found in Relay's logs that for a given number of encryptions, Relay's requests to E3 took way longer (~5x) than our requests directly to E3. As both requests were made within AWS VPCs, we doubted network latency would cause such a discrepancy.

It was time to put our John Carmack hats on and step into the debugger. We ran Relay and E3 locally and stepped through what happens during Outbound Response Encryption. We noticed that one difference was the version of the encrypted strings returned from E3. Outbound Response Encryption doesn't have any configuration for which elliptic curve to use, and it was defaulting to using the secp256k1 (k1) curve. This wasn't necessarily a shocker, as this was just an older version of our encryption scheme, which is still used in various services. This was the only big difference we could find compared to our load test, though, so we needed to investigate.

Benchmarking our Rust Cryptography crate

We have another internal Rust crate for our crypto operations. We plan to open-source this soon (and we mean it, like how we recently open-sourced Cages). We used criterion to add benchmarks to this crate for encrypting data using the k1 curve and the secp256r1 (r1) curve. For some background on where these curves are used in our encryption scheme, here are the relevant steps:

  • Each Evervault app (tenant) has a set of elliptic keypairs called the app keypairs. This set includes a k1 keypair and an r1 keypair
  • For every encryption, another elliptic keypair is generated, either k1 or r1, called the ephemeral keypair
  • The ephemeral private key is used with the relevant app public key to derive a shared secret
  • This shared secret is used as a symmetric key for AES-encrypting the data

There are some extra details in our encryption scheme, such as running a KDF on the shared secret after derivation, but the steps listed above are the most relevant for this post. The chosen curve matters when generating the ephemeral keypair and when deriving the shared secret. When we ran our benchmarks on our local dev machines (M1 MacBook Pros), we found that encrypting a string was 4x faster using the r1 curve than using the k1 curve. This was a shocker. We ran the same benchmarks on the instance type E3 was running on (m5.xlarge EC2, x86), and saw that r1 encryption was 5x faster than k1. This explained the 5x difference between our load testing client and Relay. This finding demonstrates the value of load testing a service directly: if we had been load testing E3 via Relay, we likely wouldn't have noticed it.
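
To make this concrete, here's a minimal sketch of what such a criterion benchmark could look like, assuming the scheme is implemented with rust-openssl roughly as described above; the function names are illustrative and the KDF step is omitted:

use criterion::{criterion_group, criterion_main, Criterion};
use openssl::derive::Deriver;
use openssl::ec::{EcGroup, EcKey};
use openssl::nid::Nid;
use openssl::pkey::{PKey, Public};
use openssl::symm::{encrypt_aead, Cipher};

// One field encryption: ephemeral keygen, ECDH, then AES-256-GCM.
fn encrypt_field(curve: Nid, app_public: &PKey<Public>, plaintext: &[u8]) -> Vec<u8> {
    // 1. Generate an ephemeral keypair on the chosen curve.
    let group = EcGroup::from_curve_name(curve).unwrap();
    let ephemeral = PKey::from_ec_key(EcKey::generate(&group).unwrap()).unwrap();

    // 2. Derive a shared secret from the ephemeral private key and the app public key.
    let mut deriver = Deriver::new(&ephemeral).unwrap();
    deriver.set_peer(app_public).unwrap();
    let secret = deriver.derive_to_vec().unwrap();

    // 3. Use the secret (KDF'd in the real scheme) as an AES-256-GCM key.
    //    The key is fresh per call, so a fixed IV is fine for this sketch.
    let mut tag = [0u8; 16];
    encrypt_aead(Cipher::aes_256_gcm(), &secret, Some(&[0u8; 12]), &[], plaintext, &mut tag).unwrap()
}

fn bench_curves(c: &mut Criterion) {
    for (name, curve) in [("k1", Nid::SECP256K1), ("r1", Nid::X9_62_PRIME256V1)] {
        // The app keypair is fixed; only its public half is needed for encryption.
        let group = EcGroup::from_curve_name(curve).unwrap();
        let app_key = PKey::from_ec_key(EcKey::generate(&group).unwrap()).unwrap();
        let app_public = PKey::public_key_from_pem(&app_key.public_key_to_pem().unwrap()).unwrap();

        c.bench_function(&format!("encrypt one field ({name})"), |b| {
            b.iter(|| encrypt_field(curve, &app_public, b"hello world"))
        });
    }
}

criterion_group!(benches, bench_curves);
criterion_main!(benches);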

We added some more configuration options to our Artillery engine so we could set the curve used in the encryption requests. Once that was in place, we re-ran the load test, specifying the k1 curve for encryption, and finally saw the behavior we expected.

To show the effect large requests had on small requests in our load test, the plot below displays metrics for requests with 30 encryptions and requests with 2k encryptions. We estimate encryption speed (encrypted fields per second) by dividing the number of encrypted fields in each request by the request's latency. The plot shows that when multiple large requests arrived within a few seconds, small requests experienced a degraded encryption speed.

Now that we had replicated the behaviour, we decided to switch the default curve used by Outbound Response Encryption to r1. We figured that this would both mitigate the current synthetic issues, and improve the latency of our users' requests.

Let's look at some production traffic to show the effect of the switch to r1. The plot below shows the encryption speed of Outbound Relay's requests to E3. Only requests which require >5k encryptions are included. This ensures that the latency consists primarily of work done by E3 and isn't skewed by network latency. As expected, the encryption speed went up 5x after the curve change.

Optimization: Switch the default curve
Impact: 5x increase in encryption speed

Increasing the capacity of an E3 instance

Our load tests showed us that as few as two big requests could cause degraded performance for the remaining requests. Sure, we've increased the speed by 5x now, but that just means it would now take two 10k encryption requests, rather than two 5k encryption requests, to produce the same behaviour. We needed to understand why two was the magic number.

This one was obvious enough. The machine that E3 runs on has 4 vCPUs. As E3 runs in an Enclave, the Enclave is allocated a subset of these cores. The host instance needs to be allocated an even number of vCPUs, as vCPUs are paired, which leaves 2 vCPUs for E3. Encryption is a blocking operation, so if a request comes in to do 2k encryptions, 1 vCPU will be occupied until those 2k encryptions are complete. It follows that if two large requests come in at the same time, both of E3's vCPUs will be occupied and unable to process any more incoming requests.

To give an instance more capacity, it needs more vCPUs. We decided to switch from xlarge instances to 2xlarge instances. 2xlarge instances have 8 vCPUs in total, which leaves 6 vCPUs for E3. The difference this makes in cost-efficiency is best demonstrated by an example of switching from four xlarge instances to two 2xlarge instances:

Instance Type | Cores per instance | E3 cores per instance | Num. instances | Total E3 cores
m5.xlarge     | 4                  | 2                     | 4              | 8
m5.2xlarge    | 8                  | 6                     | 2              | 12

As a 2xlarge instance costs exactly double the xlarge price, the total spend stays the same while the E3 core count rises from 8 to 12, i.e. 50% more E3 cores per dollar.

Optimization: Increase the number of vCPUs per E3 instance
Impact: Each instance can handle more requests in parallel, at the same cost

Phantom Requests

After the optimizations made so far, E3 should have had enough speed and capacity to handle Acme's batch jobs. The number of synthetic errors decreased, but there were still more than before Acme arrived, so we needed to dig into the remaining errors. They were all occurring during Acme batch jobs, which suggested they were related. We trawled through the logs and found one similarity between all of the remaining errors: before every failed synthetic request, a large (>300KB) Acme encryption request had been made to the same E3 instance. As a single request only uses one E3 vCPU, we didn't expect it to affect other requests, even though it was large.

This led us to another observation: all requests >300KB were failing. The encryption request would be sent to E3, and Relay would never receive a response, even after 5 minutes. The requests seemed to just disappear once they left Relay.

Time to reproduce. We configured our load testing client to send a similar payload size (60,000 encryptions), and it passed with flying colors again! Which we again didn't expect… Back to the debugger. We were able to replicate the issue locally. On stepping through the flow, we eventually found that E3 would only receive ~200KB of the request, and then Relay would give up on sending it. Not great.

There was a subtle bug. When sending the request to E3, Relay uses tokio's write_all function to write all of the bytes in the request to the socket. What we didn't realize was that write_all doesn't flush the bytes, i.e. it doesn't guarantee that all of the bytes have been sent to the destination by the time write_all(request).await finishes. In our case, they appeared to just never be written. When debugging, we discovered that they would be written eventually, but only the next time that socket was used to send a request (we use connection pooling, so the socket is eventually re-used).

It seems likely that this is what caused the synthetics to error afterwards: the synthetic request would arrive, the old socket would be re-used while it was still in a bad state, and Relay would fail to send the encryption request to E3.

The fix for this was simple. We just needed to add a call to flush directly after our call to write_all. The large requests started succeeding, and the remaining synthetic errors disappeared.
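
In code, the change is roughly the following, sketched here against a generic tokio writer rather than Relay's actual connection type:

use tokio::io::{AsyncWrite, AsyncWriteExt};

async fn send_to_e3<S>(stream: &mut S, request: &[u8]) -> std::io::Result<()>
where
    S: AsyncWrite + Unpin,
{
    // write_all hands every byte to the writer, but a buffered or TLS-wrapped
    // writer may hold on to them...
    stream.write_all(request).await?;
    // ...so flush to guarantee the bytes are actually pushed down to the socket.
    stream.flush().await?;
    Ok(())
}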

Optimization: Ensuring all bytes are flushed to the socket in Relay requests to E3
Impact: The number of encryptions in a single request is no longer limited by the payload size

Now that our synthetic errors were completely eradicated, we were able to downscale E3 to pre-incident levels!

While we're at it… E3 is on the critical path in Evervault. It is vital for both our security and our availability. This has made us reluctant to make optimizations or refactors to E3, as the risk usually seems to outweigh the reward. "If it ain't broke…" etc. etc. Now that E3 had begun to cause problems, it gave us an excuse to optimize it further while we were working on it. Here are some more changes we made.

Hardware

Considering ARM machines are all the rage nowadays, we decided to try running our benchmarks on AWS's Graviton machines. They're cheaper than the Intel machines, and ARM is also supposed to be faster, right? We ran benchmarks on a c7g instance and found that it was faster for k1 encryptions 😁 (~20%), but slower for r1 encryptions 😑 (~25%), which is now our default curve. Intel's hardware optimizations for the r1 curve seemed superior to Graviton's. This was a bit disappointing.

While thinking about instance types though, we realized that m5 instances aren't necessarily the best Intel instance for E3. They are general-purpose machines, whereas E3 is currently limited by its compute speeds. We ran our benchmarks on a c5 (compute-optimized) instance, and they were faster all round (~17% each)! This was great news, as c5 is also cheaper than m5.

Here's another plot of production data to show the increase in encryption speed. Note: we actually deployed this change before switching to the r1 curve, which is why the encryption speed was still down around 1k encryptions per second.

Optimization: Switching to compute-optimized (Intel) instances
Impact: Encryptions are 17% faster, at a lower price

Dedicated Networking Thread

Until this point, E3 had used a single tokio runtime for all of its work, both networking and encryption. When E3 became congested by large payloads, we noticed that the blocking crypto operations starved E3's networking. In one sense, this served to prevent additional work from being scheduled on congested instances. But it could also result in a congested instance being marked unhealthy, as E3 couldn't receive and respond to health checks.

As part of our optimizations, we refactored E3's runtime. All of the blocking operations were updated to use tokio's spawn_blocking API, moving them off the runtime's primary thread pool and onto a set of dedicated blocking threads. We limited the number of blocking threads to keep a thread available for networking, making sure a congested E3 instance could always respond to health checks.
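
Here's a minimal sketch of the idea, with a stand-in for the real crypto routine and illustrative thread counts; the exact limits depend on the vCPUs allocated to the Enclave:

use tokio::runtime::Builder;
use tokio::task;

// Stand-in for E3's blocking crypto routine (illustrative only).
fn encrypt_payload(fields: Vec<String>) -> Vec<String> {
    fields.into_iter().map(|f| format!("ev:{f}")).collect()
}

fn main() -> std::io::Result<()> {
    // Keep a worker thread free for networking/health checks, and cap the
    // blocking pool so crypto work can't consume every core.
    let runtime = Builder::new_multi_thread()
        .worker_threads(1)
        .max_blocking_threads(5)
        .enable_all()
        .build()?;

    runtime.block_on(async {
        let request = vec!["field_1".to_string(), "field_2".to_string()];
        // Heavy crypto runs on the dedicated blocking threads, not on the
        // async worker threads that handle sockets and health checks.
        let response = task::spawn_blocking(move || encrypt_payload(request))
            .await
            .expect("crypto task panicked");
        println!("{response:?}");
    });
    Ok(())
}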

Optimization: E3 uses a dedicated blocking thread pool for crypto operations
Impact: Congested E3 instances remain healthy

Cancel Requests

Sometimes, clients will time out when making requests to Relay, usually because the destination server takes too long to respond. Occasionally, the client times out while E3 is still doing encryptions for that request. In these cases, E3 would carry on and finish its workload, even though the encrypted data will never be used. Ideally, E3 should never do unnecessary work.

To ensure E3 can free itself from unnecessary work, we added a new RPC message for Cancel Requests. A cancel request contains the ID of a message which no longer needs to be processed. To support cancellations in E3, we updated our crypto library to take an optional parameter: a tokio sync channel receiver. After every encryption in the payload, the library checks whether a message has been received on the channel, and if so, the operation exits early.

Now when E3 receives a new encryption request, it creates a sync channel and writes its sender into a hashmap under the message ID. Whenever a cancel request comes in, we check the hashmap for a sender corresponding to the ID in the cancel request. If the sender exists, we send an empty message into the channel to cancel the remaining encryptions.
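
Here's a rough sketch of that plumbing, assuming a tokio mpsc channel per request and a shared map of senders keyed by message ID (all names are illustrative; recent tokio versions expose try_recv on the receiver):

use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use tokio::sync::mpsc;

type CancelMap = Arc<Mutex<HashMap<u64, mpsc::Sender<()>>>>;

// Blocking encryption loop: bail out early if a cancel signal has arrived.
fn encrypt_fields(fields: Vec<String>, mut cancel: Option<mpsc::Receiver<()>>) -> Vec<String> {
    let mut out = Vec::new();
    for field in fields {
        if let Some(rx) = cancel.as_mut() {
            if rx.try_recv().is_ok() {
                // The client has gone away; stop doing unnecessary work.
                break;
            }
        }
        out.push(format!("ev:{field}")); // stand-in for the real encryption
    }
    out
}

// Called when E3 receives a new encryption request.
fn register_request(cancels: &CancelMap, message_id: u64) -> mpsc::Receiver<()> {
    let (tx, rx) = mpsc::channel(1);
    cancels.lock().unwrap().insert(message_id, tx);
    rx
}

// Called when E3 receives a cancel request for `message_id`.
fn cancel_request(cancels: &CancelMap, message_id: u64) {
    if let Some(tx) = cancels.lock().unwrap().remove(&message_id) {
        // Ignore the error: the request may already have completed.
        let _ = tx.try_send(());
    }
}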

Optimization: Relay can tell E3 to cancel the work it's doing
Impact: E3 is doing less unnecessary work

Per-request Encryption Keys

For every encryption, we generate a new ephemeral keypair and derive a new shared secret to use as the AES key. These two operations take up the majority of the per-string encryption time, as the actual AES encryption of the bytes is very fast. We decided to consider per-request encryption keys. This means that when a payload needs to be encrypted, all of the data in the payload uses the same ephemeral keypair and AES key. This changes the security model slightly:

  • Before, using a per-string ephemeral keypair: if an attacker managed to get the AES key of an encrypted string, they would only be able to decrypt that string.
  • After, using a per-request ephemeral keypair: if an attacker managed to get the AES key of an encrypted string, they would be able to decrypt that string and any other string which was encrypted in the same request to E3.

The impact of this depends on how many strings a user typically encrypts in a request. If they only encrypt one string at a time, every string in their database will correspond to a different AES key. If they encrypt 100 strings at a time, there will be sets of 100 strings in their database which correspond to the same AES key. In our backend SDKs, we take a similar approach, where the keys are re-generated whenever a time limit is reached.

This might sound worrying, but there is currently no feasible way for attackers to derive one of the AES keys of an encrypted string. So this change is not risky.

The performance improvements of this change are huge. After the first string in a payload has been encrypted, only the AES encryption step is needed for the remaining strings; the keys are generated just once per request.
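
Here's a minimal sketch of the per-request approach, again assuming a rust-openssl implementation along the lines of the earlier benchmark sketch (illustrative names, KDF omitted):

use openssl::derive::Deriver;
use openssl::ec::{EcGroup, EcKey};
use openssl::nid::Nid;
use openssl::pkey::{PKey, Public};
use openssl::symm::{encrypt_aead, Cipher};

fn encrypt_request(app_public: &PKey<Public>, fields: &[&[u8]]) -> Vec<Vec<u8>> {
    // Generate the ephemeral keypair and derive the shared secret ONCE per request...
    let group = EcGroup::from_curve_name(Nid::X9_62_PRIME256V1).unwrap();
    let ephemeral = PKey::from_ec_key(EcKey::generate(&group).unwrap()).unwrap();
    let mut deriver = Deriver::new(&ephemeral).unwrap();
    deriver.set_peer(app_public).unwrap();
    let aes_key = deriver.derive_to_vec().unwrap();

    // ...then each field only pays for the (fast) AES encryption.
    fields
        .iter()
        .enumerate()
        .map(|(i, field)| {
            // Every field still needs a unique IV under the shared key; a counter is used here.
            let mut iv = [0u8; 12];
            iv[..8].copy_from_slice(&(i as u64).to_be_bytes());
            let mut tag = [0u8; 16];
            encrypt_aead(Cipher::aes_256_gcm(), &aes_key, Some(&iv), &[], field, &mut tag).unwrap()
        })
        .collect()
}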

In our benchmarks, we found that after the first encryption in a request, the remaining encryptions would be up to 300x faster (about 500k per second). We decided this was the correct move.

Optimization: Generating the encryption keys once per-request
Impact: After the first encryption in a request, the remaining encryptions are up to 300x faster

However, this 500k per second number isn't what we see yet in requests from Relay to E3. The serialization/deserialization of the RPC messages back and forth currently caps the speed we can reach at ~70k encryptions/second. The next plot shows the impact of switching to per-request keys in production. Note: now that encryptions are so fast, network latency has a larger effect on our metric, which is why it looks less stable.

Further optimizations

While working on this project, we had some more ideas which we haven't yet implemented. Here are a few.

Decryption Key Caching

Now that we have per-request encryption keys, payloads sent to E3 for decryption may contain strings that were encrypted with the same key. As we iterate through a payload, rather than re-deriving the shared secret for every string, we could maintain a per-request map of ephemeral public keys to their corresponding derived secrets.
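
A sketch of the idea, with stand-ins for the real ECDH derivation and AES decryption:

use std::collections::HashMap;

// Stand-ins for the real ECDH derivation and AES decryption (illustrative only).
fn derive_shared_secret(_ephemeral_public_key: &[u8]) -> Vec<u8> {
    vec![0u8; 32]
}
fn aes_decrypt(_key: &[u8], ciphertext: &[u8]) -> Vec<u8> {
    ciphertext.to_vec()
}

// Each field arrives as (ephemeral public key, ciphertext); the expensive ECDH
// is only performed once per distinct ephemeral key in the payload.
fn decrypt_payload(fields: &[(Vec<u8>, Vec<u8>)]) -> Vec<Vec<u8>> {
    let mut secrets: HashMap<Vec<u8>, Vec<u8>> = HashMap::new();
    fields
        .iter()
        .map(|(ephemeral_public_key, ciphertext)| {
            let secret = secrets
                .entry(ephemeral_public_key.clone())
                .or_insert_with(|| derive_shared_secret(ephemeral_public_key));
            aes_decrypt(secret, ciphertext)
        })
        .collect()
}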

Configuring OpenSSL for hardware optimizations

E3 uses rust-openssl to interface with OpenSSL for its crypto operations. Rust-openssl uses openssl-src to create a statically linked build of OpenSSL. It is unclear if this build uses all of the hardware optimizations it could. For example, it seems that the enable-ec_nistp_64_gcc_128 flag is unset. This flag allegedly makes ECDH operations 2 to 4 times faster on x86 machines (see OpenSSL's Configuration Options section).

Serialization

In our benchmarks, we've seen our per-string encryption speed reach up to 500k encryptions/second. However, this isn't the throughput we see in requests from Relay to E3: the serialization/deserialization of the RPC messages back and forth currently caps the speeds we can reach at ~80k encryptions/second.

Migrate to using libsecp256k1

k1 encryptions are slow because OpenSSL isn't optimized for the k1 curve. This is the curve used in Bitcoin, so other libraries have tackled this problem. We plan to migrate to the secp256k1 crate for our k1 encryptions, as it boasts faster operations.
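
For example, the key agreement step could look roughly like this with the secp256k1 crate (assuming its rand support is enabled; note that this crate's SharedSecret hashes the ECDH point, so it wouldn't be a byte-for-byte drop-in for our current derivation):

use secp256k1::ecdh::SharedSecret;
use secp256k1::{rand, Secp256k1};

fn main() {
    let secp = Secp256k1::new();
    let mut rng = rand::thread_rng();

    // App keypair and per-request ephemeral keypair, both on the k1 curve.
    let (_app_secret, app_public) = secp.generate_keypair(&mut rng);
    let (ephemeral_secret, _ephemeral_public) = secp.generate_keypair(&mut rng);

    // Derive the shared secret used (after a KDF) as the AES key.
    let shared = SharedSecret::new(&app_public, &ephemeral_secret);
    println!("derived {} bytes", shared.secret_bytes().len());
}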

Conclusion

E3 is now much faster at encrypting payloads, especially when there are many encryptions in the request. We have also improved the cost efficiency of scaling E3 by switching to larger, compute-optimized instances. This exploration was a great learning experience for us, and we hope you got something out of it as well. To close, here's a plot which shows how our encryption speed improved throughout the project:

If this sparked any ideas about further optimizations we could make, feel free to reach out!

David Nugent

Engineer
