Running Code from Strangers

by
Ehsaan Iqbal
11 Jan 2022
5 min read

Running arbitrary code from strangers on the internet is hard. It's a security nightmare wrapped in a resource allocation puzzle, topped with scalability concerns — and it’s costly.

At Livedocs, we’re building a modern cloud-native notebook. Livedocs lets you combine Python, SQL, Dynamic text, and visualizations in one interactive canvas. When we started out, we wanted to build a way for data nerds and analytics-focused teams to start exploring data collaboratively with very little yak-shaving. 

Here’s the story of how the Livedocs runtime evolved over a few months of intense building, the lessons we picked up at each step, and where the architecture landed.

Scene 1: The Age of Python Begins

We chose Python because it’s the go-to language for data science and analytics. Most of our users are already familiar with Python, and it offers powerful libraries for data wrangling and visualization. Compared to JavaScript or other languages, Python is simply a better fit for data work, making it the obvious choice for our runtime.

Our first stab at this was using Pyodide, a port of CPython to WebAssembly, which runs Python directly in the browser. The appeal was clear:

  • The code ran safely in the user’s browser.
  • Users could execute code without hitting our backend.
  • Embedding Pyodide into our web app was refreshingly simple.

But running Python code alone wasn’t enough; we needed it to do something useful. So we exposed some helper functions, for example letting users ingest their data directly into a managed BigQuery instance, which they could later query with SQL.
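
To make that concrete, here’s roughly what the flow looked like from a user’s point of view. The helper name below (ingest_to_bigquery) is an illustrative stand-in for the functions we injected into the Pyodide environment, not the exact API we exposed.

```python
# Illustrative only: a stand-in for the ingestion helpers available inside the
# Pyodide notebook. The real helpers shipped the data to our backend.
import json
import pandas as pd

def ingest_to_bigquery(df: pd.DataFrame, table: str) -> None:
    """Stand-in helper: ship a dataframe to the backend for loading into BigQuery."""
    payload = df.to_json(orient="records")
    print(f"would POST {len(json.loads(payload))} rows for table {table!r}")

df = pd.DataFrame({"day": ["2022-01-10", "2022-01-11"], "signups": [12, 18]})
ingest_to_bigquery(df, table="signups_daily")  # later queryable with SQL in the same doc
```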

Simple and elegant – until it wasn’t. We quickly hit some walls:

  • Some requests to APIs were outright blocked due to CORS errors.
  • There was no way to schedule periodic data ingestion.
  • Requests ran just-in-time in the browser, and data then had to be shipped from the client to our backend through the helper functions, which only works for modest payload sizes and doesn’t really scale.

This approach clearly wasn’t going to hold up, and it didn’t last long.

Scene 2: The Cloud Functions Era

To address the limitations, we needed a way to:

  1. Run scripts on a schedule.
  2. Offload execution from the browser to the cloud.
  3. Escape CORS hell!

We thought these user-written scripts would be ideal to run on the edge in a lambda-like environment. After considering a few FaaS options, we chose Google Cloud Functions since our infrastructure was already on GCP. Users could quickly prototype in Livedocs using Pyodide, set a schedule, and let it run in the background. Data could still be pushed to BigQuery, but it was done from the cloud function using Pub/Sub.
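
As a rough sketch, the generated function wrapped the user’s script and handed its output to Pub/Sub. The project, topic, and wrapping strategy here are illustrative rather than our exact pipeline.

```python
# Illustrative sketch of a generated Cloud Function wrapping a user script.
# USER_CODE, the project, and the topic name are placeholders.
import json
import functions_framework
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "ingested-rows")

USER_CODE = """
import json, urllib.request
rows = json.load(urllib.request.urlopen("https://example.com/data.json"))
"""

@functions_framework.http
def run_user_script(request):
    scope = {}
    exec(USER_CODE, scope)                                # run the user's script on a schedule
    payload = json.dumps(scope.get("rows", [])).encode("utf-8")
    publisher.publish(topic_path, data=payload).result()  # hand the rows off toward BigQuery
    return "ok"
```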

[Figure: Snapshot of the Pyodide-based editor]

However, we soon encountered several pain points:

  • Slow deployment times and occasional failures to run.
  • Maintaining the pipeline was frustrating, with execution behavior differing across environments.
  • Compute and time limits on Cloud Functions, which weren’t flexible enough for heavy tasks.
  • The need to inject required imports and functions into user code, which is far from a simple string-concatenation job. We experimented with Tree-sitter before ultimately adopting Python’s ast module (see the sketch after this list).
  • Frequent Cloud Function invocations quickly drove up costs.
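
For the curious, here’s a toy version of the import-injection step with the ast module. The real transformation handled many more cases, but the core idea was the same.

```python
# Toy example: prepend `import pandas as pd` to a user snippet with Python's ast module.
import ast

user_code = "df = pd.read_csv('data.csv')\nprint(df.head())"

tree = ast.parse(user_code)
import_node = ast.Import(names=[ast.alias(name="pandas", asname="pd")])
tree.body.insert(0, import_node)     # inject the import at the top of the module
ast.fix_missing_locations(tree)

print(ast.unparse(tree))             # Python 3.9+: modified AST back to source
```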

While Cloud Functions were a step forward, the solution still felt piecemeal and fragmented. We realized we needed a more integrated system to connect Python and SQL, which led to our next big leap.

Scene 3: Firecracker to the Rescue?

We decided to provide each user with a complete, isolated environment to run their code, much like Jupyter notebooks. This setup offered us the flexibility, security, and scalability we needed. For the Python runtime, we chose the battle-tested IPython.

By dedicating a Python runtime through the IPython kernel to each document (our version of a notebook), we enabled users to tackle any task they required. To manage communication between the kernel (an ipykernel instance) and the web client, we built a service we affectionately call Middleman. Middleman communicates with the kernel using ZMQ over IPC, maintains a Directed Acyclic Graph (DAG) to keep our documents reactive, and allows multiple users to collaborate in real-time. We implemented Middleman in Rust, primarily for performance. While I won’t dive into all the details in this article, it’s a key component that keeps Livedocs responsive and fast—a story we’ll save for a future blog post.
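
Middleman itself is written in Rust, but the protocol it speaks is the standard Jupyter messaging protocol. As a rough illustration of what "talking to an ipykernel over ZMQ" involves, here is a minimal Python sketch using jupyter_client (the connection file name is made up):

```python
# Minimal sketch: execute code on a running ipykernel and read the result off
# the IOPub channel. The connection file path is a placeholder.
from jupyter_client import BlockingKernelClient

kc = BlockingKernelClient(connection_file="kernel-1234.json")
kc.load_connection_file()
kc.start_channels()

msg_id = kc.execute("1 + 1")
while True:
    msg = kc.get_iopub_msg(timeout=10)
    if msg["parent_header"].get("msg_id") != msg_id:
        continue                                   # ignore output from other requests
    if msg["msg_type"] == "execute_result":
        print(msg["content"]["data"]["text/plain"])    # -> "2"
    if msg["msg_type"] == "status" and msg["content"]["execution_state"] == "idle":
        break                                      # kernel finished this request
```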

We considered running multiple kernels as separate processes on a single powerful machine, but that would have required building low-level OS guardrails to control file system access, networking, and resource allocation (think cgroups) between the processes.

That’s when we had an idea: what about something like a VM, but without the usual overhead? Enter Amazon’s open-source Firecracker, originally developed for their Lambda runtime. Firecracker, described as a microVM, offered fast boot times, a strong security model, and simple resource control.

We immediately saw the potential to run individual kernels in Firecracker VMs, but managing the VMs ourselves felt overly complex. Luckily, Fly.io had already solved this, providing a managed environment for deploying lightweight VMs. Fly.io offered everything we needed:

  • The promise of “Boots in 250ms or Less”, which addressed cold-start issues.
  • Reasonable compute costs for the features we wanted to offer our users.
  • A clean API to work with.
  • Deployments close to customer locations.
  • Bonus: GPU support for the future.

The Plan:

Middleman would live in a single, high-powered VM, while multiple microVMs running the kernels would be deployed on Fly.io. A service layer would sit between Fly.io and us, so when users created a new doc, a Fly VM would spin up for them, and we’d proxy their traffic to that VM over the wire. 
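
In code, the per-doc provisioning step would have looked something like this against Fly’s Machines API. The app name, image, and sizing are placeholders, and the payload shape is based on Fly’s public API docs as we remember them, so treat it as a sketch rather than gospel.

```python
# Sketch: create one Fly machine per document via the Machines REST API.
# App name, image, and guest sizing are illustrative.
import os
import requests

FLY_API = "https://api.machines.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FLY_API_TOKEN']}"}

def create_kernel_vm(doc_id: str) -> dict:
    payload = {
        "name": f"kernel-{doc_id}",
        "config": {
            "image": "registry.fly.io/livedocs-kernel:latest",
            "guest": {"cpu_kind": "shared", "cpus": 1, "memory_mb": 512},
        },
    }
    resp = requests.post(f"{FLY_API}/apps/livedocs-kernels/machines",
                         json=payload, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()   # machine id + private address we'd proxy traffic to
```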

What more could we ask for? Life was good—until we hit a major snag.

Fly.io didn’t allow their VMs to have static public IPs (as of five months ago). This turned into a problem, as our users needed static IPs to whitelist for database connections. That one’s on me—should’ve read the docs. 🙂

One workaround was to set up a proxy service on GCP, routing all user traffic through a single, manageable endpoint. We could use WireGuard to route traffic from Fly VMs to our GCP proxy, but that would add another layer of complexity to manage.

Or, Fly.io could have provided a list of public IPs per region. (Spoiler: they didn’t.) After a few DMs to Fly’s CEO went unanswered, it was clear: time to pivot again.

The Final Act: Enter Kubernetes, the OG

With Fly.io ruled out, we turned to the mighty, complex, but powerful option: Kubernetes. We had previously avoided it for reasons familiar to anyone who has managed a k8s cluster—complexity, slow container boot times, and, especially, the challenges it poses for what we wanted to do: run code from strangers on the internet. However, for our needs, the benefits outweighed the drawbacks.

Livedocs Today

[Figure: 10,000-Foot Overview of Livedocs]

Our current architecture runs on GCP’s Kubernetes cluster, with each document assigned its own pod. Rather than the previously planned centralized Middleman setup, each pod now runs an independent instance of the Middleman + IPython kernel duo.

We designed a pod discovery service, Watchman, which allocates pods to documents. When a user creates a new doc, our backend queries Watchman to find and allocate an available pod based on the user’s plan and other specific requirements. Watchman then responds with the pod details, allowing the web client to connect to the allocated pod via WebSockets. With this model, we have achieved the following benefits:

  • Document-level isolation makes it easier to scale horizontally.
  • IPC communication between the kernel and the Middleman instance running in the same pod reduces latency significantly.
  • Improved control over pod lifecycles and stability.
  • Cost-effective GCP spot pod usage, enabling us to offer better machines to free users while providing additional compute for paid users.
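
To make the allocation step concrete, here is a rough sketch of what a Watchman-style lookup could look like with the official Kubernetes Python client. The namespace and label names are illustrative, not our actual scheme.

```python
# Sketch: find a ready pod matching the user's plan, then relabel it as allocated
# so no other document grabs it. Namespace and labels are placeholders.
from kubernetes import client, config

config.load_incluster_config()          # Watchman runs inside the cluster
v1 = client.CoreV1Api()

def allocate_pod(doc_id: str, plan: str = "free") -> str:
    pods = v1.list_namespaced_pod(
        namespace="kernels",
        label_selector=f"livedocs.io/state=ready,livedocs.io/plan={plan}",
    )
    if not pods.items:
        raise RuntimeError("no ready pods left in the buffer")
    pod = pods.items[0]
    v1.patch_namespaced_pod(
        name=pod.metadata.name,
        namespace="kernels",
        body={"metadata": {"labels": {
            "livedocs.io/state": "allocated",
            "livedocs.io/doc": doc_id,
        }}},
    )
    return pod.status.pod_ip    # the backend points the client's WebSocket here
```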

We also needed to minimize the delay between a user clicking "Create Doc" in the UI and being connected to a pod. Spinning up a pod on demand proved too slow, so we implemented a pool of pre-configured, ready-to-go pods that can be handed to new documents immediately. As pods are allocated, Watchman maintains a buffer of these ready pods, spawning new ones as needed to keep the buffer sized to current demand.

While effective, this approach has its downsides. Upgrading to a new container runtime version requires downtime to replace all pods. Furthermore, managing so many moving parts makes end-to-end testing complex. Observability is also critical here: monitoring pod activities in real time is essential to ensure performance and security. We use a combination of Prometheus and Grafana for detailed, real-time insights into pod health, resource consumption, and performance metrics.
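
One common way to wire this up is to have each pod expose an application-level /metrics endpoint for Prometheus to scrape. A minimal sketch of what per-kernel metrics could look like, with made-up names and port:

```python
# Minimal sketch: expose per-kernel metrics for a Prometheus scraper.
# Metric names and the port are illustrative.
import time
from prometheus_client import Counter, Gauge, start_http_server

CELL_EXECUTIONS = Counter("livedocs_cell_executions_total",
                          "Cells executed by this kernel")
KERNEL_UP = Gauge("livedocs_kernel_up", "1 while the kernel process is healthy")

start_http_server(9090)     # serves /metrics
KERNEL_UP.set(1)
while True:
    time.sleep(5)           # real code would increment CELL_EXECUTIONS per run
```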

Kubernetes pods are not entirely secure by default, as there are multiple attack vectors to address. Here are some of the steps we have taken to harden them:

  • Disable pod-to-pod networking (see the sketch after this list).
  • Use network policies to segment the cluster from the rest of the infrastructure.
  • Limit privileges for pods by using namespaces.
  • Employ lean Docker images to minimize vulnerabilities from base images.
  • Enforce resource quota limits to avoid potential overload or misuse.
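
As an example of the networking piece, here is roughly what a policy that blocks pod-to-pod ingress for kernel pods, while still allowing traffic from a proxy layer, could look like when created via the Kubernetes Python client. The namespace and labels are illustrative.

```python
# Illustrative only: restrict ingress to kernel pods so that only the proxy
# namespace can reach them. Namespace and label names are placeholders.
from kubernetes import client, config

config.load_kube_config()
net = client.NetworkingV1Api()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="kernel-pod-isolation", namespace="kernels"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(match_labels={"app": "kernel"}),
        policy_types=["Ingress"],
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[client.V1NetworkPolicyPeer(
                    namespace_selector=client.V1LabelSelector(
                        match_labels={"role": "proxy"}),
                )],
            )
        ],
    ),
)
net.create_namespaced_network_policy(namespace="kernels", body=policy)
```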

We plan to further enhance security by conducting regular audits, implementing seccomp profiles, and adopting additional security practices as needed.

With each layer of the stack under our control, we now have the flexibility to manage machines, monitor resources, and optimize performance effectively. However, we acknowledge that this architecture may not be the perfect solution. We anticipate learning a lot over the next few months and will continue refining and improving it.

And that’s it for this one, folks!


What’s Next?

With the new Kubernetes-based architecture in place, we’ve laid the groundwork for some of our most ambitious features yet. As we continue to scale and improve Livedocs, we’ll be able to provide even more powerful tools for data collaboration and exploration.

Want to see it in action? Try Livedocs today at livedocs.com!
