R-universe and Cloudflare

How we get fast global routing and caching

Jeroen Ooms

What is R-universe?

Platform to help users find, publish, and use R packages.

With or without CRAN.

Technical notes

  • This is not a static site (such as one built with pkgdown). All pages and CRAN-like indexes are generated on request from a database.
  • We can distinguish 3 levels of pages (for caching):
    1. Single package
    2. Repository level (packages from a user)
    3. Global pages (all packages)

Cloudflare: proxy CDN

Why Cloudflare

R-universe runs on a self-hosted server in NYC, but all traffic is routed via the Cloudflare CDN (free).

  • Lower latency; better global routing
  • CDN-level caching
  • DDoS protection, flexible compression, connection pooling, etc.

Cloudflare routing experiment

Both URLs point to the same server, and no caching is involved.

# Direct connection to NYC-3
curl -OL https://dev.opencpu.org/ubuntu-2404.iso

# Same backend server but routed via Cloudflare
curl -OL https://proxy.opencpu.org/ubuntu-2404.iso

The farther you are from NYC, the slower the first download.

Cloudflare routing, by contrast, can max out your bandwidth from anywhere.
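
The comparison above can also be scripted. The following is a minimal sketch, assuming Node.js 18 or later (for the global fetch); the helper names throughputMBps, measure and compare are illustrative and not part of R-universe. It streams only the first part of each file instead of the whole ISO.

```javascript
// Sketch: compare download throughput of two URLs served by the same backend.
// Assumes Node.js 18+ (global fetch and performance). Helper names are illustrative.

// Convert a byte count and elapsed time (ms) into megabytes per second.
function throughputMBps(bytes, elapsedMs) {
  if (elapsedMs <= 0) return 0;
  return (bytes / 1e6) / (elapsedMs / 1000);
}

// Stream at most maxBytes from url, then stop and report the observed speed.
async function measure(url, maxBytes = 50e6) {
  const controller = new AbortController();
  const start = performance.now();
  const res = await fetch(url, { signal: controller.signal });
  let bytes = 0;
  try {
    for await (const chunk of res.body) {
      bytes += chunk.byteLength;
      if (bytes >= maxBytes) break;
    }
  } finally {
    controller.abort(); // drop the connection instead of downloading the rest
  }
  return throughputMBps(bytes, performance.now() - start);
}

// Measure one host after the other so the downloads do not compete for bandwidth.
async function compare() {
  for (const host of ['dev.opencpu.org', 'proxy.opencpu.org']) {
    const speed = await measure(`https://${host}/ubuntu-2404.iso`);
    console.log(`${host}: ${speed.toFixed(1)} MB/s`);
  }
}
```

Calling compare() prints one line per host. The timing includes connection setup, so short samples understate steady-state throughput.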

Cloudflare routing

Conclusion: even without caching, clients can download from the R-universe server at high speed and low latency.

From GitHub Actions (GHA) runners we download from R-universe at 60 MB/s (about 480 Mbit/s, on the order of Gbit speed).

Application: downloading large snapshots

  • Repositories in R-universe track git branches/releases
  • Sometimes you want to mirror the repo to a private server or host a frozen copy (cf. MRAN / PPM time-travel)
  • The snapshot API gives you a zip or tgz file with the full repo.
  • Demo: https://tidyverse.r-universe.dev/apis
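
A snapshot download could be scripted roughly as follows. This is a sketch under stated assumptions: it needs Node.js 18 or later, the function names are illustrative, and the endpoint path and query parameter used in the usage note are assumptions that should be checked against the /apis page linked above.

```javascript
const fs = require('node:fs');
const { Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Build a URL for a universe. The endpoint path and parameters are supplied by
// the caller because this sketch does not know the actual snapshot interface.
function snapshotUrl(universe, endpointPath, params = {}) {
  const url = new URL(endpointPath, `https://${universe}.r-universe.dev`);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, Array.isArray(value) ? value.join(',') : String(value));
  }
  return url.toString();
}

// Stream the archive to disk without holding it in memory.
async function downloadSnapshot(url, destination) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Download failed: HTTP ${res.status}`);
  await pipeline(Readable.fromWeb(res.body), fs.createWriteStream(destination));
}
```

For example, snapshotUrl('tidyverse', '/api/snapshot/zip', { types: ['src', 'win'] }) builds a URL that can be passed to downloadSnapshot(url, 'tidyverse.zip'); both the path and the types parameter are assumed here, not taken from the slides.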

Cloudflare caching

We also make Cloudflare cache responses using Cache-Control HTTP response headers.

cache-control: public, max-age=60

Here public means the CDN may share cached responses between different clients, and max-age=60 sets the cache lifetime to 1 minute.
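
To make the two directives concrete, here is a small sketch of how a shared cache could interpret the header. It is a simplification (no quoted values, only a few directives) and the function names are illustrative, not Cloudflare's implementation.

```javascript
// Parse a Cache-Control header into an object of directives.
// Simplified: ignores quoted values.
function parseCacheControl(header) {
  const directives = {};
  for (const part of header.split(',')) {
    const [name, value] = part.trim().split('=');
    if (!name) continue; // skip empty segments such as a trailing comma
    directives[name.toLowerCase()] = value === undefined ? true : value;
  }
  return directives;
}

// A shared cache (CDN) may reuse a stored response for another client while it
// is younger than max-age seconds, unless the response is private or no-store.
function sharedCacheMayReuse(header, ageSeconds) {
  const cc = parseCacheControl(header);
  if (cc.private === true || cc['no-store'] === true) return false;
  const maxAge = Number(cc['max-age']);
  return Number.isFinite(maxAge) && ageSeconds < maxAge;
}
```

With the header above, a response stored 30 seconds ago can be reused, while one stored 61 seconds ago has to be revalidated first.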

But the big win is clever cache revalidation based on ETag or Last-Modified.

Cache Revalidation
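
The cache side of revalidation can be illustrated with a toy model. This is a sketch, not how Cloudflare is implemented: TinyCache and the synchronous origin callback are hypothetical, and time is passed in explicitly to keep the example deterministic.

```javascript
// Toy shared cache: serve fresh entries directly, revalidate stale ones by ETag.
// `origin` is a hypothetical function (headers) => { status, etag, body }.
class TinyCache {
  constructor(origin, maxAgeSeconds) {
    this.origin = origin;
    this.maxAge = maxAgeSeconds;
    this.entry = null; // { etag, body, storedAt }
  }

  get(nowSeconds) {
    const entry = this.entry;
    // Fresh: answer from the cache, the origin is never contacted.
    if (entry && nowSeconds - entry.storedAt < this.maxAge) {
      return { body: entry.body, source: 'cache' };
    }
    // Stale or missing: ask the origin, sending the validator if we have one.
    const headers = entry ? { 'if-none-match': entry.etag } : {};
    const res = this.origin(headers);
    if (res.status === 304 && entry) {
      entry.storedAt = nowSeconds; // unchanged: keep the body, restart the clock
      return { body: entry.body, source: 'revalidated' };
    }
    this.entry = { etag: res.etag, body: res.body, storedAt: nowSeconds };
    return { body: res.body, source: 'origin' };
  }
}
```

The origin only produces a full response when the ETag it computes differs from the one the cache already holds; otherwise a bodyless 304 is enough.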

Cache Revalidation (server)

// Server middleware:
// get_latest() looks up the most recently changed record for a given query
// in the database. This is cheap.
return get_latest(query).then(function(doc){
  // Caches may reuse the response for 60s, and may serve it stale for up to
  // 30s more while revalidating in the background.
  res.set('Cache-Control', 'public, max-age=60, stale-while-revalidate=30');
  if(doc){
    // Weak validator derived from the id of the latest record
    const etag = `W/"${doc._id}"`;
    const date = doc._published.toUTCString();
    res.set('ETag', etag);
    res.set('Last-Modified', date);
    if(etag === req.header('If-None-Match') || date === req.header('If-Modified-Since')){
      res.status(304).send(); // Nothing changed: done, no page is rendered
    } else {
      next(); // Proceed to routing and render the page
    }
  }
...

The query determines which R packages, when updated, could change the requested HTML page.
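
One way to picture this for the three page levels is sketched below. The field names user and package are hypothetical (only _id and _published appear in the middleware), and getLatest is an in-memory stand-in for the database lookup, not the actual get_latest().

```javascript
// Map the three page levels to a query.
// ASSUMPTION: the field names `user` and `package` are hypothetical.
function queryFor({ user, pkg } = {}) {
  if (user && pkg) return { user, package: pkg }; // 1. single package
  if (user) return { user };                      // 2. repository level
  return {};                                      // 3. global pages
}

// In-memory stand-in for get_latest(): among the records matching the query,
// return the most recently published one, or null if none match.
function getLatest(records, query) {
  const matches = records.filter((rec) =>
    Object.entries(query).every(([key, value]) => rec[key] === value));
  return matches.reduce(
    (latest, rec) => (!latest || rec._published > latest._published ? rec : latest),
    null);
}
```

In this model an update to one package changes the validator for its own page, its repository page and the global pages, while pages of unrelated packages keep returning 304.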

CDN immutable caching

The CDN server uses SHA256 content-addressed URLs:

Files downloaded from their hash are by definition immutable:
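
The idea can be sketched in a few lines using Node's built-in crypto module. The base URL, the path layout and the one-year max-age are placeholders chosen for illustration; the slides do not show the actual URL scheme or headers.

```javascript
const crypto = require('node:crypto');

// Hex-encoded SHA-256 digest of the file contents.
function sha256Hex(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// ASSUMPTION: base URL and path layout are placeholders, not the real CDN scheme.
function contentAddressedUrl(content, base = 'https://cdn.example.org') {
  return `${base}/${sha256Hex(content)}`;
}

// Headers a server could send for such a URL: the content behind a hash never
// changes, so caches can keep it for a long time without revalidating.
function immutableHeaders() {
  return { 'Cache-Control': 'public, max-age=31536000, immutable' };
}

// A client can check a download against the hash embedded in its URL.
function verify(url, content) {
  return url.endsWith('/' + sha256Hex(content));
}
```

If the file changes, its hash and therefore its URL change as well, so a cached copy can never be out of date.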

Caching summary:

  • When multiple users (anywhere) request the same file or page within 60s, the request is served directly from the Cloudflare cache and never hits the backend server.
  • Most requests that do hit the backend server can be revalidated instantly (HTTP 304) because no relevant R package(s) have been updated.
  • The server only has to regenerate pages if relevant state actually changed in the DB.
  • Even then it is still pretty fast :)