Making libcurl work in WebAssembly
TLDR: we explain how to make libcurl-based applications work in WebAssembly without changes, by tunneling all traffic over a websocket proxy.
For a quick demo, check out https://github.com/r-wasm/ws-proxy
Porting R to WebAssembly
WebR is a port of the R language and its package ecosystem to WebAssembly. Many R packages rely on well-known C/C++ libraries to do the heavy lifting, and fortunately most of these libraries can be built with emscripten without too much trouble. However, one major barrier that remains is the limited networking capability of wasm.
Data science makes heavy use of HTTP for all kinds of things, and a lot of R code directly or indirectly depends on our libcurl bindings for networking. Somewhat surprisingly, libcurl can be compiled with emscripten without needing any patches. However, by default it is not very useful because the browser runtime cannot open TCP connections, which are needed for doing pretty much anything in libcurl.
To understand why, one should realize that WebAssembly in the browser does not have sockets: all networking has to go through the JavaScript interfaces that the browser provides, like fetch. These provide far more restricted functionality than what we need for libcurl, and many things just won’t work at all with fetch (see e.g. pyodide constraints for an overview of the same problem from our Python colleagues).
However it turns out there is a relatively simple way to make existing code work in WebAssembly, by letting libcurl route all traffic over a websocket based proxy server. This solution is secure and easy to implement, but it is not very well explained in the emscripten docs, so hopefully this post helps to clear some things up.
Sockets as Websockets
The emscripten documentation on networking hints at the first part of the solution:
If you have existing TCP networking code written in C/C++ that utilizes the Posix Sockets API, by default Emscripten attempts to emulate such connections to take place over the WebSocket protocol instead. For this to work, you will need to use something like WebSockify on the server side to enable the TCP server stack to receive incoming WebSocket connections.
This terse description had me confused for a bit, but it is actually very clever.
When C code opens a TCP connection to myserver.com:1234, the wasm runtime compiled by emscripten instead tries to connect to a websocket on ws://myserver.com:1234. If the server does not expect this, the connection will probably just fail.
But the idea is that on your server, you can run a special websockify proxy which accepts the incoming HTTP connection and tunnels traffic from the websocket to the actual thing you wanted to connect to. Hence, by wrapping the socket into a websocket, and with some help from the server, emscripten has a way to establish TCP connections from the wasm sandbox to the outside world.
WS vs WSS Websockets
An important detail to be aware of is that emscripten currently defaults to ws:// (unencrypted) connections for this method. This is a bit unfortunate because most browsers nowadays silently block unencrypted connections from web apps served from an https page, due to mixed content security policies.
To make emscripten connect over wss:// (https based websockets) instead, you need to build your application with -sWEBSOCKET_URL=wss://. You can also override this at runtime in JavaScript by calling:
SOCKFS.websocketArgs.url = 'wss://';
Obviously wss:// requires that your websockify server has properly signed https certs.
Btw, you can also provide a full URL for this parameter to make emscripten open the websocket to a completely different host/port than the one requested by the client. This can be helpful to run a websockify proxy on a different host than what the client application is assuming.
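To make the URL handling described above a bit more concrete, here is a minimal sketch in JavaScript. The function name resolveWebSocketUrl and the host proxy.example.org are made up for illustration; this is not emscripten's actual code, only a simplified model of the behaviour described in this section.

```javascript
// Hypothetical helper, not part of emscripten: models how the socket
// target requested by the C code is turned into a websocket URL.
function resolveWebSocketUrl(configUrl, host, port) {
  // A bare scheme prefix means: keep the host and port the C code asked for.
  if (configUrl === 'ws://' || configUrl === 'wss://') {
    return `${configUrl}${host}:${port}`;
  }
  // Anything else is treated as a full URL that replaces the target.
  return configUrl;
}

// Default behaviour: unencrypted websocket to the requested host.
console.log(resolveWebSocketUrl('ws://', 'myserver.com', 1234));
// With the wss:// prefix: same host and port, but over TLS.
console.log(resolveWebSocketUrl('wss://', 'myserver.com', 1234));
// With a full URL: the websocket goes to a different host entirely.
console.log(resolveWebSocketUrl('wss://proxy.example.org:443/', 'myserver.com', 1234));
```

In the last case the C code still believes it is talking to myserver.com:1234, so it is up to the websockify server on the other end to forward the traffic to the right place.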
Routing Curl Traffic over a SOCKS5 Proxy
The above has limited use in itself because we can only connect to servers that run this websockify thing. But we want libcurl to connect to arbitrary HTTP services. The solution is to host a SOCKS5 proxy server with websockify in front of it, and instruct curl to route connections through this proxy server. This turns out to work very well.
What is a SOCKS5 proxy
If you have not used SOCKS proxies, an easy way to try this is using the built-in SOCKS server from your ssh client. Let’s forget about WebAssembly for one second and open an ssh session to any server (localhost or any other) using the -D flag followed by a port number:
ssh -D7777 localhost
Now while this ssh session is alive, a local SOCKS server is available on port 7777, which will route incoming traffic through your ssh server. Test this by setting an environment variable ALL_PROXY when calling curl, for example:
ALL_PROXY="socks5h://localhost:7777" curl -v https://google.com
You should see that curl does not connect to the host directly but instead performs the http request via the proxy server. We can use this ALL_PROXY variable in the same way for any libcurl based applications.
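For the curious, the sketch below shows roughly what a SOCKS5 client such as curl sends to the proxy when it is given a socks5h:// URL, following RFC 1928. The function names are hypothetical and the sketch assumes a proxy without authentication; it only builds the bytes and does not open any connection. Because of the h in socks5h, the hostname is passed to the proxy as a string, so the proxy does the DNS lookup rather than the client.

```javascript
// Hypothetical helpers that build the first two client messages of a
// SOCKS5 session (RFC 1928). Assumes the proxy requires no authentication.

// Greeting: version 5, offering one auth method: 0x00 (no authentication).
function socks5Greeting() {
  return Uint8Array.from([0x05, 0x01, 0x00]);
}

// CONNECT request with the target given as a domain name (address type 3).
function socks5ConnectRequest(hostname, port) {
  const name = new TextEncoder().encode(hostname);
  if (name.length === 0 || name.length > 255) {
    throw new RangeError('hostname must be 1 to 255 bytes');
  }
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError('port must be between 1 and 65535');
  }
  return Uint8Array.from([
    0x05,         // protocol version
    0x01,         // command: CONNECT
    0x00,         // reserved
    0x03,         // address type: domain name
    name.length,  // length of the hostname in bytes
    ...name,      // hostname, readable by the proxy
    port >> 8,    // port, high byte
    port & 0xff,  // port, low byte
  ]);
}

// Print the request for https://google.com as hex bytes.
const request = socks5ConnectRequest('google.com', 443);
console.log(Array.from(request, (b) => b.toString(16).padStart(2, '0')).join(' '));
```

Once the proxy has accepted the request, the client starts its TLS handshake over the same connection, so the proxy learns the hostname and port but only relays the encrypted traffic that follows.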
SOCKS5 Proxy over a Websocket
We can now put two and two together and use websockify to expose the SOCKS5 proxy to WebAssembly. If you run the websockify command line utility you would use:
websockify 7778 localhost:7777
So this creates a double proxy, or more specifically a reverse proxy followed by a forward proxy: websockify bridges incoming http websocket connections on port 7778 to the local SOCKS5 server running on port 7777. And this SOCKS5 server can be used by libcurl to connect to an arbitrary host on the internet.
A Production Setup
Instead of tinkering with ssh and websockify, we can use the following docker container which also includes a websockify and a SOCKS5 server:
docker run -it -p7777:7777 ghcr.io/r-wasm/ws-proxy
This container also enables authentication for the SOCKS5 server, see the Dockerfile for details on how this is done.
For my demo server below I actually run this server behind Cloudflare, which takes care of the HTTPS certs [1] and improves performance. So we actually go through 3 proxies:
      (https)               (http)               (tcp)           (https)
wasm ---------> cloudflare --------> websockify -------> SOCKS5 ---------> host
This may sound terrible, but going through Cloudflare actually improves performance rather than making it slower, because of their superb CDN.
There are many other ways of going about this; neither websockify nor SOCKS5 is very complicated and both have been implemented in many languages, so you can probably put the stack together in the language of your choice. If you end up building a very performant Rust/Go stack, please do let me know :-)
Some real world testing
The easiest way to see this in action is using the WebR test UI. For testing purposes I have set up a public demo server on a cheap devbox at https://ws.r-universe.dev. This runs the container mentioned above, with the default credentials, via CloudFlare.
Below is a small non-trivial example. If you have not used R before, just copy paste the code in the editor thingy and press run.
# Install the R package
install.packages('curl')
# Set the ws-proxy server
Sys.setenv(ALL_PROXY='socks5h://test:yolo@ws.r-universe.dev:443')
# From here everything is normal R code:
df <- read.dcf(curl::curl('https://cran.rstudio.com/src/contrib/PACKAGES'))
pkgs <- df[1:200, 'Package']
urls <- sprintf('https://cran.rstudio.com/web/packages/%s/DESCRIPTION', pkgs)
destfiles <- sprintf('~/%s.txt', pkgs)
results <- curl::multi_download(urls, destfiles, verbose = TRUE)
all(results$status_code == 200)
# Read one of the files to show it is there
list.files('~')
readLines("~/abc.txt")
What this code does is read an index file that contains the list of R packages from CRAN, and subsequently download the description files of the first 200 packages to the user home directory (which is actually a virtual filesystem in WebR).
A few things to notice:
- Downloading the 200 files is quite fast because it uses HTTP/2 multiplexing.
- The only special thing needed to make this work in WebAssembly is the line setting ALL_PROXY. Everything else is regular R code.
- If you inspect the devtools network tab of your browser, you see that everything happens over a single WebSocket to wss://ws.r-universe.dev. The browser is not making the HTTP requests; in fact, this would not even be possible because we download the files from a host that does not enable CORS.
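The last point can be worked through in a few lines. This sketch only does string handling with the standard URL class; the helper name describeProxy is made up, and the mapping from the proxy address to a wss:// URL assumes the application was built with WEBSOCKET_URL set to wss:// as described earlier.

```javascript
// Hypothetical helper: splits an ALL_PROXY value into its parts and shows
// which websocket URL the browser would open, assuming a wss:// prefix.
function describeProxy(allProxy) {
  const u = new URL(allProxy);
  // curl falls back to port 1080 for SOCKS proxies when none is given.
  const port = u.port ? Number(u.port) : 1080;
  return {
    scheme: u.protocol.replace(':', ''), // socks5h: the proxy resolves hostnames
    username: u.username,
    password: u.password,
    host: u.hostname,
    port: port,
    // libcurl opens a TCP socket to host:port, which becomes this websocket:
    websocketUrl: `wss://${u.hostname}:${port}`,
  };
}

console.log(describeProxy('socks5h://test:yolo@ws.r-universe.dev:443'));
```

So from the point of view of libcurl it is talking SOCKS5 to ws.r-universe.dev on port 443, while the browser only ever sees one websocket to that same host and port.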
For more examples, including how to do this with your own proxy server, check out https://github.com/r-wasm/ws-proxy
Final thoughts
I really like this solution because it works natively with emscripten, and requires no patching of libcurl or porting of our applications. We can make existing R code run in WebAssembly simply by setting this single environment variable, and things will just work as expected for the user.
The solution is also secure in the sense that the client does not need to accept a MITM certificate or otherwise compromise much on privacy. SOCKS5 proxies are pretty dumb, in a good sense, so they just route encrypted HTTPS connections without being able to look into them. Obviously, it does leak the host that the client is connecting to, but the contents of the HTTPS requests remain encrypted.
I guess the only thing that would make this even nicer is if Cloudflare would do the websockify part on their end, such that we can take our websockify proxy out of the equation. This should not be too difficult for them and would make it even easier to expose TCP services (proxies, databases, etc) to wasm clients.
[1] It is easiest to use the Cloudflare flexible SSL mode, such that you don’t need SSL certificates on the websockify server.