class: center, middle, inverse, title-slide

# The R Infrastructure
## How we build stuff
### Jeroen Ooms
### 2018/09/14

---
background-image: url(utrecht.jpg)
background-position: 50% 50%

# Infrastructure is never done...

---
# Hello World

About me: PhD Statistics UCLA 2014 (Jan de Leeuw, Mark Hansen).

Currently I am a postdoc at UC Berkeley with the [rOpenSci](https://ropensci.org/) group.



---
background-image: url(unconf.jpg)
background-position: 50% 50%

# The rOpenSci (extended) Family

---
# CRAN Packages

[](https://cran.r-project.org/web/checks/check_results_jeroen_at_berkeley.edu.html)

---
background-image: url(screen3.png)
background-position: 50% 50%

# Also this

---
class: inverse, center, middle

# PART I: Base Infrastructure

---
# Base Dependencies

To the R user, the dependency system looks mostly like this:



---
# Base Dependencies

However, R itself also depends on other software:



---
# Base Dependencies

So the reality is more like this:



---
class: inverse, center, middle

# What are all these libraries used for?

---
# BLAS / LAPACK: Linear Algebra

Most statistical methods involve matrix calculations (QR, Cholesky, SVD, etc). R uses high performance BLAS / LAPACK routines for linear algebra.



---
# BLAS / LAPACK: Linear Algebra

For example, when R calculates `\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}\)`:

```r
# define X matrix and y vector
X <- as.matrix(cbind(1,cars$speed))
y <- as.matrix(cars$dist)
solve(t(X) %*% X) %*% t(X) %*% y
```

```
           [,1]
[1,] -17.579095
[2,]   3.932409
```

```r
# As done by lm()
coef(lm(dist~speed, data = cars))
```

```
(Intercept)       speed 
 -17.579095    3.932409 
```

---
# LIBCURL: Networking

R uses libcurl for downloading files over FTP/HTTP/HTTPS. This functionality is used in e.g. `download.file()` and `install.packages()`.



---
# LIBCURL: Networking

Note that HTTPS uses an SSL connection, which requires encryption support.
```r
install.packages("MASS", repos = 'https://cloud.r-project.org')
```

```
trying URL 'https://cloud.r-project.org/bin/macosx/el-capitan/contrib/3.5/MASS_7.3-50.tgz'
Content type 'application/x-gzip' length 1163764 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
```

---
# LIBICU: Text and Encoding

ICU (International Components for Unicode) is used for converting text encodings and for string comparison.



---
# PCRE: Regular Expressions

R exposes several regular expression functions such as `grep` and `regexpr`, but also uses regular expressions internally.



---
# LIBICU: Text and Encoding

This should yield the same results as in other languages:

```r
validate_ip_address <- function(x){
  ip_addr_regex <- "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"
  grepl(ip_addr_regex, x)
}
validate_ip_address(c("127.0.0.1", "1.1.1.1", "1000.1.1.1"))
```

```
[1]  TRUE  TRUE FALSE
```

And results need to be consistent across platforms and locales:

```r
authors <- c("Hadley Wickham", "Gábor Csárdi", "谢益辉")
enc2native("Gábor Csárdi") == authors
```

```
[1] FALSE  TRUE FALSE
```

```r
grep("益", authors, value = TRUE)
```

```
[1] "谢益辉"
```

---
# CAIRO: graphical rendering

Cairo is used to render graphics, i.e. to convert the shapes, attributes and text from the R graphics device into a bitmap image.



---
# FONTCONFIG: (via cairo) Finding Fonts

Want your plot axis to show sans-serif italic labels? Fontconfig finds the appropriate font that is available on your system.



---
# FREETYPE: (via cairo) Rendering Text

Freetype then combines the text with font data (font, size, style) to render the actual figures (glyph images) that form the readable characters in your graphic.



---
# LIBPNG, LIBJPEG, LIBTIFF

A bitmap is merely a matrix of pixels. Additional libraries are needed to export the bitmap to various image formats that other software will understand.


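---
# Checking Your Own Build

The libraries from the previous slides can be inspected from within R. A minimal sketch using base R helpers; the variable names are only for illustration, and the output differs per platform and R build:

```r
# Which optional features was this R build compiled with?
features <- c("png", "jpeg", "tiff", "cairo", "libcurl", "ICU")
caps <- capabilities(features)
print(caps)

# Versions of system libraries used by base R (PCRE, ICU, BLAS, ...)
soft <- extSoftVersion()
print(soft)

# Versions of the graphics libraries (cairo, libpng, jpeg, libtiff)
print(grSoftVersion())

# Version of libcurl used for downloads, and of LAPACK for linear algebra
print(libcurlVersion())
print(La_version())
```

A `FALSE` entry in `caps` indicates that this build of R lacks the corresponding feature.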
---
# CAIRO, FREETYPE, FONTCONFIG, LIBPNG

Think for a second about the calculations required to determine the color of each pixel in a bitmap image based on a few simple shapes.

```r
plot.new()
plot.window(xlim = c(0, 100), ylim = c(0, 100))
polygon(c(10, 40, 80), c(10, 80, 40), col = 'hotpink')
text(40, 90, labels = 'My drawing', col = 'navyblue', cex = 3, family = "Times")
symbols(c(70, 80, 90), c(20, 50, 80), circles = c(10, 20, 10),
  bg = c('yellow', 'orange', 'red'), add = TRUE, lty = 'dashed')
```

<!-- -->

---
# Why External Libraries

R relies on external libraries to do the heavy lifting. This is great because these libs are:

- Widely used
- Portable (work on all systems)
- Performant
- Thoroughly tested
- Well maintained
- Free

It would be impossible to implement all this functionality ourselves in R.

---
class: inverse, center, middle

# I don't remember installing these things?

---
# Static vs Dynamic Linking

The way these libraries are installed depends on your operating system.

On operating systems that install R via a package manager (Linux, Homebrew), R dynamically links to the shared libraries. The package manager automatically installs the dependencies when the user installs R.

<table>
  <tr>
    <th></th>
    <th>Native Compiler</th>
    <th>Native Package Manager</th>
    <th>Linking</th>
  </tr>
  <tr>
    <th>Linux</th>
    <td>yes</td>
    <td>yes</td>
    <td>dynamic</td>
  </tr>
  <tr>
    <th>MacOS</th>
    <td>yes</td>
    <td>no</td>
    <td>static</td>
  </tr>
  <tr>
    <th>Windows</th>
    <td>no</td>
    <td>no</td>
    <td>static</td>
  </tr>
</table>

On systems that do not have a native package manager (Windows, MacOS), we have to statically link the libs into the R binaries that we ship in the installer.

---
# Dynamic Linking



---
# Dynamic Linking



---
# Static Linking

With static linking, external libraries get embedded into the binaries (`R.dll` in this case).



---
# Building R for Windows

Windows has neither a native compiler nor a native package manager. To build R for Windows:

--

1. 
Install a build environment with a compiler (e.g. Rtools)

--

2. Build the required libraries with this compiler

--

3. Build base R with this compiler and statically link it to the libs

--

4. To build R packages, we need the same compiler, the same R, and the same libs.

---
# Building R for Windows

The scripts and libs used to build base R for Windows and the installer are open source:



---
# Building R for Windows

The readme explains how you can build R locally. Or you can look at the script.



---
# Building R for Windows

The script runs every night on AppVeyor and, on success, the installers get uploaded to CRAN.



---
class: inverse, center, middle

# PART II: CRAN and Scalability

---
# Building Packages

Just like R itself, many R packages take advantage of external libraries. A few of the older CRAN packages that use external libraries:

<table>
  <tr>
    <th>R Package</th>
    <th>Required libs</th>
    <th>CRAN release</th>
  </tr>
  <tr>
    <td>RMySQL</td>
    <td>libmysqlclient</td>
    <td>2000</td>
  </tr>
  <tr>
    <td>XML</td>
    <td>libxml2</td>
    <td>2000</td>
  </tr>
  <tr>
    <td>RCurl</td>
    <td>libcurl</td>
    <td>2004</td>
  </tr>
  <tr>
    <td>gmp</td>
    <td>gmp</td>
    <td>2004</td>
  </tr>
  <tr>
    <td>Rmpfr</td>
    <td>libmpfr</td>
    <td>2009</td>
  </tr>
</table>

--

On Linux, the user has to install the required libraries manually when installing the R package.

--

For Windows and MacOS, CRAN (or any other repo) can build so-called binary packages that include the __statically linked external library__, just like we did for base R.

---
# Building Packages

Statically linked binary packages make installing R packages easy on Windows / MacOS:



---
# Building Packages

But: it is not trivial to make this work.

--

Somebody has to build (and occasionally update) the library and its dependencies using the same compiler and flags for static linking with the R package.

--

Building even a single library can be a lot of work. Unfortunately these libraries are also becoming increasingly complex and interdependent.
--  --- # CRAN growing On top of that, the number of packages on CRAN is growing rapidly:  Many of packages require one or more external libraries... --- # The rwinlib Organization The Github organization 'rwinlib' is an archive of static libraries for libs used by numerous CRAN packages that were built with Rtools on Windows.  --- # Database Drivers Most databases have specialized clients libraries: <table> <tr> <th>R Package</th> <th>Required libs</th> </tr> <tr> <td>RMySQL, RMariaDB</td> <td>libmariadb + openssl</td> </tr> <tr> <td>RPostGres, RPostgreSQL</td> <td>libpq + openssl</td> </tr> <tr> <td>RODBC, odbc</td> <td>unixodbc</td> <tr> <td>redux, rredis, RcppRedis</td> <td>hiredis</td> </tr> </tr> <tr> <td>mongolite</td> <td>mongo-c-driver + openssl + libsasl</td> </tr> </table> --- # GDAL: spatial abstraction library A complex example is GDAL (Geospatial Data Abstraction Library) that can read and write 100+ different spatial data formats (think maps and sattelite images).  --- # GDAL: spatial abstraction library The current rwinlib GDAL2 stack depends on no less than 35 additional driver libraries! It is used to build the R binary packages for `sf`, `rgdal`, and `rgeos` on Windows.  --- # GDAL: spatial abstraction library Package authors sometimes request extra features from the libraries:  --- # GDAL: spatial abstraction library And now R users on Windows can access open access EU sattelite images in `sf` and `rgdal`!  --- # Imaging, Graphics, and Vision Another example: At rOpenSci we are developing on a suite of packages to expose high quality images libraries in R across various applications and fields: - Spatial (as seen before) - Medical (MRI) - Graphics and post processing - Vision - OCR - Animation and Video - Rendering pdf, svg All of these tools use high quality open source libraries. We provide the R interfaces. 
---
# OCR (TESSERACT)

```r
library(magick)
image_read("https://jeroen.github.io/images/receipt.png") %>%
  image_resize('50%')
```

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="113" />

```r
library(tesseract)
numbers <- tesseract(options = list(tessedit_char_whitelist = "$.0123456789"))
text <- ocr("https://jeroen.github.io/images/receipt.png", engine = numbers)
cat(text)
```

```
$90.52 $81.52 $9.00 $90.52
```

---
# Vision (OPENCV)

OpenCV has built-in filters for detecting human shapes...



---
# Vision (OPENCV)

Or faces:



---
# Animated Graphics

```r
library(gganimate)
p <- ggplot(airquality, aes(Day, Temp)) +
  geom_line(size = 2, colour = 'steelblue') +
  transition_states(Month, 4, 1) +
  shadow_mark(size = 1, colour = 'grey')
animate(p, fps = 25, width = 800, height = 350)
```

<!-- -->

---
class: inverse, center, middle

# PART III: Ongoing Developments

---
background-image: url(road.jpg)
background-position: 50% 50%

# Infrastructural Work

---
# RTOOLS 40

We are currently beta-testing a new version of Rtools that includes a full build environment and package manager.



---
# RTOOLS 40

This will make it easier to build, distribute and install external libraries on Windows.



---
# RTOOLS 40

The system will also make it possible to automate building libs on AppVeyor. This makes things more transparent, maintainable and reproducible.



---
# RTOOLS 40

A beta version of Rtools 40 and a version of R configured for it are available from CRAN: https://cloud.r-project.org/bin/windows/testing/rtools40.html



---
# RHUB

R-hub is a service for building and checking R packages. Part of the project is indexing the system requirements (including libraries) for R packages, and exposing this index via an API:



---
# RHUB

R-hub uses this to automatically install the correct software and libraries on each of the supported operating systems, before building the R package.


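---
# RHUB

The R-hub API itself is not shown here. As a rough local analogue, and assuming that the free-text `SystemRequirements` field in each package's DESCRIPTION file is the raw input for such an index, the field can be listed for the packages installed on your machine:

```r
# Read the SystemRequirements field of all locally installed packages
pkgs <- installed.packages(fields = "SystemRequirements")
sysreqs <- pkgs[, "SystemRequirements"]

# Keep only the packages that declare a system requirement
sysreqs <- sysreqs[!is.na(sysreqs)]
head(sysreqs)
```

Because the field is free text, it cannot be passed to a system package manager directly.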
---
# VISION

The overall idea is to create an infrastructure that can support increasingly complex R packages with powerful system libraries, while reducing the maintenance work for the repository maintainers.

---
background-image: url(artist.jpg)
background-size: cover

# Artist Impression