class: center, middle, inverse, title-slide

# The R Infrastructure
## How we build stuff
### Jeroen Ooms
### 2018/09/14

---
background-image: url(utrecht.jpg)
background-position: 50% 50%

# Infrastructure is never done...

---

# Hello World

About me: PhD Statistics UCLA 2014 (Jan de Leeuw, Mark Hansen). Currently I am a postdoc at UC Berkeley with the [rOpenSci](https://ropensci.org/) group.

![team](team.png)

---
background-image: url(unconf.jpg)
background-position: 50% 50%

# The rOpenSci (extended) Family

---

# CRAN Packages

[![pkgscreen](packages2.png)](https://cran.r-project.org/web/checks/check_results_jeroen_at_berkeley.edu.html)

---
background-image: url(screen3.png)
background-position: 50% 50%

# Also this

---
class: inverse, center, middle

# PART I: Base Infrastructure

---

# Base Dependencies

To the R user, the dependency system looks mostly like this:

![diagram](diagram1.png)

---

# Base Dependencies

However, R itself also depends on other software:

![depends](depends.png)

---

# Base Dependencies

So the reality is more like this:

![diagram2](diagram2.png)

---
class: inverse, center, middle

# What are all these libraries used for?

---

# BLAS / LAPACK: Linear Algebra

Most statistical methods involve matrix calculations (QR, Cholesky, SVD, etc). R uses high performance BLAS / LAPACK routines for linear algebra.

![blas](blas.png)

---

# BLAS / LAPACK: Linear Algebra

For example, when R calculates the least squares estimate `\(\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y}\)`:

```r
# define X matrix and y vector
X <- as.matrix(cbind(1, cars$speed))
y <- as.matrix(cars$dist)
solve(t(X) %*% X) %*% t(X) %*% y
```

```
           [,1]
[1,] -17.579095
[2,]   3.932409
```

```r
# As done by lm()
coef(lm(dist ~ speed, data = cars))
```

```
(Intercept)       speed 
 -17.579095    3.932409 
```

---

# LIBCURL: Networking

R uses libcurl for downloading files over FTP/HTTP/HTTPS. This functionality is used in e.g.
`download.file()` and `install.packages()`.

![libcurl](libcurl.png)

---

# LIBCURL: Networking

Note that HTTPS uses an SSL connection, which requires encryption support.

```r
install.packages("MASS", repos = 'https://cloud.r-project.org')
```

```
trying URL 'https://cloud.r-project.org/bin/macosx/el-capitan/contrib/3.5/MASS_7.3-50.tgz'
Content type 'application/x-gzip' length 1163764 bytes (1.1 MB)
==================================================
downloaded 1.1 MB
```

---

# LIBICU: Text and Encoding

ICU (International Components for Unicode) is used for converting text encodings and for string comparison.

![icu](icu.png)

---

# PCRE: Regular Expressions

R exposes several regular expression functions such as `grep` and `regexpr`, but also uses regular expressions internally.

![pcre](pcre.png)

---

# LIBICU: Text and Encoding

This should yield the same results as in other languages:

```r
validate_ip_address <- function(x){
  # four dot-separated groups of 1 to 3 digits, delimited by word boundaries
  ip_addr_regex <- "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b"
  grepl(ip_addr_regex, x)
}
validate_ip_address(c("127.0.0.1", "1.1.1.1", "1000.1.1.1"))
```

```
[1]  TRUE  TRUE FALSE
```

And results need to be consistent across platforms and locales:

```r
authors <- c("Hadley Wickham", "Gábor Csárdi", "谢益辉")
enc2native("Gábor Csárdi") == authors
```

```
[1] FALSE  TRUE FALSE
```

```r
grep("益", authors, value = TRUE)
```

```
[1] "谢益辉"
```

---

# CAIRO: graphical rendering

Cairo is used to render graphics, i.e. to convert the shapes, attributes and text from the R graphics device into a bitmap image.

![cairo](cairo.png)

---

# FONTCONFIG: (via cairo) Finding Fonts

Want your plot axis to show sans-serif italic labels? Fontconfig finds the appropriate font that is available on your system.

![fontconfig](fontconfig.png)

---

# FREETYPE: (via cairo) Rendering Text

Freetype then combines the text with font data (font, size, style) to render the actual figures (glyph images) that form the readable characters in your graphic.
![freetype](freetype.png)

---

# LIBPNG, LIBJPEG, LIBTIFF

A bitmap is merely a matrix of pixels. Additional libraries are needed to export the bitmap to various image formats that other software will understand.

![libpng](libpng.png)

---

# CAIRO, FREETYPE, FONTCONFIG, LIBPNG

Think for a second about the calculations required to determine the color of each pixel in a bitmap image based on a few simple shapes.

```r
plot.new()
plot.window(xlim = c(0, 100), ylim = c(0, 100))
polygon(c(10, 40, 80), c(10, 80, 40), col = 'hotpink')
text(40, 90, labels = 'My drawing', col = 'navyblue', cex = 3, family = "Times")
symbols(c(70, 80, 90), c(20, 50, 80), circles = c(10, 20, 10),
  bg = c('yellow', 'orange', 'red'), add = TRUE, lty = 'dashed')
```

![](index_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# Why External Libraries

R relies on external libraries to do the heavy lifting. This is great because these libs are:

 - Widely used
 - Portable (work on all systems)
 - Performant
 - Thoroughly tested
 - Well maintained
 - Free

It would be impossible to implement all this functionality ourselves in R.

---
class: inverse, center, middle

# I don't remember installing these things?

---

# Static vs Dynamic Linking

The way these libraries are installed depends on your operating system.

On operating systems that install R via a package manager (Linux, Homebrew), R dynamically links to the shared libraries. The package manager automatically installs the dependencies when the user installs R.

<table>
<tr>
  <th></th>
  <th>Native Compiler</th>
  <th>Native Package Manager</th>
  <th>Linking</th>
</tr>
<tr>
  <th>Linux</th>
  <td>yes</td>
  <td>yes</td>
  <td>dynamic</td>
</tr>
<tr>
  <th>MacOS</th>
  <td>yes</td>
  <td>no</td>
  <td>static</td>
</tr>
<tr>
  <th>Windows</th>
  <td>no</td>
  <td>no</td>
  <td>static</td>
</tr>
</table>

On systems that do not have a native package manager (Windows, MacOS), we have to statically link the libs into the R binaries that we ship in the installer.
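---

# Static vs Dynamic Linking

Whichever way the libraries were linked, you can ask a running R session which ones it carries. The snippet below is a small sketch using base R helpers only; the values differ per platform and build, and these calls report versions rather than how the linking was done.

```r
# Versions of external libraries used by this build of R
libs <- extSoftVersion()
libs[c("PCRE", "ICU", "BLAS")]

# LAPACK version used for linear algebra
lapack <- La_version()
lapack

# libcurl version; the SSL backend is stored as an attribute
curl <- libcurlVersion()
attr(curl, "ssl_version")

# Graphics stack: cairo, freetype, libpng, jpeg, libtiff (as available)
grSoftVersion()
```

An empty string for an entry usually means this build of R was compiled without that library.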
---

# Dynamic Linking

![apt1](apt1.png)

---

# Dynamic Linking

![apt2](apt2.png)

---

# Static Linking

With static linking, external libraries get embedded into the binaries (`R.dll` in this case).

![rbig](rbig.png)

---

# Building R for Windows

Windows has neither a native compiler nor a native package manager. To build R for Windows:

--

1. Install a build environment with a compiler (e.g. Rtools)

--

2. Build the required libraries with this compiler

--

3. Build base R with this compiler and statically link to the libs

--

4. To build R packages, we need the same compiler, the same R, and the same libs.

---

# Building R for Windows

The scripts and libs used to build base R for Windows and the installer are open source:

![base1](base1.png)

---

# Building R for Windows

The readme explains how you can build R locally. Or you can look at the script.

![base2](base2.png)

---

# Building R for Windows

The script runs every night on AppVeyor and, on success, the installers get uploaded to CRAN.

![base3](base3.png)

---
class: inverse, center, middle

# PART II: CRAN and Scalability

---

# Building Packages

Just like R itself, many R packages take advantage of external libraries. A few of the older CRAN packages that use external libraries:

<table>
<tr>
  <th>R Package</th>
  <th>Required libs</th>
  <th>CRAN release</th>
</tr>
<tr>
  <td>RMySQL</td>
  <td>libmysqlclient</td>
  <td>2000</td>
</tr>
<tr>
  <td>XML</td>
  <td>libxml2</td>
  <td>2000</td>
</tr>
<tr>
  <td>RCurl</td>
  <td>libcurl</td>
  <td>2004</td>
</tr>
<tr>
  <td>gmp</td>
  <td>gmp</td>
  <td>2004</td>
</tr>
<tr>
  <td>Rmpfr</td>
  <td>libmpfr</td>
  <td>2009</td>
</tr>
</table>

--

On Linux, the user has to install the required libraries manually when installing the R package.

--

For Windows and MacOS, CRAN (or any other repo) can build so-called binary packages that include the __statically linked external library__, just like we did for base R.
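---

# Building Packages

Packages usually announce the external libraries they need in the `SystemRequirements` field of their `DESCRIPTION` file. The sketch below parses a made-up `DESCRIPTION`; the package name `mypkg` and its field values are hypothetical and only serve as an illustration.

```r
# A made-up DESCRIPTION file for a hypothetical package 'mypkg'
description <- c(
  "Package: mypkg",
  "Version: 0.1.0",
  "Title: Hypothetical Bindings to libxml2",
  "SystemRequirements: libxml2 (>= 2.9.0)"
)

# read.dcf() parses the "field: value" format used by DESCRIPTION files
con <- textConnection(description)
fields <- read.dcf(con)
close(con)

# Extract the free-text system requirements of the package
sysreqs <- unname(fields[1, "SystemRequirements"])
sysreqs
```

The field is free text, so a person (or a curated index) still has to map it to the right system package on each platform.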
---

# Building Packages

Statically linked binary packages make installing R packages easy on Windows / MacOS:

![xml2](xml2.png)

---

# Building Packages

But: it is not trivial to make this work.

--

Somebody has to build (and occasionally update) the library and its dependencies using the same compiler and flags for static linking with the R package.

--

Building even a single library can be a lot of work. Unfortunately these libraries are also becoming increasingly complex and interdependent.

--

![fortran](fortran.jpg)

---

# CRAN growing

On top of that, the number of packages on CRAN is growing rapidly:

![cran](cran2.png)

Many of these packages require one or more external libraries...

---

# The rwinlib Organization

The GitHub organization 'rwinlib' archives static builds of the libraries used by numerous CRAN packages, compiled with Rtools on Windows.

![rwinlib](rwinlib.png)

---

# Database Drivers

Most databases have specialized client libraries:

<table>
<tr>
  <th>R Package</th>
  <th>Required libs</th>
</tr>
<tr>
  <td>RMySQL, RMariaDB</td>
  <td>libmariadb + openssl</td>
</tr>
<tr>
  <td>RPostgres, RPostgreSQL</td>
  <td>libpq + openssl</td>
</tr>
<tr>
  <td>RODBC, odbc</td>
  <td>unixodbc</td>
</tr>
<tr>
  <td>redux, rredis, RcppRedis</td>
  <td>hiredis</td>
</tr>
<tr>
  <td>mongolite</td>
  <td>mongo-c-driver + openssl + libsasl</td>
</tr>
</table>

---

# GDAL: spatial abstraction library

A complex example is GDAL (Geospatial Data Abstraction Library), which can read and write 100+ different spatial data formats (think maps and satellite images).

![gdal](gdal.png)

---

# GDAL: spatial abstraction library

The current rwinlib GDAL2 stack depends on no fewer than 35 additional driver libraries! It is used to build the R binary packages for `sf`, `rgdal`, and `rgeos` on Windows.
![gdallibs](gdallibs.png)

---

# GDAL: spatial abstraction library

Package authors sometimes request extra features from the libraries:

![gdaledzer](gdaledzer.png)

---

# GDAL: spatial abstraction library

And now R users on Windows can read open-access EU satellite images in `sf` and `rgdal`!

![tweet](tweet.png)

---

# Imaging, Graphics, and Vision

Another example: at rOpenSci we are developing a suite of packages that expose high quality image libraries in R across various applications and fields:

 - Spatial (as seen before)
 - Medical (MRI)
 - Graphics and post processing
 - Vision
 - OCR
 - Animation and Video
 - Rendering pdf, svg

All of these tools use high quality open source libraries. We provide the R interfaces.

---

# OCR (TESSERACT)

```r
library(magick)
image_read("https://jeroen.github.io/images/receipt.png") %>%
  image_resize('50%')
```

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="113" />

```r
library(tesseract)
# restrict recognition to the characters that can appear in a price
numbers <- tesseract(options = list(tessedit_char_whitelist = "$.0123456789"))
text <- ocr("https://jeroen.github.io/images/receipt.png", engine = numbers)
cat(text)
```

```
$90.52
$81.52
$9.00
$90.52
```

---

# Vision (OPENCV)

OpenCV has built-in filters for detecting human shapes...

![vision1](opencv1.jpg)

---

# Vision (OPENCV)

Or faces:

![vision2](opencv2.jpg)

---

# Animated Graphics

```r
library(gganimate)
p <- ggplot(airquality, aes(Day, Temp)) +
  geom_line(size = 2, colour = 'steelblue') +
  transition_states(Month, 4, 1) +
  shadow_mark(size = 1, colour = 'grey')
animate(p, fps = 25, width = 800, height = 350)
```

![](index_files/figure-html/unnamed-chunk-8-1.gif)<!-- -->

---
class: inverse, center, middle

# PART III: Ongoing Developments

---
background-image: url(road.jpg)
background-position: 50% 50%

# Infrastructural Work

---

# RTOOLS 40

We are currently beta-testing a new version of Rtools that includes a full build environment and package manager.
![rtools40](rtools40.png)

---

# RTOOLS 40

This will make it easier to build, distribute and install external libraries on Windows.

![rtools-packages](rtools-packages.png)

---

# RTOOLS 40

The system will also make it possible to automate building libs on AppVeyor. This makes things more transparent, maintainable and reproducible.

![rtoolsav](rtoolsav.png)

---

# RTOOLS 40

A beta version of Rtools 40 and a build of R configured for it are available from CRAN: https://cloud.r-project.org/bin/windows/testing/rtools40.html

![rtoolscran](rtoolscran.png)

---

# RHUB

R-hub is a service for building and checking R packages. Part of the project is indexing the system requirements (including libraries) of R packages, and exposing this index via an API:

![sysreqdb](sysreqdb.png)

---

# RHUB

R-hub uses this to automatically install the correct software and libraries on each of the supported operating systems, before building the R package.

![rhubbuilders](rhubbuilders.png)

---

# VISION

The overall idea is to create an infrastructure that can support increasingly complex R packages with powerful system libraries, while reducing the maintenance work for repository maintainers.

---
background-image: url(artist.jpg)
background-size: cover

# Artist Impression