Chapter 8 GridFS

A recent addition to mongolite is support for GridFS. A GridFS is a special type of collection for storing binary data such as files. To the user, a GridFS looks very much like a key-value store which supports potentially very large values.


8.1 Connecting to GridFS

The GridFS API is different from that of normal Mongo collections. Connecting works much like the regular mongo() function, with one important difference: instead of specifying a collection name, we set a name prefix:

library(mongolite)
fs <- gridfs(db = "test", prefix = "fs")

Under the hood, GridFS stores files in two collections. In the case of prefix = "fs":

fs.chunks stores the binary chunks.
fs.files stores the file metadata.

Hence the prefix identifies the pair of .chunks and .files collections that together make up one GridFS.
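As a small illustration of this naming scheme, the sketch below derives the two collection names from a prefix. The function gridfs_collections() is a hypothetical helper written for this example and is not part of mongolite; the commented lines assume a MongoDB server is running locally.

```r
# Derive the two collection names that back a GridFS with a given prefix.
# gridfs_collections() is a hypothetical helper, not a mongolite function.
gridfs_collections <- function(prefix = "fs") {
  c(files = paste0(prefix, ".files"), chunks = paste0(prefix, ".chunks"))
}

cols <- gridfs_collections("fs")
print(cols)

# With a MongoDB server running, the metadata collection could then be
# opened as an ordinary collection, for example:
# meta <- mongolite::mongo(collection = cols[["files"]], db = "test")
# meta$count()
```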

8.2 GridFS Methods

To get an overview of available methods, print the gridfs object. The methods are described in the ?gridfs manual page.

print(fs)

#> <gridfs> 
#>  $disconnect(gc = TRUE) 
#>  $download(name, path = ".") 
#>  $drop() 
#>  $find(filter = "{}", options = "{}") 
#>  $read(name, con = NULL, progress = TRUE) 
#>  $remove(name) 
#>  $upload(path, name = basename(path), content_type = NULL, metadata = NULL) 
#>  $write(con, name, content_type = NULL, metadata = NULL, progress = TRUE)

The basic API supports the following operations:

Write (upload) new files into GridFS
Read (download) files from GridFS
Find (list) files in the GridFS
Delete a file

Note that updating (modifying) files in GridFS is currently unsupported: uploading a file with the same name will generate a new file.
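Since files cannot be modified in place, one way to emulate an update is to write the new content and then remove the older files of the same name by id. The following is only a sketch under that assumption: gridfs_replace() is a hypothetical helper, not a mongolite function, and the two steps are not atomic.

```r
# Hypothetical helper: emulate an update by writing the new content first and
# then removing the files that were previously stored under the same name.
# `fs` is a gridfs object; extra arguments in `...` are passed to fs$write().
gridfs_replace <- function(fs, con, name, ...) {
  files <- fs$find()
  # ids of the files currently stored under this name (possibly none)
  old_ids <- files$id[which(files$name == name)]
  res <- fs$write(con, name, ...)
  # remove the old versions by id, so the new file is left untouched
  for (id in old_ids) {
    fs$remove(paste0("id:", id))
  }
  res
}

# Possible usage with a live connection `fs`:
# gridfs_replace(fs, serialize(mtcars, NULL), "mtcars", progress = FALSE)
```

Writing before removing means that a failed upload leaves the old file in place, at the cost of briefly having two files with the same name.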

8.3 Read / Write

The fs$read() and fs$write() methods stream data to/from an R connection, such as a file, socket or URL. This is the recommended way to send/receive data with GridFS: you get a progress counter and the transfer can be interrupted if needed.

Both read() and write() methods have a con argument that specifies the input or output connection. You can also pass a string, which is treated as a file path. Here we stream a file from a URL:

# Stream data from a URL into GridFS
con <- url('https://cloud.r-project.org/src/base/R-3/R-3.5.1.tar.gz')
fs$write(con, 'source.tar.gz', progress = FALSE, 
         metadata = '{"This is" : "just a test"}')

#> List of 6
#>  $ id      : chr "5b6ca17a47a302fe131620dc"
#>  $ name    : chr "source.tar.gz"
#>  $ size    : num 29812849
#>  $ date    : POSIXct[1:1], format: "2018-08-09 22:18:02"
#>  $ type    : chr NA
#>  $ metadata: chr "{ \"This is\" : \"just a test\" }"

In addition, for fs$write() you can set con to a raw vector with data to upload.

buf <- serialize(mtcars, NULL)
fs$write(buf, "mtcars", progress = FALSE)

#> List of 6
#>  $ id      : chr "5b6ca17d47a302fe131620dd"
#>  $ name    : chr "mtcars"
#>  $ size    : num 3798
#>  $ date    : POSIXct[1:1], format: "2018-08-09 22:18:05"
#>  $ type    : chr NA
#>  $ metadata: chr NA

The fs$find() method shows a list of files:

# List files in the GridFS
fs$find()

#>                         id          name     size                date type                      metadata
#> 1 5b6ca17a47a302fe131620dc source.tar.gz 29812849 2018-08-09 22:18:02 <NA> { "This is" : "just a test" }
#> 2 5b6ca17d47a302fe131620dd        mtcars     3798 2018-08-09 22:18:05 <NA>                          <NA>

We can read the data to disk using fs$read():

# Stream the file from GridFS to disk
out <- fs$read('source.tar.gz', file('source.tar.gz'), progress = FALSE)
file.info('source.tar.gz')$size

#> [1] 29812849

# Cleanup
unlink('source.tar.gz')

For fs$read() you can set con to NULL in which case the file will be buffered in memory and returned as a raw vector. This is useful for e.g. unserializing R objects.

out <- fs$read('mtcars', con = NULL, progress = FALSE)
df <- unserialize(out$data)
head(df)

#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
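The serialize-and-write pattern shown above can be wrapped in two small functions. This is a minimal sketch: gridfs_save_object() and gridfs_load_object() are hypothetical names, not part of mongolite, and they only combine the fs$write() and fs$read() calls already used in this section.

```r
# Hypothetical convenience wrappers around fs$write() and fs$read();
# they are not part of mongolite.
gridfs_save_object <- function(fs, object, name) {
  # serialize() with connection = NULL returns a raw vector
  fs$write(serialize(object, NULL), name, progress = FALSE)
}

gridfs_load_object <- function(fs, name) {
  # con = NULL buffers the file in memory; the bytes are in the data field
  out <- fs$read(name, con = NULL, progress = FALSE)
  unserialize(out$data)
}

# The serialization step itself can be checked without a server
buf <- serialize(mtcars, NULL)
stopifnot(is.raw(buf), identical(unserialize(buf), mtcars))
```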

When done, we drop the GridFS:

fs$drop()

#> [1] TRUE

8.4 Vectorized Upload/Download

The fs$upload() and fs$download() methods provide an alternative API specifically for transferring files between GridFS and your local disk. This API is vectorized, so you can transfer many files at once. However, transfers cannot be interrupted and will block R until completed. This API is therefore recommended only for uploading or downloading a large number of small files.

# Start a new GridFS and upload all files from this book
mb <- mongolite::gridfs(prefix = 'mongobook')
mb$upload(list.files(".", recursive = TRUE))

#>                          id               name     size                date             type metadata
#> 1  5b6ca17d47a302fe131620de      _bookdown.yml       89 2018-08-09 22:18:05             <NA>     <NA>
#> 2  5b6ca17d47a302fe131620df          _build.sh       80 2018-08-09 22:18:05 application/x-sh     <NA>
#> 3  5b6ca17d47a302fe131620e0        _output.yml      342 2018-08-09 22:18:05             <NA>     <NA>
#> 4  5b6ca17d47a302fe131620e1  01-connecting.Rmd     9135 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 5  5b6ca17d47a302fe131620e2     02-queries.Rmd     5998 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 6  5b6ca17d47a302fe131620e3      03-insert.Rmd     3121 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 7  5b6ca17d47a302fe131620e4      04-export.Rmd     2296 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 8  5b6ca17d47a302fe131620e5 05-calculation.Rmd     2622 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 9  5b6ca17d47a302fe131620e6    06-commands.Rmd      491 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 10 5b6ca17d47a302fe131620e7      07-gridfs.Rmd     4465 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 11 5b6ca17d47a302fe131620e8          dump.json 10986497 2018-08-09 22:18:05 application/json     <NA>
#> 12 5b6ca17d47a302fe131620e9       flights.bson 79319562 2018-08-09 22:18:05             <NA>     <NA>
#> 13 5b6ca17e47a302fe131620ea          index.Rmd     3575 2018-08-09 22:18:06  text/x-markdown     <NA>
#> 14 5b6ca17e47a302fe131620eb      fighist-1.png    48856 2018-08-09 22:18:06        image/png     <NA>
#>                                                                                      path
#> 1                               /Users/jeroen/workspace/mongobook/mongobook/_bookdown.yml
#> 2                                   /Users/jeroen/workspace/mongobook/mongobook/_build.sh
#> 3                                 /Users/jeroen/workspace/mongobook/mongobook/_output.yml
#> 4                           /Users/jeroen/workspace/mongobook/mongobook/01-connecting.Rmd
#> 5                              /Users/jeroen/workspace/mongobook/mongobook/02-queries.Rmd
#> 6                               /Users/jeroen/workspace/mongobook/mongobook/03-insert.Rmd
#> 7                               /Users/jeroen/workspace/mongobook/mongobook/04-export.Rmd
#> 8                          /Users/jeroen/workspace/mongobook/mongobook/05-calculation.Rmd
#> 9                             /Users/jeroen/workspace/mongobook/mongobook/06-commands.Rmd
#> 10                              /Users/jeroen/workspace/mongobook/mongobook/07-gridfs.Rmd
#> 11                                  /Users/jeroen/workspace/mongobook/mongobook/dump.json
#> 12                               /Users/jeroen/workspace/mongobook/mongobook/flights.bson
#> 13                                  /Users/jeroen/workspace/mongobook/mongobook/index.Rmd
#> 14  /Users/jeroen/workspace/mongobook/mongobook/mongobook_files/figure-html/fighist-1.png
#>  [ reached getOption("max.print") -- omitted 6 rows ]

In the same way we can download all files in one command:

files <- mb$find()
dir.create('outputfiles')
mb$download(files$name, 'outputfiles')

#>                          id               name     size                date             type metadata
#> 1  5b6ca17d47a302fe131620de      _bookdown.yml       89 2018-08-09 22:18:05             <NA>     <NA>
#> 2  5b6ca17d47a302fe131620df          _build.sh       80 2018-08-09 22:18:05 application/x-sh     <NA>
#> 3  5b6ca17d47a302fe131620e0        _output.yml      342 2018-08-09 22:18:05             <NA>     <NA>
#> 4  5b6ca17d47a302fe131620e1  01-connecting.Rmd     9135 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 5  5b6ca17d47a302fe131620e2     02-queries.Rmd     5998 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 6  5b6ca17d47a302fe131620e3      03-insert.Rmd     3121 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 7  5b6ca17d47a302fe131620e4      04-export.Rmd     2296 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 8  5b6ca17d47a302fe131620e5 05-calculation.Rmd     2622 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 9  5b6ca17d47a302fe131620e6    06-commands.Rmd      491 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 10 5b6ca17d47a302fe131620e7      07-gridfs.Rmd     4465 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 11 5b6ca17d47a302fe131620e8          dump.json 10986497 2018-08-09 22:18:05 application/json     <NA>
#> 12 5b6ca17d47a302fe131620e9       flights.bson 79319562 2018-08-09 22:18:05             <NA>     <NA>
#> 13 5b6ca17e47a302fe131620ea          index.Rmd     3575 2018-08-09 22:18:06  text/x-markdown     <NA>
#> 14 5b6ca17e47a302fe131620eb      fighist-1.png    48856 2018-08-09 22:18:06        image/png     <NA>
#>                              path
#> 1       outputfiles/_bookdown.yml
#> 2           outputfiles/_build.sh
#> 3         outputfiles/_output.yml
#> 4   outputfiles/01-connecting.Rmd
#> 5      outputfiles/02-queries.Rmd
#> 6       outputfiles/03-insert.Rmd
#> 7       outputfiles/04-export.Rmd
#> 8  outputfiles/05-calculation.Rmd
#> 9     outputfiles/06-commands.Rmd
#> 10      outputfiles/07-gridfs.Rmd
#> 11          outputfiles/dump.json
#> 12       outputfiles/flights.bson
#> 13          outputfiles/index.Rmd
#> 14      outputfiles/fighist-1.png
#>  [ reached getOption("max.print") -- omitted 6 rows ]

list.files('outputfiles')

#>  [1] "_bookdown.yml"      "_build.sh"          "_output.yml"        "01-connecting.Rmd"  "02-queries.Rmd"    
#>  [6] "03-insert.Rmd"      "04-export.Rmd"      "05-calculation.Rmd" "06-commands.Rmd"    "07-gridfs.Rmd"     
#> [11] "dump.json"          "fighist-1.png"      "figstats-1.png"     "flights.bson"       "index.Rmd"         
#> [16] "mongobook.Rmd"      "mongobook.Rproj"    "mtcars.json"        "style.css"          "toc.css"

This is a convenient way to store a large set of files in GridFS.
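Because the vectorized API is best suited to small files, it can help to split a file list by size before transferring. The sketch below shows one way to do that in plain R. The helper split_by_size() and its 1 MB default threshold are assumptions made for this example, not mongolite defaults.

```r
# Split a set of paths into small files (candidates for fs$upload()) and
# large files (candidates for streaming with fs$write()).
# The 1 MB default threshold is an arbitrary assumption.
split_by_size <- function(paths, threshold = 1024^2) {
  sizes <- file.size(paths)
  list(small = paths[sizes <= threshold], large = paths[sizes > threshold])
}

# Demonstration on two temporary files
demo_dir <- tempfile("demo")
dir.create(demo_dir)
writeBin(raw(10), file.path(demo_dir, "small.bin"))
writeBin(raw(2000), file.path(demo_dir, "large.bin"))
parts <- split_by_size(list.files(demo_dir, full.names = TRUE), threshold = 1000)
print(basename(parts$small))
print(basename(parts$large))

# With a live connection `fs`, the two groups could then be transferred as:
# fs$upload(parts$small)
# for (p in parts$large) fs$write(p, basename(p))

# Cleanup
unlink(demo_dir, recursive = TRUE)
```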

8.5 Select by ID

If your GridFS contains duplicate filenames, you can use their unique ID to refer to them:

allfiles <- mb$find()
head(allfiles)

#>                         id              name size                date             type metadata
#> 1 5b6ca17d47a302fe131620de     _bookdown.yml   89 2018-08-09 22:18:05             <NA>     <NA>
#> 2 5b6ca17d47a302fe131620df         _build.sh   80 2018-08-09 22:18:05 application/x-sh     <NA>
#> 3 5b6ca17d47a302fe131620e0       _output.yml  342 2018-08-09 22:18:05             <NA>     <NA>
#> 4 5b6ca17d47a302fe131620e1 01-connecting.Rmd 9135 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 5 5b6ca17d47a302fe131620e2    02-queries.Rmd 5998 2018-08-09 22:18:05  text/x-markdown     <NA>
#> 6 5b6ca17d47a302fe131620e3     03-insert.Rmd 3121 2018-08-09 22:18:05  text/x-markdown     <NA>

To select a file by its id, pass the id prefixed with “id:” in place of the name, for example:

id <- paste0("id:", allfiles$id[1])
print(id)

#> [1] "id:5b6ca17d47a302fe131620de"

mb$read(id, tempfile())

#> [{ "_id" : { "$oid" : "5b6ca17d47a302fe131620de" } }]: read 89 B (done)

#> List of 6
#>  $ id      : chr "5b6ca17d47a302fe131620de"
#>  $ name    : chr "_bookdown.yml"
#>  $ size    : num 89
#>  $ date    : POSIXct[1:1], format: "2018-08-09 22:18:05"
#>  $ type    : chr NA
#>  $ metadata: chr NA

The same works for download() and remove():

# Remove a file by id
mb$remove(id)

#>                         id          name size                date type metadata
#> 1 5b6ca17d47a302fe131620de _bookdown.yml   89 2018-08-09 22:18:05 <NA>     <NA>

#> [1] TRUE
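To find out which names are duplicated in the first place, the data frame returned by find() can be checked with base R. The sketch below uses a small hand-made data frame as a stand-in for that result, so the ids and names are invented for illustration. It then builds the corresponding “id:” selectors.

```r
# Stand-in for the data frame returned by fs$find(); the ids and names
# here are made up for illustration.
files <- data.frame(
  id = c("aaa1", "bbb2", "ccc3"),
  name = c("report.csv", "notes.txt", "report.csv"),
  stringsAsFactors = FALSE
)

# Mark every row whose name occurs more than once
is_dup <- duplicated(files$name) | duplicated(files$name, fromLast = TRUE)

# Build "id:" selectors for the duplicated files
selectors <- paste0("id:", files$id[is_dup])
print(selectors)
```

Each selector can then be passed as the name argument to read(), download() or remove().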