Chapter 2 Connecting to MongoDB

2.1 Mongo URI Format

The mongo() function initiates a connection object to a MongoDB server. For example:
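A minimal sketch of such a call is shown below. It assumes a MongoDB server listening on localhost at the default port. The URL and the wrapper name connect_mtcars are illustrative only; the collection and database names match the defaults and the output shown in this chapter.

```r
# Wrapped in a function so that nothing connects until it is called.
# The localhost URL is an assumption; replace it with your server address.
connect_mtcars <- function(url = "mongodb://localhost") {
  mongolite::mongo(collection = "mtcars", db = "test", url = url)
}

# Usage (requires a running server):
# m <- connect_mtcars()
# m   # printing the object lists its methods
```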

To get an overview of available methods, simply print the object to the terminal.

#> <Mongo collection> 'mtcars' 
#>  $aggregate(pipeline = "{}", options = "{\"allowDiskUse\":true}", handler = NULL, pagesize = 1000, iterate = FALSE) 
#>  $count(query = "{}") 
#>  $disconnect(gc = TRUE) 
#>  $distinct(key, query = "{}") 
#>  $drop() 
#>  $export(con = stdout(), bson = FALSE, query = "{}", fields = "{}", sort = "{\"_id\":1}") 
#>  $find(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0, handler = NULL, pagesize = 1000) 
#>  $import(con, bson = FALSE) 
#>  $index(add = NULL, remove = NULL) 
#>  $info() 
#>  $insert(data, pagesize = 1000, stop_on_error = TRUE, ...) 
#>  $iterate(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0) 
#>  $mapreduce(map, reduce, query = "{}", sort = "{}", limit = 0, out = NULL, scope = NULL) 
#>  $remove(query, just_one = FALSE) 
#>  $rename(name, db = NULL) 
#>  $replace(query, update = "{}", upsert = FALSE) 
#>  $run(command = "{\"ping\": 1}", simplify = TRUE) 
#>  $update(query, update = "{\"$set\":{}}", filters = NULL, upsert = FALSE, multiple = FALSE)

The R manual page for the mongo() function gives some brief descriptions as well.

The manual page tells us that mongo() supports the following arguments:

  • collection: name of the collection to connect to. Defaults to "test".
  • db: name of the database to connect to. Defaults to "test".
  • url: address of the MongoDB server in standard URI Format.
  • verbose: if TRUE, emits some extra output.
  • options: additional connection options such as SSL keys/certs.

The url parameter contains a special URI format which defines the server address and additional connection options.

mongodb://[username:password@]host1[:port1][,host2[:port2],...][/[database][?options]]

The Mongo Connection String Manual gives an overview of the connection string syntax and options. Below are the most important options for using mongolite.
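As a rough illustration of the format above, the following sketch assembles a URI from its parts. All user names, passwords, and host names are made-up placeholders. Reserved characters in the credentials (such as @ or /) need to be percent-encoded before they go into the URI.

```r
# Hypothetical placeholder values, for illustration only.
user     <- "readwrite"
password <- "p@ss/word"
hosts    <- c("db1.example.com:27017", "db2.example.com:27017")
database <- "test"

# Percent-encode reserved characters in the credentials.
enc <- function(x) utils::URLencode(x, reserved = TRUE)

uri <- sprintf("mongodb://%s:%s@%s/%s",
               enc(user), enc(password),
               paste(hosts, collapse = ","), database)
uri
```

The resulting string can then be passed as the url argument of mongo().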

2.1.1 DNS Seedlist Connection Format

New in mongolite 1.3 is support for seedlist URLs with the mongodb+srv:// prefix. This indicates that before connecting, the client should look up the actual host addresses and parameters from DNS SRV and TXT records.
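A sketch of connecting with a seedlist URL and inserting the mtcars data set (32 rows) could look as follows. The cluster host name and credentials are hypothetical placeholders, not a real deployment.

```r
# Hypothetical seedlist URL: host name and credentials are placeholders.
srv_url <- "mongodb+srv://readwrite:secret@cluster0.example.mongodb.net/test"

# Connects and inserts only when called.
insert_mtcars <- function(url = srv_url) {
  m <- mongolite::mongo("mtcars", url = url)
  m$insert(mtcars)
}
```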

#> List of 5
#>  $ nInserted  : num 32
#>  $ nMatched   : num 0
#>  $ nRemoved   : num 0
#>  $ nUpserted  : num 0
#>  $ writeErrors: list()

The DNS seedlist allows for using a short and fixed URL for clusters consisting of multiple or dynamic servers and parameters.

2.2 Authentication

MongoDB supports several authentication modes.

2.2.3 Kerberos

Note: Windows uses SSPI for Kerberos authentication, so this section does not apply to Windows.

Kerberos authentication on Linux requires installing a Kerberos client. On OS X, Kerberos is already installed by default. On Ubuntu/Debian we need:

sudo apt-get install krb5-user libsasl2-modules-gssapi-mit

Next, create or edit /etc/krb5.conf and add our server under [realms], for example:

[realms]
  LDAPTEST.10GEN.CC = {
    kdc = ldaptest.10gen.cc
    admin_server = ldaptest.10gen.cc
  }

In a terminal, run the following (this only needs to be done once):

kinit drivers@LDAPTEST.10GEN.CC
klist

We should now be able to connect in R:
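The sketch below builds a connection string for the principal and server used above. It assumes the GSSAPI mechanism is selected through the authMechanism URI option, and that the @ in the principal has to be percent-encoded so it is not mistaken for the user/host separator. The wrapper name connect_kerberos is illustrative.

```r
host      <- "ldaptest.10gen.cc"
principal <- "drivers@LDAPTEST.10GEN.CC"

# Encode the "@" inside the principal as %40.
user <- utils::URLencode(principal, reserved = TRUE)
uri  <- sprintf("mongodb://%s@%s/?authMechanism=GSSAPI", user, host)

# Connects only when called; requires a valid ticket from kinit.
connect_kerberos <- function() mongolite::mongo(url = uri)
```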

2.3 SSH Tunnel

To connect to MongoDB via an SSH tunnel, you need to set up the tunnel separately with an SSH client. For example, the mongolite manual contains this example:

Assume we want to tunnel through dev.opencpu.org which runs an SSH server on the standard port 22 with username jeroen. To initiate a tunnel from localhost:9999 to mongo.opencpu.org:43942 via the ssh server dev.opencpu.org:22, open a terminal and run:

ssh -L 9999:mongo.opencpu.org:43942 jeroen@dev.opencpu.org -vN -p22

Some relevant ssh flags:

  • -v (optional) show verbose status output
  • -f run the tunnel server in the background. Use pkill ssh to kill.
  • -p22 connect to ssh server on port 22 (default)
  • -i/some/path/id_rsa authenticate with ssh using a private key

Check man ssh for more ssh options. It is also possible to run this command directly from R:
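One way to do this, sketched here with base R's system2(), is to pass the same arguments as in the terminal command. The -f flag replaces -v so that ssh moves to the background and does not block the R session. The name start_tunnel is illustrative.

```r
# Same tunnel as above, with -f so that ssh runs in the background.
ssh_args <- c("-L", "9999:mongo.opencpu.org:43942",
              "jeroen@dev.opencpu.org", "-fN", "-p22")

# Starts the tunnel only when called; requires the ssh program on the PATH.
start_tunnel <- function() system2("ssh", ssh_args)
```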

Once the tunnel has been established, we can connect to our ssh client, which will tunnel traffic to our MongoDB server. In our example we run the ssh client on localhost port 9999:
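A sketch of such a connection follows. Only the localhost:9999 address comes from the tunnel set up above; the credentials and database name in the URL are assumptions made for illustration.

```r
# Credentials and database name are placeholders (assumptions).
tunnel_url <- "mongodb://readwrite:test@localhost:9999/jeroen_test"

# Connects through the tunnel, inserts the data, and disconnects again.
use_tunnel <- function(url = tunnel_url) {
  con <- mongolite::mongo("mtcars", url = url)
  result <- con$insert(mtcars)
  con$disconnect()
  result
}
```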

#> List of 5
#>  $ nInserted  : num 32
#>  $ nMatched   : num 0
#>  $ nRemoved   : num 0
#>  $ nUpserted  : num 0
#>  $ writeErrors: list()
#>           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells  576878 30.9    1164368 62.2         NA   892812 47.7
#> Vcells 1115296  8.6    8388608 64.0      16384  1768567 13.5

If you want to set up a tunnel client on Windows and you do not have the ssh program, you can use an SSH client such as PuTTY to set up the tunnel. See this example.

2.4 SSL options

For security reasons, SSL options cannot be configured in the URI but have to be set manually via the options parameter. The ssl_options() function shows the default values:

#> List of 6
#>  $ pem_file              : NULL
#>  $ ca_file               : NULL
#>  $ ca_dir                : NULL
#>  $ crl_file              : NULL
#>  $ allow_invalid_hostname: logi FALSE
#>  $ weak_cert_validation  : logi FALSE

You can use this function to specify connection SSL options:
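For instance, a sketch that passes a CA certificate and an optional client certificate might look like this. The argument names ca_file and pem_file are taken from the defaults listed above. The wrapper name connect_ssl and any file paths or URLs passed to it are up to the caller and purely illustrative.

```r
# Builds the SSL options and connects only when called.
connect_ssl <- function(url, ca_file, pem_file = NULL) {
  mongolite::mongo(
    url = url,
    options = mongolite::ssl_options(ca_file = ca_file, pem_file = pem_file)
  )
}

# Usage (hypothetical host and path):
# m <- connect_ssl("mongodb://db.example.com/?ssl=true", "/path/to/ca.pem")
```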

The MongoDB SSL client manual has more detailed descriptions on the various options.

2.5 Replica Options

The URI accepts a few special keys when connecting to a replica set. The connection-string manual is the canonical source for all parameters. Most users should stick with the defaults here; only specify these if you know what you are doing.

2.5.1 Read Preference

The Read Preference parameter specifies if the client should connect to the primary node (default) or a secondary node in the replica set.
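As a sketch, the read preference can be set with the readPreference key in the URI. The host names below are hypothetical placeholders.

```r
# Hypothetical replica set members.
base_url <- "mongodb://db1.example.com,db2.example.com/test"

# Ask the driver to read from a secondary node.
url <- paste0(base_url, "?readPreference=secondary")
url

# m <- mongolite::mongo(url = url)   # requires a reachable replica set
```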

2.5.2 Write Concern

The Write Concern parameter specifies the level of acknowledgement required, that is, the number of server nodes to which a write operation must have propagated. The corresponding URL parameter is the letter w.

Note that setting this parameter to 2 on a server that is not part of a replica set will result in an error when trying to write:
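A sketch that reproduces this situation, assuming a standalone server on localhost, is given below. The helper name try_write is illustrative; it returns the error message instead of stopping.

```r
# Request acknowledgement from two nodes.
url_w2 <- "mongodb://localhost/?w=2"

# Attempts a write and returns the error message if the write fails.
try_write <- function(url = url_w2) {
  m <- mongolite::mongo(url = url)
  tryCatch(m$insert('{"foo": "bar"}'),
           error = function(e) conditionMessage(e))
}
```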

#> Error: cannot use 'w' > 1 when a host is not replicated

2.5.3 Read Concern

Finally, Read Concern allows clients to choose a level of isolation for their reads from replica sets. The default value local returns the instance’s most recent data, but provides no guarantee that the data has been written to a majority of the replica set members (i.e. may be rolled back).

On the other hand, if we specify majority, the server will only return data that has been propagated to the majority of nodes.
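The following sketch sets the level through the readConcernLevel key in the URI and inserts two small records. It assumes a server on localhost; the helper names read_concern_url and demo_majority are illustrative.

```r
# Builds a URL for the requested read concern level ("local" by default).
read_concern_url <- function(level = c("local", "majority"), host = "localhost") {
  level <- match.arg(level)
  sprintf("mongodb://%s/?readConcernLevel=%s", host, level)
}

# Inserts two records, then counts and reads them back at the majority level.
demo_majority <- function() {
  m <- mongolite::mongo(url = read_concern_url("majority"))
  m$insert('{"foo": 123}')
  m$insert('{"foo": 456}')
  list(count = m$count(), data = m$find())
}
```

Calling read_concern_url("local") gives the URL for a connection at the default level.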

#> List of 6
#>  $ nInserted  : int 1
#>  $ nMatched   : int 0
#>  $ nModified  : int 0
#>  $ nRemoved   : int 0
#>  $ nUpserted  : int 0
#>  $ writeErrors: list()
#> List of 6
#>  $ nInserted  : int 1
#>  $ nMatched   : int 0
#>  $ nModified  : int 0
#>  $ nRemoved   : int 0
#>  $ nUpserted  : int 0
#>  $ writeErrors: list()

Our local single-node server never satisfies this condition. Therefore the server does not return any data that meets the majority level.

#> [1] 2
#>   foo
#> 1 123
#> 2 456

The data is definitely there, though; it just doesn’t meet the majority criterion. If we create a new connection with level local, we do get to see our data:

#> [1] 2
#>   foo
#> 1 123
#> 2 456

2.6 Global options

Finally, the mongo_options() function allows for setting global client options that span across connections. The following options are supported:

  • log_level set the mongo log level, e.g. for printing debugging information.
  • bigint_as_char set to TRUE to parse int64 numbers as strings rather than doubles (R does not support large integers natively).
  • date_as_char set to TRUE to parse dates as strings rather than date-time objects.

The default values are:

#> $log_level
#> [1] "INFO"
#> 
#> $bigint_as_char
#> [1] FALSE
#> 
#> $date_as_char
#> [1] FALSE

See the ?mongo_options manual page for more details.
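A short sketch of changing one of these options follows. It assumes that calling mongo_options() without arguments returns the current values without changing them, and that passing a named argument sets that option globally. The code only runs if mongolite is installed.

```r
if (requireNamespace("mongolite", quietly = TRUE)) {
  old <- mongolite::mongo_options()                 # read current values
  mongolite::mongo_options(bigint_as_char = TRUE)   # parse int64 as strings
  str(mongolite::mongo_options())
  # Restore the previous setting.
  mongolite::mongo_options(bigint_as_char = old$bigint_as_char)
}
```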