Chapter 3 Query Data

This chapter will cover basic techniques for reading data from MongoDB. To exemplify this chapter we start by creating a new collection diamonds and insert an example dataset from the ggplot2 package.

# create collection with example data
dmd <- mongo("diamonds")
dmd$insert(ggplot2::diamonds)

#> List of 5
#>  $ nInserted  : num 53940
#>  $ nMatched   : num 0
#>  $ nRemoved   : num 0
#>  $ nUpserted  : num 0
#>  $ writeErrors: list()

The next chapter explains inserting data in more detail. For now let’s verify all our data was inserted:

dmd$count() == nrow(ggplot2::diamonds)

#> [1] TRUE

Seems good!

3.1 Query syntax

MongoDB uses JSON based syntax to query documents. The empty query {} means: select all data. The same query parameter is used for multiple operations such as find(), iterate(), count(), remove() and update(). We need to specify the JSON query as a string in R.

# Get all records
dmd$count('{}')

#> [1] 53940

# Read all the data back into R
alldata <- dmd$find('{}')
print(alldata)

#>       carat       cut color clarity depth table price     x     y     z
#> 1      0.23     Ideal     E     SI2  61.5  55.0   326  3.95  3.98  2.43
#> 2      0.21   Premium     E     SI1  59.8  61.0   326  3.89  3.84  2.31
#> 3      0.23      Good     E     VS1  56.9  65.0   327  4.05  4.07  2.31
#> 4      0.29   Premium     I     VS2  62.4  58.0   334  4.20  4.23  2.63
#> 5      0.31      Good     J     SI2  63.3  58.0   335  4.34  4.35  2.75
#> 6      0.24 Very Good     J    VVS2  62.8  57.0   336  3.94  3.96  2.48
#> 7      0.24 Very Good     I    VVS1  62.3  57.0   336  3.95  3.98  2.47
#> 8      0.26 Very Good     H     SI1  61.9  55.0   337  4.07  4.11  2.53
#> 9      0.22      Fair     E     VS2  65.1  61.0   337  3.87  3.78  2.49
#> 10     0.23 Very Good     H     VS1  59.4  61.0   338  4.00  4.05  2.39
#>  [ reached getOption("max.print") -- omitted 53930 rows ]

To query all rows where cut == "Premium" AND price < 1000 you would run:

premium_diamonds <- dmd$find('{"cut" : "Premium", "price" : { "$lt" : 1000 } }')
print(premium_diamonds)

#>      carat     cut color clarity depth table price    x    y    z
#> 1     0.21 Premium     E     SI1  59.8  61.0   326 3.89 3.84 2.31
#> 2     0.29 Premium     I     VS2  62.4  58.0   334 4.20 4.23 2.63
#> 3     0.22 Premium     F     SI1  60.4  61.0   342 3.88 3.84 2.33
#> 4     0.20 Premium     E     SI2  60.2  62.0   345 3.79 3.75 2.27
#> 5     0.32 Premium     E      I1  60.9  58.0   345 4.38 4.42 2.68
#> 6     0.24 Premium     I     VS1  62.5  57.0   355 3.97 3.94 2.47
#> 7     0.29 Premium     F     SI1  62.4  58.0   403 4.24 4.26 2.65
#> 8     0.22 Premium     E     VS2  61.6  58.0   404 3.93 3.89 2.41
#> 9     0.22 Premium     D     VS2  59.3  62.0   404 3.91 3.88 2.31
#> 10    0.30 Premium     J     SI2  59.3  61.0   405 4.43 4.38 2.61
#>  [ reached getOption("max.print") -- omitted 3190 rows ]

We can confirm that we get the same subset in R:

nrow(premium_diamonds)

#> [1] 3200

nrow(subset(ggplot2::diamonds, cut == "Premium" & price < 1000))

#> [1] 3200

To learn more about mongo data queries, study the Mongo Query Documents manual.

3.2 Filter fields

The fields parameter filters specific columns from the output. Let’s continue our example above:

test <- dmd$find(
  query = '{"cut" : "Premium", "price" : { "$lt" : 1000 } }', 
  fields = '{"cut" : true, "clarity" : true}',
  limit = 5
)
print(test)

#>                        _id     cut clarity
#> 1 5b6ca13c47a302fe131029dc Premium     SI1
#> 2 5b6ca13c47a302fe131029de Premium     VS2
#> 3 5b6ca13c47a302fe131029e7 Premium     SI1
#> 4 5b6ca13c47a302fe131029e9 Premium     SI2
#> 5 5b6ca13c47a302fe131029ea Premium      I1

By default mongo always includes the id field. To prevent this we need to explicitly disable it:

test <- dmd$find(
  query = '{"cut" : "Premium", "price" : { "$lt" : 1000 } }', 
  fields = '{"cut" : true, "clarity" : true, "_id": false}',
  limit = 5
)
print(test)

#>       cut clarity
#> 1 Premium     SI1
#> 2 Premium     VS2
#> 3 Premium     SI1
#> 4 Premium     SI2
#> 5 Premium      I1

The default value for the field argument is '{"_id" : 0}' i.e. all columns, except for _id.

3.3 Sort and limit

The sort parameter allows us to order the output, and limit restricts the number records that will be returned. For example to return the 7 most expensive premium diamonds in the data we sort by price in descending order:

dmd$find('{"cut" : "Premium"}', sort = '{"price": -1}', limit = 7)

#>   carat     cut color clarity depth table price    x    y    z
#> 1  2.29 Premium     I     VS2  60.8    60 18823 8.50 8.47 5.16
#> 2  2.29 Premium     I     SI1  61.8    59 18797 8.52 8.45 5.24
#> 3  2.04 Premium     H     SI1  58.1    60 18795 8.37 8.28 4.84
#> 4  2.00 Premium     I     VS1  60.8    59 18795 8.13 8.02 4.91
#> 5  1.71 Premium     F     VS2  62.3    59 18791 7.57 7.53 4.70
#> 6  2.05 Premium     F     SI2  60.2    59 18784 8.28 8.33 5.00
#> 7  2.55 Premium     I     VS1  61.8    62 18766 8.70 8.65 5.36

Note that usually you should only sort by fields that have an index set on them. Sorting by unindexed fields can be very slow and the server might reach the memory limits on the server.

3.4 Indexing

By default a collection only has an index for _id, which means that selecting or sorting by any other field is relatively slow.

system.time(dmd$find(sort = '{"price" : 1}', limit = 1))

#>    user  system elapsed 
#>   0.001   0.001   0.096

By adding an index, the field gets presorted and selecting or sorting it is almost immediate:

dmd$index(add = '{"price" : 1}')

#>   v key._id key.price    name            ns
#> 1 2       1        NA    _id_ test.diamonds
#> 2 2      NA         1 price_1 test.diamonds

system.time(dmd$find(sort = '{"price" : 1}', limit = 1))

#>    user  system elapsed 
#>   0.002   0.000   0.002

In order to speed up queries involving multiple fields, you need to add a cross-index which intersects both fields:

dmd$index(add = '{"depth" : 1}')

#>   v key._id key.price key.depth    name            ns
#> 1 2       1        NA        NA    _id_ test.diamonds
#> 2 2      NA         1        NA price_1 test.diamonds
#> 3 2      NA        NA         1 depth_1 test.diamonds

dmd$index(add = '{"depth" : 1, "price" : 1}')

#>   v key._id key.price key.depth            name            ns
#> 1 2       1        NA        NA            _id_ test.diamonds
#> 2 2      NA         1        NA         price_1 test.diamonds
#> 3 2      NA        NA         1         depth_1 test.diamonds
#> 4 2      NA         1         1 depth_1_price_1 test.diamonds

To remove indices from the collection, use the name as listed above:

dmd$index(remove = 'depth_1_price_1')

#>   v key._id key.price key.depth    name            ns
#> 1 2       1        NA        NA    _id_ test.diamonds
#> 2 2      NA         1        NA price_1 test.diamonds
#> 3 2      NA        NA         1 depth_1 test.diamonds

dmd$index(remove = 'depth_1')

#>   v key._id key.price    name            ns
#> 1 2       1        NA    _id_ test.diamonds
#> 2 2      NA         1 price_1 test.diamonds

3.5 Iterating

The find() method automatically simplifies the collection into a data frame, but sometimes you need more fine-grained control. The iterate() method allows you to perform a query, but read the records one-by-one without simplification.

The iterator has methods one(), batch(n) which allow you to step through a single or n records at the time. When the iterator is exhausted it will return NULL. Lets run the same query as above using the iterator interface:

# perform query and return the iterator
it <- dmd$iterate('{"cut" : "Premium"}', sort = '{"price": -1}', limit = 7)

# read records from  the iterator
while(!is.null(x <- it$one())){
  cat(sprintf("Found %.2f carat diamond for $%d\n", x$carat, x$price))
}

#> Found 2.29 carat diamond for $18823
#> Found 2.29 carat diamond for $18797
#> Found 2.00 carat diamond for $18795
#> Found 2.04 carat diamond for $18795
#> Found 1.71 carat diamond for $18791
#> Found 2.05 carat diamond for $18784
#> Found 2.55 carat diamond for $18766

The iterator does not perform any simplification, so each x is simply a named list containing the parsed JSON record.

3.6 Select by date

In order to query by timestamp we must make sure dates are in proper UTC datetime format. When inserting data via R this means the column must be in POSIXct type.

# Get some example data
mydata <- jsonlite::fromJSON("https://api.github.com/repos/jeroen/mongolite/issues")
mydata$title <- paste0(substr(mydata$title, 1, 40), "...")
mydata$created_at <- strptime(mydata$created_at, "%Y-%m-%dT%H:%M:%SZ", 'UTC')
mydata$closed_at <- strptime(mydata$closed_at, "%Y-%m-%dT%H:%M:%SZ", 'UTC')

# Insert into mongo
issues <- mongo("issues")
issues$insert(mydata)

#> List of 5
#>  $ nInserted  : num 9
#>  $ nMatched   : num 0
#>  $ nRemoved   : num 0
#>  $ nUpserted  : num 0
#>  $ writeErrors: list()

Selecting by date is done via the "$date" operator. For example to select dates which were created after January 1st, 2017:

issues$find(
  query = '{"created_at": { "$gte" : { "$date" : "2017-01-01T00:00:00Z" }}}',
  fields = '{"created_at" : true, "user.login" : true, "title":true, "_id": false}'
)

#>                                         title         login          created_at
#> 1 Add an opt-in parser tailored to rectang...      atheriel 2018-08-09 17:14:18
#> 2              Request for Query Modifiers...   DavidHeslip 2018-07-24 12:23:27
#> 3          How do I create a unique index?...        bidwbb 2018-04-19 19:55:22
#> 4               Remove mongolab references...      rlondner 2018-01-11 20:17:44
#> 5            Query Issue with Aggregation ... koushiksaha89 2017-05-25 09:52:11

Note that confusingly, what R calls a Date is not a timestamp but rather a string which only contains the date (Y-M-D) part of a timestamp. This type cannot be queried in MongoDB.

See the MongoDB Extended JSON manual for details.

3.7 Select by ID

Each record inserted into MongoDB is automatically assigned an "_id" value.

issues$find(fields= '{"created_at":true, "_id":true}', limit = 10)

#>                        _id          created_at
#> 1 5b6ca14047a302fe1310fc8f 2018-08-09 17:14:18
#> 2 5b6ca14047a302fe1310fc90 2018-07-24 12:23:27
#> 3 5b6ca14047a302fe1310fc91 2018-04-19 19:55:22
#> 4 5b6ca14047a302fe1310fc92 2018-01-11 20:17:44
#> 5 5b6ca14047a302fe1310fc93 2017-05-25 09:52:11
#> 6 5b6ca14047a302fe1310fc94 2016-05-05 14:22:03
#> 7 5b6ca14047a302fe1310fc95 2015-08-11 19:10:15
#> 8 5b6ca14047a302fe1310fc96 2015-05-02 05:51:40
#> 9 5b6ca14047a302fe1310fc97 2015-04-27 04:29:51

Use the {"$oid"} operator (similar to ObjectId() in mongo) to select a record by it’s ID, for example:

issues$find(query = '{"_id" : {"$oid":"58a83a6aba7cb2070210433e"}}')

See the MongoDB Extended JSON manual for syntax details.