Chapter 3 Query Data
This chapter will cover basic techniques for reading data from MongoDB. To exemplify this chapter we start by creating a new collection diamonds and insert an example dataset from the ggplot2 package.
#> List of 5
#> $ nInserted : num 53940
#> $ nMatched : num 0
#> $ nRemoved : num 0
#> $ nUpserted : num 0
#> $ writeErrors: list()
The next chapter explains inserting data in more detail. For now let’s verify all our data was inserted:
#> [1] TRUE
Seems good!
3.1 Query syntax
MongoDB uses JSON based syntax to query documents. The empty query {}
means: select all data. The same query parameter is used for multiple operations such as find()
, iterate()
, count()
, remove()
and update()
. We need to specify the JSON query as a string in R.
#> [1] 53940
#> carat cut color clarity depth table price x y z
#> 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
#> 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
#> 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
#> 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
#> 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
#> 6 0.24 Very Good J VVS2 62.8 57.0 336 3.94 3.96 2.48
#> 7 0.24 Very Good I VVS1 62.3 57.0 336 3.95 3.98 2.47
#> 8 0.26 Very Good H SI1 61.9 55.0 337 4.07 4.11 2.53
#> 9 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49
#> 10 0.23 Very Good H VS1 59.4 61.0 338 4.00 4.05 2.39
#> [ reached getOption("max.print") -- omitted 53930 rows ]
To query all rows where cut == "Premium" AND price < 1000
you would run:
premium_diamonds <- dmd$find('{"cut" : "Premium", "price" : { "$lt" : 1000 } }')
print(premium_diamonds)
#> carat cut color clarity depth table price x y z
#> 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
#> 2 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
#> 3 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33
#> 4 0.20 Premium E SI2 60.2 62.0 345 3.79 3.75 2.27
#> 5 0.32 Premium E I1 60.9 58.0 345 4.38 4.42 2.68
#> 6 0.24 Premium I VS1 62.5 57.0 355 3.97 3.94 2.47
#> 7 0.29 Premium F SI1 62.4 58.0 403 4.24 4.26 2.65
#> 8 0.22 Premium E VS2 61.6 58.0 404 3.93 3.89 2.41
#> 9 0.22 Premium D VS2 59.3 62.0 404 3.91 3.88 2.31
#> 10 0.30 Premium J SI2 59.3 61.0 405 4.43 4.38 2.61
#> [ reached getOption("max.print") -- omitted 3190 rows ]
We can confirm that we get the same subset in R:
#> [1] 3200
#> [1] 3200
To learn more about mongo data queries, study the Mongo Query Documents manual.
3.2 Filter fields
The fields
parameter filters specific columns from the output. Let’s continue our example above:
test <- dmd$find(
query = '{"cut" : "Premium", "price" : { "$lt" : 1000 } }',
fields = '{"cut" : true, "clarity" : true}',
limit = 5
)
print(test)
#> _id cut clarity
#> 1 5b6ca13c47a302fe131029dc Premium SI1
#> 2 5b6ca13c47a302fe131029de Premium VS2
#> 3 5b6ca13c47a302fe131029e7 Premium SI1
#> 4 5b6ca13c47a302fe131029e9 Premium SI2
#> 5 5b6ca13c47a302fe131029ea Premium I1
By default mongo always includes the id
field. To prevent this we need to explicitly disable it:
test <- dmd$find(
query = '{"cut" : "Premium", "price" : { "$lt" : 1000 } }',
fields = '{"cut" : true, "clarity" : true, "_id": false}',
limit = 5
)
print(test)
#> cut clarity
#> 1 Premium SI1
#> 2 Premium VS2
#> 3 Premium SI1
#> 4 Premium SI2
#> 5 Premium I1
The default value for the field
argument is '{"_id" : 0}'
i.e. all columns, except for _id
.
3.3 Sort and limit
The sort
parameter allows us to order the output, and limit
restricts the number records that will be returned. For example to return the 7 most expensive premium diamonds in the data we sort by price in descending order:
#> carat cut color clarity depth table price x y z
#> 1 2.29 Premium I VS2 60.8 60 18823 8.50 8.47 5.16
#> 2 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
#> 3 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
#> 4 2.00 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
#> 5 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.70
#> 6 2.05 Premium F SI2 60.2 59 18784 8.28 8.33 5.00
#> 7 2.55 Premium I VS1 61.8 62 18766 8.70 8.65 5.36
Note that usually you should only sort by fields that have an index set on them. Sorting by unindexed fields can be very slow and the server might reach the memory limits on the server.
3.4 Indexing
By default a collection only has an index for _id
, which means that selecting or sorting by any other field is relatively slow.
#> user system elapsed
#> 0.001 0.001 0.096
By adding an index, the field gets presorted and selecting or sorting it is almost immediate:
#> v key._id key.price name ns
#> 1 2 1 NA _id_ test.diamonds
#> 2 2 NA 1 price_1 test.diamonds
#> user system elapsed
#> 0.002 0.000 0.002
In order to speed up queries involving multiple fields, you need to add a cross-index which intersects both fields:
#> v key._id key.price key.depth name ns
#> 1 2 1 NA NA _id_ test.diamonds
#> 2 2 NA 1 NA price_1 test.diamonds
#> 3 2 NA NA 1 depth_1 test.diamonds
#> v key._id key.price key.depth name ns
#> 1 2 1 NA NA _id_ test.diamonds
#> 2 2 NA 1 NA price_1 test.diamonds
#> 3 2 NA NA 1 depth_1 test.diamonds
#> 4 2 NA 1 1 depth_1_price_1 test.diamonds
To remove indices from the collection, use the name
as listed above:
#> v key._id key.price key.depth name ns
#> 1 2 1 NA NA _id_ test.diamonds
#> 2 2 NA 1 NA price_1 test.diamonds
#> 3 2 NA NA 1 depth_1 test.diamonds
#> v key._id key.price name ns
#> 1 2 1 NA _id_ test.diamonds
#> 2 2 NA 1 price_1 test.diamonds
3.5 Iterating
The find()
method automatically simplifies the collection into a data frame, but sometimes you need more fine-grained control. The iterate()
method allows you to perform a query, but read the records one-by-one without simplification.
The iterator has methods one()
, batch(n)
which allow you to step through a single or n
records at the time. When the iterator is exhausted it will return NULL
. Lets run the same query as above using the iterator interface:
# perform query and return the iterator
it <- dmd$iterate('{"cut" : "Premium"}', sort = '{"price": -1}', limit = 7)
# read records from the iterator
while(!is.null(x <- it$one())){
cat(sprintf("Found %.2f carat diamond for $%d\n", x$carat, x$price))
}
#> Found 2.29 carat diamond for $18823
#> Found 2.29 carat diamond for $18797
#> Found 2.00 carat diamond for $18795
#> Found 2.04 carat diamond for $18795
#> Found 1.71 carat diamond for $18791
#> Found 2.05 carat diamond for $18784
#> Found 2.55 carat diamond for $18766
The iterator does not perform any simplification, so each x
is simply a named list containing the parsed JSON record.
3.6 Select by date
In order to query by timestamp we must make sure dates are in proper UTC datetime format. When inserting data via R this means
the column must be in POSIXct
type.
# Get some example data
mydata <- jsonlite::fromJSON("https://api.github.com/repos/jeroen/mongolite/issues")
mydata$title <- paste0(substr(mydata$title, 1, 40), "...")
mydata$created_at <- strptime(mydata$created_at, "%Y-%m-%dT%H:%M:%SZ", 'UTC')
mydata$closed_at <- strptime(mydata$closed_at, "%Y-%m-%dT%H:%M:%SZ", 'UTC')
# Insert into mongo
issues <- mongo("issues")
issues$insert(mydata)
#> List of 5
#> $ nInserted : num 9
#> $ nMatched : num 0
#> $ nRemoved : num 0
#> $ nUpserted : num 0
#> $ writeErrors: list()
Selecting by date is done via the "$date"
operator. For example to select dates which were created after January 1st, 2017:
issues$find(
query = '{"created_at": { "$gte" : { "$date" : "2017-01-01T00:00:00Z" }}}',
fields = '{"created_at" : true, "user.login" : true, "title":true, "_id": false}'
)
#> title login created_at
#> 1 Add an opt-in parser tailored to rectang... atheriel 2018-08-09 17:14:18
#> 2 Request for Query Modifiers... DavidHeslip 2018-07-24 12:23:27
#> 3 How do I create a unique index?... bidwbb 2018-04-19 19:55:22
#> 4 Remove mongolab references... rlondner 2018-01-11 20:17:44
#> 5 Query Issue with Aggregation ... koushiksaha89 2017-05-25 09:52:11
Note that confusingly, what R calls a Date is not a timestamp but rather a string which only contains the date (Y-M-D) part of a timestamp. This type cannot be queried in MongoDB.
See the MongoDB Extended JSON manual for details.
3.7 Select by ID
Each record inserted into MongoDB is automatically assigned an "_id"
value.
#> _id created_at
#> 1 5b6ca14047a302fe1310fc8f 2018-08-09 17:14:18
#> 2 5b6ca14047a302fe1310fc90 2018-07-24 12:23:27
#> 3 5b6ca14047a302fe1310fc91 2018-04-19 19:55:22
#> 4 5b6ca14047a302fe1310fc92 2018-01-11 20:17:44
#> 5 5b6ca14047a302fe1310fc93 2017-05-25 09:52:11
#> 6 5b6ca14047a302fe1310fc94 2016-05-05 14:22:03
#> 7 5b6ca14047a302fe1310fc95 2015-08-11 19:10:15
#> 8 5b6ca14047a302fe1310fc96 2015-05-02 05:51:40
#> 9 5b6ca14047a302fe1310fc97 2015-04-27 04:29:51
Use the {"$oid"}
operator (similar to ObjectId()
in mongo) to select a record by it’s ID, for example:
See the MongoDB Extended JSON manual for syntax details.