Using JOINs in MongoDB NoSQL Databases

Thanks to Julian Motz for kindly helping to peer review this article.

One of the biggest differences between SQL and NoSQL databases is JOIN. In relational databases, the SQL JOIN clause allows you to combine rows from two or more tables using a common field between them. For example, if you have tables of books and publishers, you can write SQL commands such as:

SELECT book.title, publisher.name
FROM book
LEFT JOIN book.publisher_id ON publisher.id;

In other words, the book table has a publisher_id field which references the id field in the publisher table.

This is practical, since a single publisher could offer thousands of books. If we ever need to update a publisher’s details, we can change a single record. Data redundancy is minimized, since we don’t need to repeat the publisher information for every book. The technique is known as normalization.

SQL databases offer a range of normalization and constraint features to ensure relationships are maintained.

NoSQL == No JOIN?

Not always …

Document-oriented databases such as MongoDB are designed to store denormalized data. Ideally, there should be no relationship between collections. If the same data is required in two or more documents, it must be repeated.

This can be frustrating, since there are few situations where you never need relational data. Fortunately, MongoDB 3.2 introduces a new $lookup operator which can perform a LEFT-OUTER-JOIN-like operation on two or more collections. But there’s a catch …

MongoDB Aggregation

$lookup is only permitted in aggregation operations. Think of these as a pipeline of operators which query, filter and group a result. The output of one operator is used as the input for the next.

Aggregation is more difficult to understand than simpler find queries and will generally run slower. However, they are powerful and an invaluable option for complex search operations.

Aggregation is best explained with an example. Presume we’re creating a social media platform with a user collection. It stores every user’s details in separate documents. For example:

{
  "_id": ObjectID("45b83bda421238c76f5c1969"),
  "name": "User One",
  "email: "userone@email.com",
  "country": "UK",
  "dob": ISODate("1999-09-13T00:00:00.000Z")
}

We can add as many fields as necessary, but all MongoDB documents require an _id field which has a unique value. The _id is similar to an SQL primary key, and will be inserted automatically if necessary.

Our social network now requires a post collection, which stores numerous insightful updates from users. The documents store the text, date, a rating and a reference to the user who wrote it in a user_id field:

{
  "_id": ObjectID("17c9812acff9ac0bba018cc1"),
  "user_id": ObjectID("45b83bda421238c76f5c1969"),
  "date: ISODate("2016-09-05T03:05:00.123Z"),
  "text": "My life story so far",
  "rating": "important"
}

We now want to show the last twenty posts with an “important” rating from all users in reverse chronological order. Each returned document should contain the text, the time of the post and the associated user’s name and country.

The MongoDB aggregate query is passed an array of pipeline operators which define each operation in order. First, we need to extract all documents from the post collection which have the correct rating using the $match filter:

{ "$match": { "rating": "important" } }

We must now sort the matched items into reverse date order using the $sort operator:

{ "$sort": { "date": -1 } }

Since we only require twenty posts, we can apply a $limit stage so MongoDB only needs to process data we want:

{ "$limit": 20 }

We can now join data from the user collection using the new $lookup operator. It requires an object with four parameters:

localField: the lookup field in the input document
from: the collection to join
foreignField: the field to lookup in the from collection
as: the name of the output field.

Our operator is therefore:

{ "$lookup": {
  "localField": "user_id",
  "from": "user",
  "foreignField": "_id",
  "as": "userinfo"
} }

This will create a new field in our output named userinfo. It contains an array where each value is the matching the user document:

"userinfo": [
  { "name": "User One", ... }
]

We have a one-to-one relationship between the post.user_id and user._id, since a post can only have one author. Therefore, our userinfo array will only ever contain one item. We can use the $unwind operator to deconstruct it into a sub-document:

{ "$unwind": "$userinfo" }

The output will now be converted to a more practical format which can have further operators applied:

"userinfo": {
  "name": "User One",
  "email: "userone@email.com",
  …
}

Finally, we can return the text, the time of the post, the user’s name and country using a $project stage in the pipeline:

{ "$project": {
  "text": 1,
  "date": 1,
  "userinfo.name": 1,
  "userinfo.country": 1
} }

Putting It All Together

Our final aggregate query matches posts, sorts into order, limits to the latest twenty items, joins user data, flattens the user array and returns necessary fields only. The full command:

db.post.aggregate([
  { "$match": { "rating": "important" } },
  { "$sort": { "date": -1 } },
  { "$limit": 20 },
  { "$lookup": {
    "localField": "user_id",
    "from": "user",
    "foreignField": "_id",
    "as": "userinfo"
  } },
  { "$unwind": "$userinfo" },
  { "$project": {
    "text": 1,
    "date": 1,
    "userinfo.name": 1,
    "userinfo.country": 1
  } }
]);

The result is a collection of up to twenty documents. For example:

[
  {
    "text": "The latest post",
    "date: ISODate("2016-09-27T00:00:00.000Z"),
    "userinfo": {
      "name": "User One",
      "country": "UK"
    }
  },
  {
    "text": "Another post",
    "date: ISODate("2016-09-26T00:00:00.000Z"),
    "userinfo": {
      "name": "User One",
      "country": "UK"
    }
  }
  ...
]

Great! I Can Finally Switch to NoSQL!

MongoDB $lookup is useful and powerful, but even this basic example requires a complex aggregate query. It’s not a substitute for the more powerful JOIN clause offered in SQL. Neither does MongoDB offer constraints; if a user document is deleted, orphan post documents would remain.

Ideally, the $lookup operator should be required infrequently. If you need it a lot, you’re possibly using the wrong data store …

If you have relational data, use a relational (SQL) database!

That said, $lookup is a welcome addition to MongoDB 3.2. It can overcome some of the more frustrating issues when using small amounts of relational data in a NoSQL database.

Frequently Asked Questions (FAQs) about Using Joins in MongoDB NoSQL Databases

What is the difference between SQL joins and MongoDB joins?

In SQL databases, a join operation combines rows from two or more tables based on a related column between them. However, MongoDB, being a NoSQL database, doesn’t support traditional SQL joins. Instead, MongoDB provides two methods to perform a similar operation: the $lookup stage and the $graphLookup stage in aggregation. These methods allow you to combine data from multiple collections into a single result set.

How does the $lookup stage work in MongoDB?

The $lookup stage in MongoDB lets you join documents from another collection (the “joined” collection), adding the joined documents to the input documents. The $lookup stage specifies the “from” collection, the “localField” and “foreignField” for matching documents, and the “as” field for output documents. It’s similar to a left outer join in SQL, returning all documents from the input collection and the matching documents from the “from” collection.

Can I perform recursive search with MongoDB joins?

Yes, MongoDB provides the $graphLookup stage for recursive search. The $graphLookup stage performs a recursive search on a specified collection, with options to restrict the depth and breadth of the search. It’s useful for querying hierarchical data or graphs where the number of levels is unknown or can vary.

How can I optimize performance when using MongoDB joins?

To optimize performance when using MongoDB joins, consider the following strategies: use indexes on the “localField” and “foreignField” to speed up the matching process; limit the number of documents in the “from” collection; and use the $match and $project stages to filter and transform documents before the $lookup stage.

Can I join more than two collections in MongoDB?

Yes, you can join more than two collections in MongoDB by chaining multiple $lookup stages in the aggregation pipeline. Each $lookup stage adds the joined documents from another collection to the input documents.

How can I handle null or missing values when using MongoDB joins?

When using MongoDB joins, if a document in the input collection does not match any document in the “from” collection, the $lookup stage adds an empty array to the “as” field. You can handle these null or missing values by adding a $match stage after the $lookup stage to filter out documents with an empty “as” field.

Can I use MongoDB joins with sharded collections?

As of MongoDB 3.6, the $lookup and $graphLookup stages can accept a sharded collection as the “from” collection. However, the performance may be less optimal than with non-sharded collections due to the extra network overhead.

How can I sort the joined documents in MongoDB?

You can sort the joined documents in MongoDB by adding a $sort stage after the $lookup stage in the aggregation pipeline. The $sort stage sorts the documents by a specified field in ascending or descending order.

Can I use MongoDB joins with the find() method?

No, MongoDB joins are not available with the find() method. The $lookup and $graphLookup stages are part of the aggregation framework, which provides more advanced data processing capabilities than the find() method.

How can I debug or troubleshoot MongoDB joins?

To debug or troubleshoot MongoDB joins, you can use the explain() method to analyze the execution plan of the aggregation pipeline. The explain() method provides detailed information about the stages, including the number of documents processed, the time taken, and the use of indexes.