Choosing a shard key can be difficult, and the factors involved largely depend on your use case. In fact, there is no such thing as a perfect shard key; there are design tradeoffs inherent in every decision. This presentation goes through those tradeoffs, as well as the different types of shard keys available in MongoDB, such as hashed and compound shard keys
3. Single Table En
Agenda
• Why is schema design important
• 4 Real World Schemas
– Inbox
– History
– IndexedAttributes
– Multiple Identities
• Conclusions
4. Why is Schema Design
important?
• Largest factor for a performant system
• Schema design with MongoDB is different
• RDBMS – "What answers do I have?"
• MongoDB – "What question will I have?"
10. 3 Approaches (there are
more)
• Fan out on Read
• Fan out on Write
• Fan out on Write with Bucketing
11. // Shard on "from"
db.shardCollection( ”oscon.inbox", { from: 1 } )
// Make sure we have an index to handle inbox reads
db.inbox.ensureIndex( { to: 1, sent: 1 } )
msg = {
from: "Joe",
to: [ "Bob", "Jane" ],
sent: new Date(),
message: "Hi!",
}
// Send a message
db.inbox.save( msg )
// Read my inbox
db.inbox.find( { to: "Joe" } ).sort( { sent: -1 } )
Fan out on read
12. Fan out on read – I/O
Shard
1 Shard 2
Shard
3
Send
Message
13. Fan out on read – I/O
Shard
1 Shard 2
Shard
3
Read
Inbox
Send
Message
14. Considerations
• Write: One document per message sent
• Read: Find all messages with my own name in
the recipient field
• Read: Requires scatter-gather on sharded
cluster
• A lot of random I/O on a shard to find everything
15. // Shard on “recipient” and “sent”
db.shardCollection( ”oscon.inbox", { ”recipient”: 1, ”sent”: 1 } )
msg = {
from: "Joe",
to: [ "Bob", "Jane" ],
sent: new Date(),
message: "Hi!",
}
// Send a message
for (var i = 0; i < msg.to.length; i++) {
msg.recipient = msg.to[i]
db.inbox.save( msg );
}
// Read my inbox
db.inbox.find( { recipient: "Joe" } ).sort( { sent: -1 } )
Fan out on write
16. Fan out on write – I/O
Shard
1
Shard
2
Shard
3
Send
Message
17. Fan out on write – I/O
Read
Inbox
Send
Message
Shard
1
Shard
2
Shard
3
18. Considerations
• Write: One document per recipient
• Read: Find all of the messages with me as the
recipient
• Can shard on recipient, so inbox reads hit one
shard
• But still lots of random I/O on the shard
19. // Shard on "owner / sequence"
db.shardCollection( ”oscon.inbox",
{ owner: 1, sequence: 1 } )
db.shardCollection( ”oscon.users", { user_name: 1 } )
msg = {
from: "Joe",
to: [ "Bob", "Jane" ],
sent: new Date(),
message: "Hi!",
}
Fan out on write with buckets
21. Fan out on write with buckets
• Each “inbox” document is an array of messages
• Append a message onto “inbox” of recipient
• Bucket inboxes so there’s not too many
messages per document
• Can shard on recipient, so inbox reads hit one
shard
• 1 or 2 documents to read the whole inbox
22. Fan out on write with buckets – I/O
Shard
1
Shard
2
Shard
3
Send
Message
26. Design Goals
• Need to retain a limited amount of history e.g.
– Hours, Days, Weeks
– May be legislative requirement (e.g. HIPPA, SOX, DPA)
• Need to query efficiently by
– match
– ranges
27. 3 Approaches (there are
more)
• Bucket by Number of messages
• Fixed size array
• Bucket by date + TTL collections
28. db.inbox.find()
{ owner: "Joe", sequence: 25,
messages: [
{ from: "Joe",
to: [ "Bob", "Jane" ],
sent: ISODate("2013-03-01T09:59:42.689Z"),
message: "Hi!"
},
…
] }
// Query with a date range
db.inbox.find ({owner: "friend1",
messages: {
$elemMatch: {sent:{$gte: ISODate("…") }}}})
// Remove elements based on a date
db.inbox.update({owner: "friend1" },
{ $pull: { messages: {
sent: { $gte: ISODate("…") } } } } )
Bucket by number of
messages
29. Considerations
• Shrinking documents, space can be reclaimed
with
– db.runCommand ( { compact: '<collection>' } )
• Removing the document after the last element in
the array as been removed
– { "_id" : …, "messages" : [ ], "owner" : "friend1",
"sequence" : 0 }
32. // messages: one doc per user per day
db.inbox.findOne()
{
_id: 1,
to: "Joe",
sequence: ISODate("2013-02-04T00:00:00.392Z"),
messages: [ ]
}
// Auto expires data after 31536000 seconds = 1 year
db.messages.ensureIndex( { sequence: 1 },
{ expireAfterSeconds: 31536000 } )
TTL Collections
34. Design Goal
• Application needs to stored a variable number of
attributes e.g.
– User defined Form
– Meta Data tags
• Queries needed
– Equality
– Range based
• Need to be efficient, regardless of the number of
attributes
35. 2 Approaches (there are
more)
• Attributes as Embedded Document
• Attributes as Objects in an Array
36. db.files.insert( { _id: "local.0",
attr: { type: "text", size: 64,
created: ISODate("..." } } )
db.files.insert( { _id: "local.1",
attr: { type: "text", size: 128} } )
db.files.insert( { _id: "mongod",
attr: { type: "binary", size: 256,
created: ISODate("...") } } )
// Need to create an index for each item in the sub-document
db.files.ensureIndex( { "attr.type": 1 } )
db.files.find( { "attr.type": "text"} )
// Can perform range queries
db.files.ensureIndex( { "attr.size": 1 } )
db.files.find( { "attr.size": { $gt: 64, $lte: 16384 } } )
Attributes as a Sub-
Document
46. Considerations
• Lookup by shard key is routed to 1 shard
• Lookup by other identifier is scatter gathered
across all shards
• Secondary keys cannot have a unique index
49. Considerations
• Lookup to Identities is a routed query
• Lookup to Users is a routed query
• Unique indexes available
• Must do two queries per lookup
51. Summary
• Multiple ways to model a domain problem
• Understand the key uses cases of your app
• Balance between ease of query vs. ease of write
• Random I/O should be avoided