One of the most important tasks for maintaining good performance in a sharded MongoDB cluster is to ensure optimal chunk distribution across each cluster member.
I’m not going to talk about picking a good shard key for your collections because the MongoDB documentation on the subject is pretty good. But I will share with you a script that I use to get the chunk distribution for all databases on a MongoDB sharded cluster.
Context
I need to get the chunk distribution for all databases present on my MongoDB sharded cluster (version 3.4 using WiredTiger storage engine) in CSV format.
The report must contains the following informations:
- Database name
- Collection name
- Shard name
- Number of chunks per collection/shard
- Number of “jumbo chunks” per collection/shard (None, I hope…)
- Total size per collection/shard
- Number of records per collection/shard
One by one approach
You can get the desired informations with only one command:
> db.collection.getShardDistribution()
The main issue with this approach is that the command does not return valid JSON output. In addition, you must run the command manually, one collection at a time.
Example
Connect to your MongoDB sharded cluster through mongos:
mongo --host mongos01:27017 --authenticationDatabase admin -uadmin -padmin
Display shard distribution for one collection (output is truncated):
mongos> use arotest mongos> db.pokemon.getShardDistribution() Shard rs01 at rs01/rs01secondary:27020,rs01primary:27018 data : 13KiB docs : 30 chunks : 2 estimated data per chunk : 6KiB estimated docs per chunk : 15 Shard rs02 at rs02/rs02primary:27018,rs02secondary:27020 data : 9KiB docs : 22 chunks : 2 estimated data per chunk : 4KiB estimated docs per chunk : 11 [...] Totals data : 66KiB docs : 149 chunks : 12 Shard rs01 contains 20.09% data, 20.13% docs in cluster, avg obj size on shard : 453B Shard rs02 contains 14.35% data, 14.76% docs in cluster, avg obj size on shard : 441B [...]
Very difficult to parse in order to provide a CSV report…
Fortunately, we can reach our goal in another way!
Javascript approach
As suggested by the title, we need to use some Javascript code. In addition, we will use MongoDB aggregation framework on the “config.chunks” collection which stores a document for each chunk in the cluster.
The script consists of two parts: 1 part for the collection of information and 1 part for the generation of the CSV file.
First part: collect
We connect to the MongoDB sharded cluster through mongos:
mongo --host mongos01:27017 --authenticationDatabase admin -uadmin -padmin
And switch to “config” database:
mongos> use config switched to db config
Then we declare a new Javascript object (“result”) that will hold the information and execute the script.
It can take some times to execute, according to the number of chunks in your cluster.
mongos> var result=new Object(); mongos> db.chunks.aggregate( [ { "$group" : { _id : { collection: "$ns", shard: "$shard"}, chunks:{$sum:1}, jumbo:{$sum:"$jumbo"}, } }, { "$sort" : { "_id.collection" : 1, "_id.shard" : 1 } } ], { cursor: { batchSize: 0 } } ).forEach(function(j){ var ns=j["_id"]["collection"]; var database=ns.split(".")[0]; if (typeof(result[database]) === 'undefined') {result[database]=new Object();} var collection=ns.split(".")[1]; if (typeof(result[database][collection]) === 'undefined') {result[database][collection]=new Object();} var shard=j["_id"]["shard"]; if (typeof(result[database][collection][shard]) === 'undefined') {result[database][collection][shard]=new Object();} var chunks=j["chunks"]; var jumbo=j["jumbo"]; var size=(db.getSiblingDB(database).getCollection(collection).stats().shards[shard].size/1024/1024).toFixed(2); var numrows=db.getSiblingDB(database).getCollection(collection).stats().shards[shard].count; result[database][collection][shard]["chunks"]=chunks; result[database][collection][shard]["jumbo"]=jumbo; result[database][collection][shard]["size"]=size; result[database][collection][shard]["numrows"]=numrows; });
The script does not return anything. We will process the data before displaying it in CSV format.
Don’t close your MongoDB connection yet!
Second part: process & display
If you want output in JSON format, you can just print the “result” object:
mongos> printjson(result)
But if you want CSV format, you can use this script:
mongos> for(var databases in result) { var database=result[databases]; for(var collections in database) { var collection=database[collections] for(var shards in collection) { var shard=collection[shards] print(databases+";"+collections+";"+shards+";"+shard.chunks+";"+shard.jumbo+";"+shard.size+";"+shard.numrows); } } }
The output looks like this:
arotest;pokemon;rs01;2;0;0.01;30 arotest;pokemon;rs02;2;0;0.01;22 arotest;pokemon;rs03;2;0;0.01;25 arotest;pokemon;rs04;2;0;0.01;19 arotest;pokemon;rs05;2;0;0.01;25 arotest;pokemon;rs06;2;0;0.01;28 arotest;test;rs01;2;0;54.61;177060 arotest;test;rs02;2;0;54.62;177080 arotest;test;rs03;2;0;54.40;176350 arotest;test;rs04;2;0;54.59;176987 arotest;test;rs05;2;0;54.62;177079 arotest;test;rs06;2;0;54.42;176444
You just have to add a heading line if needed:”database;collection;shard;number of chunks;number of jumbo chunks;data size;number of rows”.
Once imported in Excel:
The “data size” field is expressed in megabytes. It does not reflect the amount of storage used by your data, but the size in memory of all records in the shard.
If you need the storage size, you can modify the first script by replacing:
var size=(db.getSiblingDB(database).getCollection(collection).stats().shards[shard].size/1024/1024).toFixed(2);
with
var size=(db.getSiblingDB(database).getCollection(collection).stats().shards[shard].storageSize/1024/1024).toFixed(2);
Have a nice day, and stay tuned for more DBA stuff!