Get chunk distribution for all databases in a MongoDB cluster

One of the most important tasks for maintaining good performance in a sharded MongoDB cluster is to ensure an even chunk distribution across the shards.

I’m not going to talk about picking a good shard key for your collections because the MongoDB documentation on the subject is pretty good. But I will share with you a script that I use to get the chunk distribution for all databases on a MongoDB sharded cluster.

Context

I need to get the chunk distribution for all databases present on my MongoDB sharded cluster (version 3.4 using WiredTiger storage engine) in CSV format.

The report must contain the following information:

  • Database name
  • Collection name
  • Shard name
  • Number of chunks per collection/shard
  • Number of “jumbo chunks” per collection/shard (None, I hope…)
  • Total size per collection/shard
  • Number of records per collection/shard


One-by-one approach

You can get the desired information with a single command:

> db.collection.getShardDistribution()

The main issue with this approach is that the command does not return valid JSON output. In addition, you must run the command manually, one collection at a time.

Example

Connect to your MongoDB sharded cluster through mongos:

mongo --host mongos01:27017 --authenticationDatabase admin -uadmin -padmin

Display shard distribution for one collection (output is truncated):

mongos> use arotest
mongos> db.pokemon.getShardDistribution()

Shard rs01 at rs01/rs01secondary:27020,rs01primary:27018
 data : 13KiB docs : 30 chunks : 2
 estimated data per chunk : 6KiB
 estimated docs per chunk : 15

Shard rs02 at rs02/rs02primary:27018,rs02secondary:27020
 data : 9KiB docs : 22 chunks : 2
 estimated data per chunk : 4KiB
 estimated docs per chunk : 11
[...]
Totals
 data : 66KiB docs : 149 chunks : 12
 Shard rs01 contains 20.09% data, 20.13% docs in cluster, avg obj size on shard : 453B
 Shard rs02 contains 14.35% data, 14.76% docs in cluster, avg obj size on shard : 441B
[...]

This output is very difficult to parse into a CSV report…
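Note that you can at least automate the “one collection at a time” part by looping over the sharded collections listed in “config.collections”. Here is a rough sketch (assuming, as on this 3.4 cluster, that dropped sharded collections stay listed there with a “dropped” flag), but the result is still plain text printed to the shell:

mongos> db.getSiblingDB("config").collections.find({ dropped: false }).forEach(function(c) {
	// c._id is the full namespace "<database>.<collection>"
	var dbName = c._id.split(".")[0];
	var collName = c._id.substring(c._id.indexOf(".") + 1);
	print("=== " + c._id + " ===");
	db.getSiblingDB(dbName).getCollection(collName).getShardDistribution();
});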

Fortunately, we can reach our goal in another way!

JavaScript approach

As the title suggests, we need some JavaScript code. In addition, we will use the MongoDB aggregation framework on the “config.chunks” collection, which stores one document per chunk in the cluster.
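For reference, a document from “config.chunks” looks roughly like this (the shard key and values below are purely illustrative, and the “jumbo” flag is only present on chunks that have been marked as jumbo):

{
	"_id" : "arotest.pokemon-number_MinKey",
	"ns" : "arotest.pokemon",
	"min" : { "number" : { "$minKey" : 1 } },
	"max" : { "number" : 31 },
	"shard" : "rs01",
	"jumbo" : true
}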

The script consists of two parts: one that collects the information and one that generates the CSV output.

First part: collect

We connect to the MongoDB sharded cluster through mongos:

mongo --host mongos01:27017 --authenticationDatabase admin -uadmin -padmin

And switch to “config” database:

mongos> use config
switched to db config

Then we declare a new JavaScript object (“result”) that will hold the information, and execute the script.
It can take some time to run, depending on the number of chunks in your cluster.

mongos> var result=new Object();
mongos> db.chunks.aggregate(
	[
		{
			// one result document per collection/shard pair
			"$group" :
			{
				_id : { collection: "$ns", shard: "$shard"},
				chunks:{$sum:1},
				// "jumbo" is a boolean flag only present on jumbo chunks,
				// so we count it with $cond instead of summing it directly
				jumbo:{$sum:{$cond:[{$eq:["$jumbo",true]},1,0]}}
			}
		},
		{
			"$sort" :
			{
				"_id.collection" : 1,
				"_id.shard" : 1
			}
		}
	],
	{
		cursor: { batchSize: 0 }
	}
).forEach(function(j){
	// the namespace is "<database>.<collection>"; collection names may themselves contain dots
	var ns=j["_id"]["collection"];
	var database=ns.split(".")[0];
	var collection=ns.substring(ns.indexOf(".")+1);
	var shard=j["_id"]["shard"];
	// build the nested structure result[database][collection][shard] on the fly
	if (typeof(result[database]) === 'undefined')
		{result[database]=new Object();}
	if (typeof(result[database][collection]) === 'undefined')
		{result[database][collection]=new Object();}
	if (typeof(result[database][collection][shard]) === 'undefined')
		{result[database][collection][shard]=new Object();}
	// fetch the per-shard statistics only once per collection/shard pair
	var shardStats=db.getSiblingDB(database).getCollection(collection).stats().shards[shard];
	result[database][collection][shard]["chunks"]=j["chunks"];
	result[database][collection][shard]["jumbo"]=j["jumbo"];
	result[database][collection][shard]["size"]=(shardStats.size/1024/1024).toFixed(2);
	result[database][collection][shard]["numrows"]=shardStats.count;
});

The script does not return anything. We will process the data before displaying it in CSV format.
Don’t close your MongoDB connection yet!
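If you want to make sure that something was actually collected, a quick sanity check is enough (with the example database used above, you should see something like this):

mongos> Object.keys(result)
[ "arotest" ]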

Second part: process & display

If you want output in JSON format, you can just print the “result” object:

mongos> printjson(result)
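The “result” object is nested by database, then collection, then shard. With the example data above, the (truncated) JSON output looks roughly like this (note that “size” is a string because of toFixed()):

{
	"arotest" : {
		"pokemon" : {
			"rs01" : {
				"chunks" : 2,
				"jumbo" : 0,
				"size" : "0.01",
				"numrows" : 30
			},
			...
		},
		...
	}
}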

But if you want CSV format, you can use this script:

mongos> for (var databaseName in result) {
	var collections = result[databaseName];
	for (var collectionName in collections) {
		var shards = collections[collectionName];
		for (var shardName in shards) {
			// one CSV line per database/collection/shard combination
			var stats = shards[shardName];
			print(databaseName+";"+collectionName+";"+shardName+";"+stats.chunks+";"+stats.jumbo+";"+stats.size+";"+stats.numrows);
		}
	}
}

The output looks like this:

arotest;pokemon;rs01;2;0;0.01;30
arotest;pokemon;rs02;2;0;0.01;22
arotest;pokemon;rs03;2;0;0.01;25
arotest;pokemon;rs04;2;0;0.01;19
arotest;pokemon;rs05;2;0;0.01;25
arotest;pokemon;rs06;2;0;0.01;28
arotest;test;rs01;2;0;54.61;177060
arotest;test;rs02;2;0;54.62;177080
arotest;test;rs03;2;0;54.40;176350
arotest;test;rs04;2;0;54.59;176987
arotest;test;rs05;2;0;54.62;177079
arotest;test;rs06;2;0;54.42;176444

You just have to add a header line if needed: “database;collection;shard;number of chunks;number of jumbo chunks;data size;number of rows”.
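For example, you can print that header line just before the loop. And if you save both parts in a .js file (replacing “use config” with db = db.getSiblingDB("config"), since “use” only works interactively), you can let mongo write the CSV file for you; the file names below are just examples:

mongos> print("database;collection;shard;number of chunks;number of jumbo chunks;data size;number of rows");

mongo --quiet --host mongos01:27017 --authenticationDatabase admin -uadmin -padmin chunk_distribution.js > chunk_distribution.csv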

Once imported into Excel:
[Screenshot: MongoDB chunk distribution in Excel]

The “data size” field is expressed in megabytes. It does not reflect the storage actually used on disk, but the uncompressed size of all documents on the shard (the “size” field returned by collStats).

If you need the storage size instead, you can modify the first script by replacing:

result[database][collection][shard]["size"]=(shardStats.size/1024/1024).toFixed(2);

with

result[database][collection][shard]["size"]=(shardStats.storageSize/1024/1024).toFixed(2);
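To see the difference between the two metrics on a single shard, you can compare them directly (a quick check using the example collection from above):

mongos> var s = db.getSiblingDB("arotest").getCollection("test").stats().shards["rs01"];
mongos> print("size (uncompressed, MB): " + (s.size/1024/1024).toFixed(2));
mongos> print("storageSize (on disk, MB): " + (s.storageSize/1024/1024).toFixed(2));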

Have a nice day, and stay tuned for more DBA stuff!
