How to improve MongoDB find query performance?-CodePudding

I have a collection collection1 with documents like such:

{
  _id: 123,
  field1: "test",
  array1: [
    {
      array2: [
        {
          field2: 1,
          object1: {
            field3: "test"
          }
        }
      ]
    }
  ]
}

I am trying to get all the documents from the collection filtered by fields field1, field2 and field3. The query I have looks like:

db.collection1.find(
{
  field1: "test",
  array1: {
    $elemMatch: {
      array2: {
        $elemMatch: {
          field2: {
            $gte: 1
          }, 
          "object1.field3": "test"
        }
      }
    }
  }
})

This collection has ~125,000 documents. Given how the query has to skim through two nested arrays for filtering, one would expect this query to be slow. And it is, taking around 30-40 seconds. So, to improve its performance, I created an index for all 3 fields, which looks like db.collection1.createIndex({"array1.array2.object1.field3": 1, "array1.array2.field2": 1, "field1": 1});

Using the index, the query is twice as fast, taking ~15 seconds. However, that is still too slow. I want to get the query <5 seconds. Any ideas on how I can improve the speed? In case it helps, I can add the query planner for both queries (using and not using index).

EDIT: I tried using all 6 possible combinations of different ordering of the fields in the index, and they all had the same results. So then I paid more attention to the query planner and execution stats of the query and I noticed something:

"queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "db.collection1",
        "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
                "stage" : "IXSCAN",
                "indexName" : "fields_index"
            }
        }
    },
    "executionStats" : {
        "executionSuccess" : true,
        "executionTimeMillis" : "15602.784",
        "planningTimeMillis" : "0.248",
        "executionStages" : {
            "stage" : "FETCH",
            "nReturned" : "0",
            "executionTimeMillisEstimate" : "15602.130",
            "inputStages" : [
                {
                    "stage" : "IXSCAN",
                    "nReturned" : "300220",
                    "executionTimeMillisEstimate" : "87.616",
                    "indexName" : "fields_index"
                },
                {
                    "nReturned" : "0",
                    "executionTimeMillisEstimate" : "0.018"
                }
            ]
        }
    },
    "serverInfo" : {
        "host" : "mongo-instance",
        "port" : 27017,
        "version" : "3.6.0"
    },
    "ok" : 1

It seems that the FETCH stage is the one taking extremely long, not the index scan. Why is that? Also, with the parameters I am using, the query is meant to return no results. The FETCH stage does return 0 results, but the index scan returns 300220 documents. Why?

CodePudding user response：

The order is not the issue here, the problem is you're not fully understanding how Mongo indexes array's.

The way Mongo does it is by flattening the arrays and indexing each element individually, This means that elements that look like so (below) will still match the index hence making the FETCH stage much bigger than it needs to be.

{
  _id: 123,
  field1: "test",
  array1: [
    {
      array2: [
        {
          field2: 1,
          object1: {
            field3: "no-test"
          }
        },
        {
          field2: 2,
          object1: {
            field3: "test"
          }
        }
      ]
    }
  ]
}

So what can we do?

First let's order the index in a more natural way, by having test as the field field in the compound index.
Index the full elements in array2, right now as I mentioned each key is flattened which makes the index have redundancies as you're querying a whole element. So instead of this:

"array1.array2.object1.field3": 1, "array1.array2.field2": 1

You should do:

"array1.array2": 1

This will obviously create a much larger index tree, which potentially could affect update's performance. If that nested object is too big step 2 might not be for you, but it is something that will boost your query speed.

CodePudding user response：

Here is a variation on the theme. I created a collection with 200,000 docs. 100,000 have field1 set to NOTtest so they don't even make the first cut. Of the other 100,000, each has an array1 of length 2 and within each, an array2 of length 3. One in 20,000 of those leaf elements is set to object1.field3:"test" and field2:4 to make it >1 and satisfy a two condition query (OP had gte 1, I made it gt 1 to make it clearer). So only 5 docs of the 200,000 will satisfy the required query. On a MacBookPro, the following query yields the 5 docs in 2.4 seconds with no indexes. The trick is to use $map to "dive" into the array to get to the desired target array and then use $filter to yield a populated array or an empty one. Empty arrays mean no match and are filtered out in the next stage.

This approach has the added advantage of returning only the subdocs with the matching fields. The challenge with $elemMatch is that a match on a subdoc in the array returns the whole array which might include subdocs that do not match. These must be further filtered in the pipeline or post-processed in the client-side code.

db.foo.aggregate([
    {$match: {field1: "test"}},

    {$project: {
        XX:{$map: {input: "$array1", as:"z1", in:
                {QQ: {$filter: {input: "$$z1.array2",
                                as: "z2",
                                cond: {$and:[
                                    {$eq:["$$z2.object1.field3", "test"]},
                                    {$gt:["$$z2.field2",1]}
                                ]}
                     }}
        }
        }}
    }}

    ,{$match: {$expr: {
        // total of length of QQ array(s) must be > 0                                                                      
        $gt:[ {$reduce: {input: "$XX",
                         initialValue: 0,
                         in: {$add:["$$value",{$size: "$$this.QQ"}]}
               }}, 0]
        }
    }}
]);

With the material substantially cut down, you can now $unwind and $project and otherwise tailor the output to your needs.

$map can be "chained" to dive to arbitrary depths:

var r = [
    {array1: [
        {array2: [
            {array3: [
                {array4: [
                    {f: "X"},
                    {f: "A"},
                    {f: "A"}
                ]}
            ]
            }
        ]}
    ]}
    ,
    {array1: [
        {array2: [
            {array3: [
                {array4: [
                    {f: "X"},
                    {f: "X"}
                ]}
            ]
            }
        ]}
    ]}
]

db.foo2.drop();
db.foo2.insert(r);

c = db.foo2.aggregate([
    {$project: {XX:
      {$map: {input: "$array1", as:"z1", in:
              {$map: {input: "$$z1.array2", as: "z2", in:
                      {$map: {input: "$$z2.array3", as: "z3", in:
                              {QQ: {$filter: {input: "$$z3.array4",
                                              as: "z4",
                                              cond: {$eq:["$$z4.f","A"]}
                                             }}
                              }
                       }}
             }}
         }}
   }}
]);

The output is admittedly a little array heavy but this approach avoids deep multiple $unwind that could explode a dataset by orders of magnitude.

CodePudding user response：

I found out the issue. Something I didn't mention in my original question is that I am using AWS' DocumentDB service, which has MongoDB compatibility. According to this, under the section "$ne, $nin, $nor, $not, $exists, and $elemMatch Indexing", it says that DocumentDB does not support using indexes with $elemMatch. The reason my query was using the index is because it used it for field1, which is not under $elemMatch. However, it did not work for the other two, so it still had to do a scan over thousands of results and filter by field2 and field3.

The way I fixed it was by rewriting my query. According to MongoDB documentation, I don't need to use $elemMatch for my query. So my query now looks like:

db.collection1.find(
{
  field1: "test",
  "array1.array2.field2": {
    $gte: 1
  }, 
  "array1.array2.object1.field3": "test"
})

The query functionally does the exact same, but this way it actually uses the index. It now takes <1 second to run the query. Thanks for all the help and great suggestions from everyone!