I just completed a major refactoring of a piece of code inside RavenDB that is responsible for how we manage sorted queries. The first two tiers of tests all pass, which is great. Now was the time to test how this change performed. I threw 50M records into RavenDB and indexed them. Since I was heavily refactoring to get a particular structure, I could think of a few ways to improve performance, but I like doing this based on profiler output.

When running the same scenario under the profiler, the process crashed. That is… quite annoying, as you can imagine. In fact, I discovered a really startling issue: if I index the data and query on it, I get the results I expect; if I restart the process and run the same query, I get an ExecutionEngineException. In this case, I'm 100% at fault. We are doing a lot of unsafe things to get better performance, and it appears that I messed up something along the way.

But my only reproduction is a 50M records dataset. To give some context, this means 51GB of documents to be indexed and 18 GB of indexing. Indexing this in release mode takes just under 20 minutes. Trying to find an error there, especially one that can only happen after you restart the process, is going to be a challenging task.

Part of good system design is knowing how to address just these sorts of issues. The indexing process inside of RavenDB is single threaded per index. That means that we can rule out a huge chunk of issues around race conditions. It also means that we can play certain tricks. Allow me to present you with the nicest tool for debugging that you can imagine: repeatable traces.

What the tracing code does is capture the indexing and commit operations on the system and write them to a file. This is a development-only feature, so it is a really bare-bones one. I have another piece of similarly trivial code that reads the trace and applies it. Don't bother to dig into that; the code itself isn't really that interesting. What is important is that I have captured the behavior of the system and can now replay it at will.

The code itself isn't much, but it does the job. What is more important, note that we have calls to StopDatabase() and StartDatabase(). I was able to reproduce the crash using this code. That was a massive win, since it dropped my search area from 50M documents to merely 1.2 million.

The key aspect with this is that I now have a way to play around with things. In particular, instead of using the commit points in the trace, I can force a commit (and a start / stop of the database) every 10,000 items (by calling FlushIndexAndRenewWriteTransaction). When using that, I can reproduce this far faster. Here is the output when I run this in release mode:

    Fatal error. …

So now I dropped the search area to 120,000 items, which is pretty awesome. Even more important, when I run this in debug mode, I get this:

    … at ….Get(Low…

So now I have a repro in 30,000 items. What is even better, a debug assertion was fired, so I have a really good lead into what is going on. The key challenge in this bug is that it is probably triggered as a result of a commit and the indexing of the next batch. There is a bunch of work that we do around batch optimizations that likely causes this sort of behavior.

By being able to capture the input to the process and play with the batch size, we were able to reduce the amount of work required to generate a reproduction from 50M records to 30,000, and to get a lead into what is going on. With that, I can now start applying more techniques to narrow down what is going on.
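The repeatable-trace technique described above can be sketched roughly as follows. This is a stand-in illustration in Python rather than RavenDB's C#, and every name in it is invented except FlushIndexAndRenewWriteTransaction, StopDatabase(), and StartDatabase(), which the post mentions; the real code records and replays the indexing and commit operations, with an option to force a commit and database restart every N items instead of honoring the recorded commit points:

```python
import json

class TraceRecorder:
    """Capture indexing and commit operations into a replayable file
    (hypothetical stand-in for the development-only tracing code)."""
    def __init__(self, path):
        self.f = open(path, "w")

    def index(self, doc_id):
        self.f.write(json.dumps({"op": "index", "id": doc_id}) + "\n")

    def commit(self):
        self.f.write(json.dumps({"op": "commit"}) + "\n")

    def close(self):
        self.f.close()

def replay(path, db, force_commit_every=None):
    """Replay a recorded trace against a database object.

    If force_commit_every is set, the recorded commit points are ignored
    and we instead commit and restart the database every N items -- the
    equivalent of calling FlushIndexAndRenewWriteTransaction in the post.
    """
    since_commit = 0
    for line in open(path):
        op = json.loads(line)
        if op["op"] == "index":
            db.index(op["id"])
            since_commit += 1
            if force_commit_every and since_commit >= force_commit_every:
                db.commit()
                db.stop()   # StopDatabase()
                db.start()  # StartDatabase()
                since_commit = 0
        elif op["op"] == "commit" and force_commit_every is None:
            db.commit()
```

The point of the split is that capture is trivial and cheap, while replay becomes a knob you can turn: the same trace can be replayed with the original commit pattern, or re-batched at any size to shrink the search area.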