Clear cache of virtual machine


This question seems to be based on the premise that the speed increase the second time around is "due to the OS caching that just keeps all the data in main memory". I wouldn't be so sure that's the only difference between the first and subsequent runs. If the performance difference were due to the host caching the VM's RAM, then rebooting the VM should make a negligible difference, and you'd need to reboot the host to see any change.

One thing that can affect performance between the first and subsequent runs is query parsing and compilation: determining an appropriate execution plan is fairly hard work for the database engine, so the results are usually cached. The impact of this can range from negligible to substantial, depending on what else the database engine has to do to satisfy the query.

If you have enough RAM to do so, one way to work around caching would be to simply move the database files to a large RAM disk for the duration of your tests. By monitoring the I/O statistics, you can estimate the amount of I/O incurred by the query, and hence the effects of various optimization techniques on performance, without needing to worry about the effects of data caching, because all the data is already in RAM.
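As a sketch of the RAM-disk approach (the paths, the tmpfs size, and the data-directory layout are all illustrative, and the commands need root on a Linux guest):

```shell
# Illustrative only: host a database's data directory on a tmpfs RAM disk
# for the duration of a benchmark. Paths and size are placeholders.
mkdir -p /mnt/ramdb
mount -t tmpfs -o size=4g tmpfs /mnt/ramdb

# Stop the database, copy its data directory onto the RAM disk, and
# reconfigure the engine to use the copy (example path, not a requirement):
cp -a /var/lib/mydb /mnt/ramdb/

# ... run the benchmark against /mnt/ramdb/mydb ...

# Tear down afterwards; everything on the tmpfs is lost on unmount.
umount /mnt/ramdb
```

Note that a tmpfs can itself be swapped out under memory pressure, so size it well within available RAM.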

You don't say which database engine you are running, so it's hard to give specific suggestions. In Microsoft SQL Server, you'd run something like SET STATISTICS IO, TIME ON and/or SET STATISTICS PROFILE ON before executing your query to get data on how hard the database server has to work to execute the query in question; other database engines almost certainly have similar features (it's a basic prerequisite for query performance tuning).

Note that such statistics often include the number of actual I/O requests, and since those I/O requests may, but will not necessarily, be satisfied from an OS-level cache, those numbers can be a useful indicator of how much data is involved in executing the query. Large differences between the query plan and the actual outcome, particularly in the amount of I/O or the number of rows at various steps, have performance implications, because they mean the database engine is making poor decisions about which algorithms to use. Large amounts of I/O anywhere may very well mean you are hitting the disk more than necessary, which comes at a performance cost.
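For example, a SQL Server measurement run from the command line might look like this (the server name and query are placeholders; DBCC FREEPROCCACHE clears the cached execution plans, which is one way to take plan caching out of the equation between runs):

```shell
# Illustrative sqlcmd invocations; -S (server) and the query are placeholders.
# Clear cached execution plans so plan compilation happens again next run:
sqlcmd -S myserver -Q "DBCC FREEPROCCACHE;"

# Run the query with I/O and timing statistics enabled:
sqlcmd -S myserver -Q "SET STATISTICS IO, TIME ON; SELECT COUNT(*) FROM mytable;"
```

Other engines have equivalents, e.g. EXPLAIN ANALYZE in PostgreSQL.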


Author: Antigo

Updated on September 18, 2022

Comments

  • Antigo over 1 year

    I already posted this on StackOverflow, but it was flagged as Off-Topic. Maybe you guys are able to help me.

    I'm currently doing some database benchmarking on a virtual machine running Ubuntu 12.04. I've noticed that the second time I execute a query, it runs significantly faster. This is most likely due to the OS caching that just keeps all the data in main memory. To keep the cache from screwing up my measurements, I thus want to clear it between subsequent runs.

    I found the following commands on Google to achieve this:

    sync;echo 3 > /proc/sys/vm/drop_caches
    

    and

    sysctl -w vm.drop_caches=3
    

    both of which yield a permission-denied error even when I'm logged in as root. It seems it's just not possible to clear the system's cache from the guest system; I guess this is because it uses the host's cache. As I don't have access to the host, I have to find a workaround. Currently I have two ideas.

    The first idea is to reboot the machine between runs, as this clears the cache. Since I want to perform a couple dozen runs, I really need to automate this. I could put a program into autostart, have it perform a query and reboot, and continue with the next query on the next startup. Feels like writing a virus, though.

    The second idea is to just flood the memory with other data. As my machine has quite a bit of RAM, I would, for example, generate a large file of random data and just read it into /dev/null.

    So finally my question: does anyone have a better idea for clearing the cache, or maybe for avoiding the cache altogether? Or does anyone have a suggestion on how to implement one of my two ideas easily?

    Thank you very much in advance, Antigo
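    The memory-flooding idea above can be sketched like this (SIZE_MB is deliberately small here for illustration; to actually push the database pages out of the cache it would have to exceed the machine's RAM):

```shell
# Sketch: evict cached pages by streaming a file through the page cache.
# SIZE_MB is illustrative; in practice it should exceed the machine's RAM.
SIZE_MB=64
FLOOD=/tmp/cache_flood.bin
dd if=/dev/urandom of="$FLOOD" bs=1M count="$SIZE_MB" 2>/dev/null
cat "$FLOOD" > /dev/null    # read it back so it passes through the cache
rm -f "$FLOOD"
echo "flooded ${SIZE_MB} MB through the page cache"
```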

    • c4f4t0r over 10 years
      What kind of database are you using? If you don't want your queries to be cached by the OS, use IODIRECT.
  • Antigo over 10 years
    Moving the database to RAM is not a solution. I want to measure the performance of the databases on disk; if everything is on a RAM disk, I can't do that. And I'm not only using relational databases, but also NoSQL databases, which mostly have no query planner.
  • user
    user over 10 years
    @Antigo So ignore the words "execution plan", then. Since you didn't specify what you are working with, any answer will be very generic; if you want specific answers, you need to provide specific details in the question. And since your goal is to remove the effects of OS-level caching on query performance, putting the data in RAM should accomplish that. Look at how much data is accessed while the query is being executed and optimize the code to minimize that, and you will automatically reduce disk I/O as well, resulting in improved performance.