Issues with Server 2012 using DFSR running on Hyper-V 2012


OK, I'm not sure if this will be of any help, but the factor I have in common with you is that I had my drives connected to a PERC H310 controller, and I was running a file server in a virtual environment with its data drive mapped to a raw disk connected to the same H310. At random times, usually during periods of high I/O, the virtual machine would complain that it could not access the drive and would crash. I ended up connecting the drives to the onboard Intel controller and have had no problems since. I personally think the low-end PERC cards have quirks that can cause issues with I/O-sensitive operations.

Author: Kcmamu

Developer, Systems Admin, Geek, Gadget lover, etc. etc. I started programming in BASIC at the age of 11 on a Sinclair ZX81, then advanced to a BBC Model B, where I learned 6502 assembly language programming. I never really worked with PCs until the early 90s. In the late 90s, I joined a higher educational institution as a desktop technician and was quickly promoted to systems admin, working predominantly on Windows systems, but with a keen interest in Linux systems too. I later got involved in software development, working in C#, PHP and C. In my current employment, I'm the manager of the company's Information Systems department. The primary focus of our business is industrial control systems (mostly legacy systems). The work isn't exclusively legacy/control systems though, as we also support modern systems for a number of business customers.

Updated on September 18, 2022

Comments

  • Kcmamu
    Kcmamu almost 2 years

    We have a number of Server 2012 systems, all of which run virtualised on Hyper-V 2012 server. We are having problems with two such virtual instances, both of which are used as file servers, whereby they occasionally stop responding to requests to serve files to clients. After logging on to the server, attempts to shut it down gracefully fail (no error, it just fails to acknowledge a shutdown request).

    Recovery is a case of power cycling the server(s) from the Hyper-V console.

    These two servers don't serve a large number of users (one serves no more than 6 users, and the other serves around 20 users). They are in the same domain, but on different physical hardware (and at different sites). They don't lock up at the same time. They both use DFSR to replicate a fairly large amount of data (200GB) between themselves over ADSL connections. The replication itself is working fine, and we used DFSR in the same way on the previous two generations of server OS (Server 2008 R2 and Server 2003), although both of those were physical installs.

    Today, when one of the servers crashed, I noticed an entry in the event log, which looked similar to the following:

    Log Name:      Application
    Source:        ESENT
    Date:          27/11/2012 10:25:55
    Event ID:      533
    Task Category: General
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      HAL-FS-01.example.com
    Description:
    DFSRs (1500) \\.\E:\System Volume Information\DFSR\database_C8CC_101_CC00_EC0E\
    dfsr.db: A request to write to the file "\\.\E:\System Volume Information\
    DFSR\database_C8CC_101_CC00_EC0E\fsr.log" at offset 4423680 (0x0000000000438000)
    for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is
    likely due to faulty hardware. Please contact your hardware vendor for further
    assistance diagnosing the problem.
    

    When the server started up again, I went to find the event log entry to investigate further and found that the event log entry was no longer there (I assume it was in memory but failed to write to disk before the server was powered off, for the reason mentioned in the message). I found the above message by searching back further in the event log.
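    For anyone collecting several of these warnings, the numbers in the description can be pulled out with a short script. This is only a rough sketch: `parse_stalled_write` and `unwrap` are hypothetical helper names, and the regular expression assumes the wording of event 533 exactly as quoted above, which may differ on other builds or locales.

```python
import re

# Event text as quoted in the question (line-wrapped by the event viewer).
SAMPLE = r"""
DFSRs (1500) \\.\E:\System Volume Information\DFSR\database_C8CC_101_CC00_EC0E\
dfsr.db: A request to write to the file "\\.\E:\System Volume Information\
DFSR\database_C8CC_101_CC00_EC0E\fsr.log" at offset 4423680 (0x0000000000438000)
for 4096 (0x00001000) bytes has not completed for 36 second(s). This problem is
likely due to faulty hardware. Please contact your hardware vendor for further
assistance diagnosing the problem.
"""

# Assumed message shape, based only on the sample above.
STALL_RE = re.compile(
    r'write to the file "(?P<path>[^"]+)" at offset (?P<offset>\d+) '
    r'\(0x[0-9A-Fa-f]+\) for (?P<length>\d+) \(0x[0-9A-Fa-f]+\) bytes '
    r'has not completed for (?P<seconds>\d+) second'
)


def unwrap(description):
    """Join wrapped lines; a line ending in a backslash was split mid-path."""
    out = ""
    for line in description.splitlines():
        line = line.strip()
        if not line:
            continue
        if out and not out.endswith("\\"):
            out += " "
        out += line
    return out


def parse_stalled_write(description):
    """Return path, offset, length and stall time, or None if no match."""
    match = STALL_RE.search(unwrap(description))
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "offset": int(match.group("offset")),
        "length": int(match.group("length")),
        "seconds": int(match.group("seconds")),
    }


info = parse_stalled_write(SAMPLE)
print(info)
```

    Feeding it the text of each warning gives one record per stalled write, which makes it easier to see whether the same file and volume are involved every time.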

    Both of these virtual servers have their E: volumes fully allocated as opposed to dynamically expanding, and there are no other issues on any of the other virtual servers (which include Server 2012, Server 2008 R2 and Ubuntu 12.04 x64). There are no signs of I/O, memory or CPU starvation on the host systems.

    I've used performance counters on the affected virtual servers to monitor memory usage (including non paged pool usage), as well as CPU and network utilisation, and none of these show any signs of trouble when the issue arises.
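    One possible reason averaged counters can look healthy even when a single write stalls, shown as a rough sketch: the figures below are invented for illustration (they are not measurements from these servers), and `mean_latency_ms` is a hypothetical helper, not a Windows counter.

```python
def mean_latency_ms(normal_writes, normal_ms, stalled_writes, stalled_ms):
    """Mean latency over one sampling window, in milliseconds."""
    total_ms = normal_writes * normal_ms + stalled_writes * stalled_ms
    return total_ms / (normal_writes + stalled_writes)


# Assumed numbers: 50,000 ordinary writes at 5 ms each in a window,
# plus one write that takes 36 seconds (as in the ESENT warning).
baseline = mean_latency_ms(50_000, 5.0, 0, 0.0)
with_stall = mean_latency_ms(50_000, 5.0, 1, 36_000.0)

print(f"mean without stall: {baseline:.2f} ms")
print(f"mean with one 36 s stall: {with_stall:.2f} ms")
print(f"worst single write: {36_000.0:.0f} ms")
```

    Under these assumptions the mean moves by well under a millisecond, so a chart of averages would not stand out, while the worst single write is still 36 seconds. Per-request tracing would be needed to see it.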

    I would have thought our configuration isn't that uncommon, so I'm wondering if anyone else has seen this, and managed to resolve the problem?

    The host specifications are as follows:

    hal-vm-01: runs a total of 5 virtual servers (the affected file server, a DC and other guests). Dell PowerEdge R710, 16GB, 6 x 300GB SAS 15K in RAID 10, PERC H700.

    hey-vm-01: runs 2 virtual servers (the affected file server and a DC). Dell PowerEdge T620, 16GB, 2 x 3TB SATA in RAID 1, PERC H310.

    We have a further virtual server host, hal-vm-02, running 5 guests. It is unaffected by this problem and is a lower spec than hal-vm-01, but loaded about the same (Exchange, DC, SQL and other guests). More memory is on the way so that we can configure shared nothing failover between this host and 'hal-vm-01'.

    There is AV software (MS SCEP) running on the two affected virtual servers. It is configured to scan on create only, and not to scan files created by the dfsrs.exe process. There is no AV software running on the VM hosts themselves.

    We are using Windows Server 2012 backup on the host hal-vm-01 to back up all the VMs. The other affected host, hey-vm-01, isn't backed up, as it just holds an off-site DFSR copy of the data at our main office. Another backup job runs on the affected virtual guest hal-fs-01. This also uses Windows Server backup, to take snapshots of the data stored in the DFS replicated shares. Both backup jobs run out of office hours.


    Three months later...

    We've had a support ticket open with Microsoft for over three months now, and we have sent them lots of performance counter logs, memory dumps and event logs. The analysis they performed indicated a problem with one of the virtual drives of hal-fs-01 (the virtual server with the problem). The virtual drive in question was the server's E:\ drive, which just happened to hold all our DFSR groups and shares. Recently, I moved all data off the E:\ drive to several smaller virtual disks that I added to the server, and of course moved all the shares and DFSR groups, leaving just the Windows Deployment Services files on the E:\ drive. Despite this, we still saw the problem with writes to the E:\ drive failing.

    Last week I moved the WDS files to a new virtual disk and disabled the WDS service. I also deleted the E:\ virtual disk, just in case there was some anomaly with the disk. Since then we've not had another failure, but it's too early to know whether this has fixed the problem: our longest uptime was previously around 2 weeks, and as of the time of this edit (20/03/2013) we are only one week into the current config. If the problem hasn't surfaced again by next week, I'll re-enable WDS, as I have a suspicion that WDS could be the culprit.

    I'll keep this question updated (or provide an answer if I manage to resolve the problem).


    Moved back to Server 2008 R2...

    I haven't updated the question with progress, but we ended up rolling back to Server 2008 R2, and everything works fine. I'd still be interested in hearing from anyone who has had this issue and managed to find a fix.

    • Kcmamu
      Kcmamu over 11 years
      @pauska I'm inclined to think the problem lies somewhere in Server 2012's DFSR implementation, Hyper-V 2012, or a weird combination of the two. We have SA, so I'll investigate that, thanks. I can't believe I'm the only one with a configuration like this though, which is why I thought I'd ask here. Of course, copious amounts of Google searching have returned nothing of any interest.
    • pauska
      pauska over 11 years
      Are you by any chance using replicas on these VMs?
    • Kcmamu
      Kcmamu over 11 years
      @pauska, Yes, but only on one of the two affected VMs. The hosts hal-vm-01 and hal-vm-02 replicate all VMs. The affected server on hey-vm-01 doesn't have a replica.
    • pauska
      pauska over 11 years
      Can you disable HV replica on them and see if it solves it? I'm kind of thinking that DFS-R and VSS do not play nicely together in WS2012.
    • Kcmamu
      Kcmamu over 11 years
      I'll certainly give that a go @pauska on the one host that has a replica, but the second VM fails in the same way and doesn't have a replica copy (and doesn't use VSS), which kind of suggests this isn't the problem. Out of interest, have you had any experience yourself that makes you think this, or are you basing this on the info in my question?
    • Kcmamu
      Kcmamu over 11 years
      Thanks @pauska for your help. I'll give that a try, it can't hurt.
    • tony roth
      tony roth over 11 years
      do you have a host based antivirus solution running on the parent partition?
    • Kcmamu
      Kcmamu over 11 years
      @tonyroth I've updated the question with details of the AV in use.
    • tony roth
      tony roth over 11 years
      @Bryan "Please contact your hardware vendor for further assistance diagnosing the problem" means the host is having problems. In this case, the drive that's hosting the vhd(x) may be experiencing a problem.
    • Kcmamu
      Kcmamu over 11 years
      @tonyroth Understood, but if the host were having problems, why are no other VMs on that server being affected? The server is easily capable of handling the load generated by our low user base. Remember we are seeing this on two virtual servers. One of the hosts has just two guest VMs: a file server for a maximum of 6 users, and a domain controller. We are talking very low usage here, and more than capable server hardware.
    • longneck
      longneck over 11 years
      What type of storage is backing the VHDs and the DFS data? iSCSI, or local storage?
    • Admin
      Admin over 11 years
      Bizarrely, we have EXACTLY the same problem but a slightly more simple setup than you. Will try and find that log entry too....
    • Kcmamu
      Kcmamu over 11 years
      @longneck Storage is local - SAS 15K on one server, SATA on the less used server.
    • Kcmamu
      Kcmamu over 11 years
      @Julian Interesting, be sure to check your application log before rebooting, as the entry doesn't always get written to disk, and hence no longer exists when you reboot the server.
    • Kcmamu
      Kcmamu over 11 years
      I've now contacted Microsoft PSS regarding the issue.
    • longneck
      longneck over 11 years
      Did you get a resolution to your issue?
    • Huron
      Huron over 10 years
      We've started to encounter a similar issue on a Windows 2012 Hyper-V based system connected to a StarWind SAN with primary (SAS) and secondary (SATA-3) storage. Last week the file server at site A became totally unresponsive, with the log full of ESENT/DFS-R errors. Then the same thing happened yesterday at site B (sites A and B replicate between each other). In both cases, we were able to shut down the virtual file server. The core problem though was that our DFS structure was impacted across all sites until this was resolved.
    • Huron
      Huron over 10 years
      It is rather ironic that a technology designed to help business continuity (failover to other site) caused us so many problems ;-) The event log is full of ESENT warnings, and these usually coincide with DPM 2012 carrying out synchronisation. This uses VSS, and as this is (I understand) an atomic operation, I'm not totally surprised that there is a 20 second delay in normal operation whilst the snapshot is created. This would be fine if the warnings were just that ("Took a while to respond - ohh yes, it's VSS"), but what appears to happen is that DFS-R/ESENT sometimes gets into a state...
    • Huron
      Huron over 10 years
      ...whereby it's continually reporting these errors. If delays in disk writes are an expected side effect of VSS, then I'm in the camp that there is a flaw in the error handling/retry code in ESENT/DFS-R. BTW - like the originator, we previously used the same setup on Windows 2008 servers running on XenServer and a less powerful StarWind SAN. That worked fine...
    • Huron
      Huron over 10 years
      BTW - I've watched the StarWind SAN during these DPM sync/VSS windows and it hardly breaks into a sweat disk-wise: queue length of around 2, with occasional peaks of 5. The 4 x 1Gbit SAN backbone is busy (60%), which indicates we're limited by the network and not by disk. During sync, it's mainly reads from the SAN to the DPM dedicated RAID-5 array - and we know how fast RAID-5 is at writing... So whilst it might be easy to point the finger at SAN/network/disks etc., I feel that ESENT/DFS-R is the cause - it's not resilient enough.
  • Kcmamu
    Kcmamu over 11 years
    I'd agree but for the fact that none of the other systems on the same host are affected, and the performance counters don't suggest this is the case.
  • tony roth
    tony roth over 11 years
    @Bryan I tend to agree with tomtom, the perf counters won't show you the issue in this case. How frequently does the problem occur or can you get it to repeat on demand? BTW the error message you posted is probably talking about an issue with the host not the guest.
  • Kcmamu
    Kcmamu over 11 years
    @tonyroth Why wouldn't they? I've got performance counters running on both the host and the guest. I've not found a way of creating the problem on demand, and it happens once every 5 - 10 working days. It always happens when staff are in the office, never out of hours.
  • Kcmamu
    Kcmamu over 11 years
    The usual suspects I guess, Logical & Physical disk, Avg. Queue Lengths (read, write & total), Memory, CPU, Non Paged pool, on both virtual and physical host.
  • tony roth
    tony roth over 11 years
    Yes, exactly, these won't show the issue at all. This article is about w2k8r2: support.microsoft.com/kb/978000, but you can ignore the downloading of the hotfix, as w2k8r2sp2+ has it by default. Just read the section on setting the values for thresholds. My guess is that the storport ETW results will show quite a dropout at the physical disk level; what's causing this will be the interesting part. BTW, do this at the host level.
  • Kcmamu
    Kcmamu over 11 years
    We have two physical servers with this problem, one with a PERC H310 and one with a PERC H700. I'm pretty sure it isn't the controller, as it only affects one virtual drive on each server. The common factors for me are Server 2012, DFSR, Hyper-V and, as recently noted, WDS on both servers.
  • Admin
    Admin over 11 years
    Sorry I could not help. The only other thing I could add is to make sure write caching is turned on for the PERC controllers, if you want it enabled that is. I found that with some of them, when you add disks, they default to no write cache, which can hinder write speeds. Good luck with the problem, I hope you get it solved.