Selenium - driver.getPageSource() differs from the source viewed in the browser

Solution 1

The "source" you get from Selenium does not seem to be the original source at all. It appears to be the HTML serialized from the current DOM. The source you see in the browser (View Source) is the HTML as delivered by the server, before JavaScript makes any dynamic changes to it. If the DOM changes at all, the browser's source view does not reflect those changes, but Selenium's output will. If you want to see the current DOM in a browser, use the developer tools, not View Source.
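To get something closer to what View Source shows, one option is to request the page without a browser, so that no JavaScript ever runs. Below is a minimal sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+). The class name `RawSourceFetcher` and its methods are made up for illustration and are not part of Selenium. Keep in mind that a server may send different HTML depending on cookies or the user agent, so the result is not guaranteed to match the browser's copy byte for byte.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper (not from the original answer): fetches the HTML as the
// server sends it, with no browser and no JavaScript execution involved.
class RawSourceFetcher {

    // Builds a plain GET request for the given URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Returns the response body as text; this roughly corresponds to "View Source".
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java RawSourceFetcher.java <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

Comparing this output with `driver.getPageSource()` for the same URL should show which parts of the page were added or changed by scripts after loading.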

Solution 2

I encountered the same problem and used this code to solve it:

......
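String javascript = "return arguments[0].innerHTML";
// Cast the driver to JavascriptExecutor, run the script, then cast the result to String.
String pageSource = (String) ((JavascriptExecutor) driver)
    .executeScript(javascript, driver.findElement(By.tagName("html")));
// innerHTML does not include the <html> element itself, so wrap it back on.
pageSource = "<html>" + pageSource + "</html>";
System.out.println(pageSource);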
//FileUtils.write(new File("e:\\test.html"), pageSource,);
......

Reading the innerHTML property through JavaScript finally worked, and the question marks disappeared.
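The answer does not say where the question marks came from. If they also show up when the captured string is saved to disk, one possible cause (an assumption, not something confirmed in this thread) is that the file is written with the platform default charset. A minimal sketch that writes and reads the page as UTF-8 explicitly, using only the JDK; the class name `Utf8PageWriter` is hypothetical:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: saves page source with an explicit charset instead of
// relying on the platform default.
class Utf8PageWriter {

    // Encodes the string as UTF-8 and writes it, replacing any existing file.
    static void write(Path target, String pageSource) throws IOException {
        Files.write(target, pageSource.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the file back, decoding it as UTF-8.
    static String read(Path source) throws IOException {
        return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("page-source", ".html");
        write(target, "<html><body>caf\u00e9 \u2713</body></html>");
        System.out.println(read(target).length() + " characters written to " + target);
        Files.delete(target);
    }
}
```

Whatever tool opens the saved file also has to read it as UTF-8, otherwise the same symptom can reappear on the reading side.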

Solution 3

There are several places you can get the source from. You can try

String pageSource = driver.findElement(By.tagName("body")).getText();

and see what comes up.

Generally you do not need to wait for the page to load; Selenium does that automatically, unless parts of the page are loaded separately by JavaScript/Ajax.
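For the JavaScript/Ajax case, a fixed sleep can be replaced by polling a condition until it holds or a timeout expires. Selenium ships its own explicit waits (`WebDriverWait`) for this; the sketch below only illustrates the idea in plain Java so it runs without any dependency. `PollingWait` and `waitUntil` are hypothetical names, and the condition in `main` is a stand-in for a real check such as "the element I need is present".

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper: dependency-free illustration of an explicit wait.
class PollingWait {

    // Polls the condition until it returns true (returns true) or the
    // timeout elapses (returns false), sleeping for `interval` between checks.
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(interval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true after roughly 300 ms.
        boolean ready = waitUntil(
                () -> System.currentTimeMillis() - start >= 300,
                Duration.ofSeconds(2),
                Duration.ofMillis(50));
        System.out.println("ready = " + ready);
    }
}
```

Unlike `Thread.sleep(5000)`, this returns as soon as the condition is met and reports a timeout instead of silently continuing.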

You might want to add the differences you are seeing, so that we can understand what you really mean.

WebDriver does not render the page on its own; it just returns the page as the browser sees it.

Author by

roger_that

Java Software Developer

Updated on July 30, 2022

Comments

  • roger_that
    roger_that almost 2 years

    I am trying to capture the source code of a given URL into an HTML file using Selenium, but for some reason I am not getting the exact source code that we see in the browser.

    Below is my Java code to capture the source in an HTML file:

    private static void getHTMLSourceFromURL(String url, String fileName) {
    
        WebDriver driver = new FirefoxDriver();
        driver.get(url);
    
        try {
            Thread.sleep(5000);   // crude fixed wait so the page can finish loading
    
            List<String> pageSource = new ArrayList<String>(Arrays.asList(driver.getPageSource().split("\n")));
    
            // write the captured lines to the file name passed in by the caller
            writeTextToFile(pageSource, fileName);
    
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    
        System.out.println("quitting webdriver");
        driver.quit();
    }
    
    /**
     * creates file with fileName and writes the content
     * 
     * @param content
     * @param fileName
     */
    private static void writeTextToFile(List<String> content, String fileName) {
        PrintWriter pw = null;
        String outputFolder = ".";
        File output = null;
        try {
            File dir = new File(outputFolder + '/' + "HTML Sources");
            if (!dir.exists()) {
                boolean success = dir.mkdirs();
                if (success == false) {
                    try {
                        throw new Exception(dir + " could not be created");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
    
            output = new File(dir + "/" + fileName);
            if (!output.exists()) {
                try {
                    output.createNewFile();
                } catch (IOException ioe) {
                    ioe.printStackTrace();
                }
            }
            pw = new PrintWriter(new FileWriter(output, true));
            for (String line : content) {
                pw.print(line);
                pw.print("\n");
            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
        } finally {
            if (pw != null) {   // pw stays null if the FileWriter could not be opened
                pw.close();
            }
        }
    
    }
    

    Can someone shed some light on why this happens? How does WebDriver render the page, and how does the browser show the source?

  • roger_that
    roger_that over 10 years
    String pageSource=driver.findElement(By.tagName("body")).getText(); gives just the body text, but I need the complete HTML code with tags and all, so I guess this does not meet the requirement.
  • Madusudanan
    Madusudanan over 10 years
    Then you must add the differences that you see; we will not be able to give you a solution without that.
  • roger_that
    roger_that over 10 years
    I am unable to figure out a way to show you the difference. I am using Java-Diff-Util to compare the two HTML files (one created using the above code, the other saved manually from the browser's source view) and copying the deltas into a difference.txt file. The results are too weird and clumsy to show here. What should I do?
  • Madusudanan
    Madusudanan over 10 years
    The changes might be in several places. I have had situations where right-click View Source was fine, but getting the source using driver.getPageSource() added some extra '-' characters to the page, and I had to replace them to get the proper source. Your use case might be different. You have to compare the two manually to find out what the differences are and then work on them. But what functional testing are you trying to achieve by doing this?
  • roger_that
    roger_that over 10 years
    My idea, or rather the requirement, was to compare the web page with an older version in order to identify any changes in the page. No problem, there is a library for that: DaisyDiff. I was getting crystal clear diff results in HTML form when I manually grabbed the code and fed it to the comparator code. But when I tried to create the HTML source file automatically, a lot of HTML text was showing up in the result in addition to the UI, so I thought WebDriver might be copying extra code. However, I feel the < characters were encoded as &lt;, due to which they were not rendered as HTML tags.
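    If the captured markup really did arrive with its angle brackets entity-encoded, as the last comment suspects, decoding the basic entities before handing the text to the diff tool might help. This is only a sketch under that assumption; `EntityUnescaper` is a hypothetical name, and it handles just the five basic entities. A library such as Apache Commons Text (`StringEscapeUtils.unescapeHtml4`) covers the full set.

```java
// Hypothetical helper: decodes the five basic HTML entities in a string.
class EntityUnescaper {

    // "&amp;" is decoded last so that a double-escaped "&amp;lt;" becomes
    // "&lt;" rather than being decoded twice into "<".
    static String unescapeBasic(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&#39;", "'")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // prints: <p>Hello & welcome</p>
        System.out.println(unescapeBasic("&lt;p&gt;Hello &amp; welcome&lt;/p&gt;"));
    }
}
```

    Note that applying this to a whole page also decodes entities that were escaped on purpose inside text content, so it is better suited to diagnosing the problem than to producing a faithful copy of the page.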