How to get the "text" of an HTML page? (WebBrowser - Delphi)

Solution 1

You can use a TWebBrowser instance to parse the HTML code and extract the plain text from it.

See this sample:

uses
  Variants, // VarArrayCreate
  MSHTML,   // IHTMLDocument2, IHTMLBodyElement
  SHDocVw,  // TWebBrowser
  ActiveX;  // PSafeArray

function GetPlainText(const Html: string): string;
var
  DummyWebBrowser: TWebBrowser;
  Document       : IHTMLDocument2;
  DummyVar       : Variant;
begin
  Result := '';
  DummyWebBrowser := TWebBrowser.Create(nil);
  try
    // Open a blank page to create an IHTMLDocument2 instance
    DummyWebBrowser.Navigate('about:blank');
    Document := DummyWebBrowser.Document as IHTMLDocument2;
    if Assigned(Document) then // if no document is available, Result stays empty
    begin
      // Create a one-element variant array to pass the HTML code to IHTMLDocument2.write
      DummyVar    := VarArrayCreate([0, 0], varVariant);
      DummyVar[0] := Html; // assign the HTML code to the variant array
      Document.Write(PSafeArray(TVarData(DummyVar).VArray)); // write the HTML into the document
      Document.Close;
      Result := (Document.body as IHTMLBodyElement).createTextRange.text; // get the plain text
    end;
  finally
    DummyWebBrowser.Free;
  end;
end;

Solution 2

You should look at using the Delphi DOM HTML parser.

Solution 3

If the asterisks are always present, you can simply take every character between the ** markers. If they are not, you can process the string and erase all tags (anything that starts with < and ends with >). Alternatively, you can use a DOM parser library for this.
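The tag-erasing idea can be sketched roughly as follows. This is a minimal, naive illustration rather than a robust solution: StripTags is a hypothetical helper name, and the sample string is made up for the demo. It does not decode entities such as &nbsp; and it will misbehave when < or > appear inside attribute values, comments, or scripts.

```pascal
program StripTagsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: drops every character from '<' up to and
// including the next '>'. Everything outside tags is copied as-is.
function StripTags(const Html: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
    case Html[I] of
      '<':
        InTag := True; // a tag starts here
      '>':
        begin
          if InTag then
            InTag := False // the tag ends here
          else
            Result := Result + Html[I]; // stray '>' outside a tag is kept
        end;
    else
      if not InTag then
        Result := Result + Html[I];
    end;
end;

begin
  // Made-up sample input in the style of the question
  Writeln(Trim(StripTags(
    '<P align=center><FONT color=#ccffcc size=3>Hello There</FONT></P>')));
  // prints: Hello There
end.
```

For anything beyond simple, well-formed fragments like this one, a real parser is the safer choice.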

Solution 4

In essence: in general you can't.

HTML is a markup language with such wide use, and such mind-boggling possibilities for changing the content dynamically, that it is virtually impossible to do this in general (just look at how hard the web browser vendors have to work to pass, for instance, the Acid tests). So you can only handle a subset.

For specific, well-defined subsets of HTML you have a better chance:

First you need to get the HTML in a string, then parse that HTML.

Getting the HTML can be done for instance using Indy (see answers to this question).
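As a rough sketch of the Indy route, assuming Indy is installed and the page is served over plain HTTP: FetchHtml is a hypothetical helper name, not something from the linked answers. HTTPS URLs additionally need an SSL IOHandler assigned to the TIdHTTP instance, which is left out here.

```pascal
uses
  IdHTTP;

// Hypothetical helper: downloads the page at Url and returns its
// source as a string. Indy raises an exception on HTTP or socket
// errors, so callers may want to wrap this in try..except.
function FetchHtml(const Url: string): string;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    Result := Http.Get(Url);
  finally
    Http.Free;
  end;
end;
```

The returned string can then be handed to whichever parsing approach fits your HTML.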

Parsing depends heavily on your HTML and can be quite complex; you can try this question or this search.

You could use TWebBrowser as RRuz suggests, but it depends on Internet Explorer.
Modern Windows systems do not guarantee that Internet Explorer is installed any more...

--jeroen

Author: Kermia

Updated on June 20, 2022

Comments

  • Kermia
    Kermia almost 2 years

    I'm using WebBrowser to get the source of HTML pages. The page source contains some text and some HTML tags, like this:

    FONT></P><P align=center><FONT color=#ccffcc size=3>**Hello There , This is a text in our html page** </FONT></P><P align=center> </P>
    

    The HTML tags are random and we cannot predict them. Is there any way to get only the text, separated from the HTML tags?