Regular Expression to find src from IMG tag

Solution 1

You don't want a regular expression; you want an HTML parser. For example, with the HTML Agility Pack:

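using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // SelectNodes returns null when nothing matches the XPath.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            // Read the src attribute; the default value is used if it is missing.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }
}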

Solution 2

As pointed out, regular expressions are not the perfect solution, but you can usually build one that is good enough for the job. This is what I would use:

string newHtml = Regex.Replace(html,
      @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)",
      m => "http://www.stackoverflow.com" + m.Value);

It will match src attributes delimited by single or double quotes.

Of course, you would have to change the lambda/delegate to do your own replacing logic, but you get the idea :)
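
As one possible version of that replacing logic, here is a minimal sketch that resolves each src value against a base address with Uri.TryCreate instead of concatenating strings, so values that are already absolute are left alone. The class name ImgSrcRewriter and the method name MakeAbsolute are made up for illustration, and RegexOptions.IgnoreCase is an addition not present in the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ImgSrcRewriter
{
    // Same pattern as above: the value of a quoted src attribute inside an <img> tag.
    private static readonly Regex SrcPattern = new Regex(
        @"(?<=<img\s+[^>]*?src=(?<q>['""]))(?<url>.+?)(?=\k<q>)",
        RegexOptions.IgnoreCase);

    public static string MakeAbsolute(string html, Uri baseUri)
    {
        return SrcPattern.Replace(html, m =>
        {
            // Combine the matched value with the base address.
            // If the two cannot be combined, keep the original value.
            return Uri.TryCreate(baseUri, m.Value, out Uri absolute)
                ? absolute.AbsoluteUri
                : m.Value;
        });
    }
}
```

It could be called as ImgSrcRewriter.MakeAbsolute(html, new Uri("http://www.stackoverflow.com")). Like the pattern it is built on, this is still a text-level approximation and can be fooled by unusual markup, such as unquoted attribute values.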

Author: Waheed

Updated on June 27, 2022

Comments

  • Waheed, almost 2 years

    I have a web page. From it, I want to find all the IMG tags and get the SRC of each one.

    What regular expression would do this?

    Some explanation:

    I am scraping a web page. All the data is displayed correctly except the images. To solve this, I now have an idea: find the SRC and replace it, e.g.

    /images/header.jpg
    

    and replace this with

    www.stackoverflow/images/header.jpg