C# Convert Relative to Absolute Links in HTML String

Solution 1

The most robust solution would be to use the HTML Agility Pack, as others have suggested. However, a reasonable regular-expression solution is possible using the Regex.Replace overload that takes a MatchEvaluator delegate, as follows:

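using System;
using System.Text.RegularExpressions;

var baseUri = new Uri("http://test.com");
// Matches src="..." or href="..." where the double-quoted value starts with "/".
var pattern = @"(?<name>src|href)=""(?<value>/[^""]*)""";
var matchEvaluator = new MatchEvaluator(
    match =>
    {
        var value = match.Groups["value"].Value;
        Uri uri;

        // Resolve the value against the base URI.
        if (Uri.TryCreate(baseUri, value, out uri))
        {
            var name = match.Groups["name"].Value;
            return string.Format("{0}=\"{1}\"", name, uri.AbsoluteUri);
        }

        // Leave the match unchanged when the value cannot be resolved.
        return match.Value;
    });
// originalHtml holds the downloaded page as a string.
var adjustedHtml = Regex.Replace(originalHtml, pattern, matchEvaluator);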

The above sample searches for attributes named src and href that contain double-quoted values starting with a forward slash. For each match, the static Uri.TryCreate method is used to check that the value is a valid relative URI and to resolve it against the base URI.

Note that this solution doesn't handle single-quoted attribute values, and it certainly doesn't work on poorly formed HTML with unquoted values.
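As a rough sketch of how the single-quote limitation could be lifted, the pattern can capture the opening quote and require the same character to close the value. The class name LinkRewriter and the method name MakeAbsolute are made up for illustration, and unquoted values are still not handled:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkRewriter
{
    // Hypothetical helper, not part of the original answer.
    public static string MakeAbsolute(string html, Uri baseUri)
    {
        // (?<quote>...) captures the opening quote; \k<quote> requires the
        // closing quote to be the same character.
        var pattern = @"(?<name>src|href)=(?<quote>[""'])(?<value>/.*?)\k<quote>";

        return Regex.Replace(html, pattern, match =>
        {
            Uri uri;
            if (!Uri.TryCreate(baseUri, match.Groups["value"].Value, out uri))
            {
                // Leave values that cannot be resolved untouched.
                return match.Value;
            }

            var quote = match.Groups["quote"].Value;
            return match.Groups["name"].Value + "=" + quote + uri.AbsoluteUri + quote;
        }, RegexOptions.IgnoreCase);
    }
}
```

It would be called as LinkRewriter.MakeAbsolute(originalHtml, new Uri("http://test.com")).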

Solution 2

You should use the HTML Agility Pack to load the HTML, access all the href attributes with it, and then use the Uri class to convert them from relative to absolute as necessary.

See for example http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/

Solution 3

Uri WebsiteImAt = new Uri(
       "http://www.w3schools.com/media/media_mimeref.asp?q=1&s=2,2#a");
string href = new Uri(WebsiteImAt, "/something/somethingelse/filename.asp")
       .AbsoluteUri;
string href2 = new Uri(WebsiteImAt, "something.asp").AbsoluteUri;
string href3 = new Uri(WebsiteImAt, "something").AbsoluteUri;

which with your Regex-based approach is probably (untested) mappable to:

String value = Regex.Replace(
    text,
    "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>",
    match =>
        "<" + match.Groups[1].Value + match.Groups[2].Value + "=\""
            + new Uri(WebsiteImAt, match.Groups[3].Value).AbsoluteUri + "\""
            + match.Groups[4].Value + ">",
    RegexOptions.IgnoreCase | RegexOptions.Multiline);

I would also advise against using Regex here. Instead, apply the Uri trick in code that uses a DOM, perhaps XmlDocument (for XHTML) or the HTML Agility Pack (otherwise), and process all //@src or //@href attributes.
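For the XHTML case, the DOM approach might look roughly like the sketch below. It assumes the input is well-formed XML with no DOCTYPE and no HTML-only entities such as &nbsp;, since XmlDocument would otherwise fail to load it. The names XhtmlLinkRewriter and MakeAbsolute are invented for this example:

```csharp
using System;
using System.Linq;
using System.Xml;

public static class XhtmlLinkRewriter
{
    // Hypothetical helper: rewrites every src and href attribute in an XHTML string.
    public static string MakeAbsolute(string xhtml, Uri baseUri)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        // Copy the matches into a list first so the document is not modified
        // while the XPath result is still being enumerated.
        var attributes = doc.SelectNodes("//@src | //@href").Cast<XmlNode>().ToList();

        foreach (var attribute in attributes)
        {
            Uri absolute;
            if (Uri.TryCreate(baseUri, attribute.Value, out absolute))
            {
                attribute.Value = absolute.AbsoluteUri;
            }
        }

        return doc.OuterXml;
    }
}
```

Values that are already absolute pass through unchanged, because resolving an absolute URI against a base returns the absolute URI itself.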

Solution 4

While this may not be the most robust of solutions, it should get the job done.

var host = "http://domain.is";
var someHtml = @"
<a href=""/some/relative"">Relative</a>
<img src=""/some/relative"" />
<a href=""http://domain.is/some/absolute"">Absolute</a>
<img src=""http://domain.is/some/absolute"" />
";

// First strip the host from links that are already absolute, so every link is relative.
someHtml = someHtml.Replace("src=\"" + host, "src=\"");
someHtml = someHtml.Replace("href=\"" + host, "href=\"");
// Then prefix every link with the host.
someHtml = someHtml.Replace("src=\"", "src=\"" + host);
someHtml = someHtml.Replace("href=\"", "href=\"" + host);

Solution 5

You could use the HTML Agility Pack to accomplish this. You would do something along these (untested) lines:

  • Load the URL
  • Select all links
  • Load each link into a Uri and test whether it is relative; if it is relative, convert it to absolute
  • Update the link's value with the new URI
  • Save the file
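The middle three steps above can be sketched as follows for HTML that has already been downloaded into a string. This is an untested outline that assumes the HtmlAgilityPack NuGet package is referenced. The names HapLinkRewriter and MakeAbsolute are made up for illustration, and loading the page and saving the result are left to the caller:

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package

public static class HapLinkRewriter
{
    // Hypothetical helper: rewrites href and src attributes on every element.
    public static string MakeAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null rather than an empty collection when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//*[@href or @src]");
        if (nodes == null)
        {
            return html;
        }

        foreach (var node in nodes)
        {
            foreach (var name in new[] { "href", "src" })
            {
                var value = node.GetAttributeValue(name, string.Empty);
                if (string.IsNullOrEmpty(value))
                {
                    continue;
                }

                // Resolving an already absolute value returns it unchanged.
                Uri absolute;
                if (Uri.TryCreate(baseUri, value, out absolute))
                {
                    node.SetAttributeValue(name, absolute.AbsoluteUri);
                }
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```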

Here are a few examples:

Relative to absolute paths in HTML (asp.net)

http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/

Author: Gary

Updated on June 13, 2022

Comments

  • Gary, almost 2 years ago

    I'm mirroring some internal websites for backup purposes. Right now I use this C# code:

    System.Net.WebClient client = new System.Net.WebClient();
    byte[] dl = client.DownloadData(url);
    

    This simply downloads the HTML into a byte array, which is what I want. The problem, however, is that the links within the HTML are relative most of the time, not absolute.

    I want to prepend the full http://domain.is to each relative link, converting it to an absolute link that points to the original content. I'm only concerned with href= and src=. Is there a regular expression that will cover some of the basic cases?

    Edit [My Attempt]:

    public static string RelativeToAbsoluteURLS(string text, string absoluteUrl)
    {
        if (String.IsNullOrEmpty(text))
        {
            return text;
        }
    
        String value = Regex.Replace(
            text, 
            "<(.*?)(src|href)=\"(?!http)(.*?)\"(.*?)>", 
            "<$1$2=\"" + absoluteUrl + "$3\"$4>", 
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
    
        return value.Replace(absoluteUrl + "/", absoluteUrl);
    }
    
  • Gary, over 13 years ago
    Does this change the links within the html at myUri from relative to absolute, or is this just better practice for using the WebClient?
  • Gary, over 13 years ago
    I added an edit that works at least in my few test cases. Looking at the regex stuff, it looks fairly similar, but your code looks much more complicated. I honestly have never used the MatchEvaluator and the delegate stuff; is your code better?
  • Nathan Baulch, over 13 years ago
    Using a MatchEvaluator allows you to vastly simplify the regex pattern and use the much more robust Uri.TryCreate method instead. A regex that matches all possible URIs would be extremely complex.
  • Smith, over 11 years ago
    I tried your example, but there seems to be a bug. If I have a base URL of http://ww.baseurl.com/somedir and try to create an absolute path by adding /login.php using your method, I get http://ww.baseurl.com/login.php instead of http://ww.baseurl.com/somedir/login.php