How do you remove invalid characters when creating a friendly URL (i.e. how do you create a slug)?

Solution 1

I've come up with the following two extension methods (ASP.NET / C#):

using System.Text;
using System.Text.RegularExpressions;

// Extension methods must live in a static class; the class name is arbitrary.
public static class StringExtensions
{
    // Round-trips the text through the Cyrillic code page so that accented
    // Latin letters fall back to their plain ASCII counterparts.
    public static string RemoveAccent(this string txt)
    {
        byte[] bytes = Encoding.GetEncoding("Cyrillic").GetBytes(txt);
        return Encoding.ASCII.GetString(bytes);
    }

    public static string Slugify(this string phrase)
    {
        string str = phrase.RemoveAccent().ToLower();
        str = Regex.Replace(str, @"[^a-z0-9\s-]", ""); // remove all invalid characters
        str = Regex.Replace(str, @"\s+", " ").Trim();  // collapse runs of whitespace into one space
        str = Regex.Replace(str, @"\s", "-");          // replace spaces with dashes
        return str;
    }
}

Solution 2

It depends on the language you are using and the technique you want to use. Take a look at this snippet of JavaScript from the Django source; it does exactly what you need, and you should be able to port it to the language of your choice fairly easily.

This is the Python snippet used in the Django slugify function. It's a lot shorter:

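import re
import unicodedata

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    # Decompose accented characters, then drop every non-ASCII code point.
    # Python 3 adaptation: decode back to str instead of calling unicode().
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Keep only word characters, whitespace and hyphens.
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    # Collapse runs of hyphens and whitespace into a single hyphen.
    return re.sub(r'[-\s]+', '-', value)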

I think nearly every language has a port of this, since it's a common problem. Just search for "slugify" plus your language.

Solution 3

The best method IMO is to whitelist characters rather than trying to look for invalid characters. However, accented characters like é are fairly common (and your URL will be odd without them) so you could convert these first.

In PHP you can use the strtr function, but you should be able to adapt this to your needs in ASP.NET:

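// strtr() replaces byte by byte, so this form assumes a single-byte encoding.
strtr(
  $string, // the text to transliterate (variable name is illustrative)
  'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûýýþÿŔŕ',
  'aaaaaaaceeeeiiiidnoooooouuuuybsaaaaaaaceeeeiiiidnoooooouuuyybyrr'
);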

Now here's your process:

  1. [optional] Convert the string to lowercase (usually recommended for URLs).
  2. [optional] Convert the accented characters using the above mapping.
  3. Run through your input string character-by-character.
  4. It may be faster to do #1 and #2 per character instead of on the whole string, depending on what built-in functions you have.
  5. If the character is in the range a-z or 0-9, add it to your new string; otherwise:
    a) If your new string already ends with a hyphen, skip the character.
    b) If not, add a hyphen to the end of your new string.
  6. When you get to the end, remove any leading or trailing hyphens and you're done!
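
The steps above can be sketched in Python roughly as follows. This is only an illustration: the names ACCENTS, ALLOWED and slug_from are made up for this example, and the accent mapping is deliberately tiny, so you would need to extend it for real data.

```python
# Small illustrative mapping (an assumption for this sketch); a real one
# would cover many more characters, like the strtr() table above.
ACCENTS = {"à": "a", "á": "a", "ä": "a", "å": "a", "ç": "c", "è": "e",
           "é": "e", "ï": "i", "ñ": "n", "ö": "o", "ø": "o", "ü": "u"}

# Whitelist: only these characters are copied into the slug.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789")

def slug_from(text):
    out = []
    for ch in text.lower():                # step 1: lowercase
        ch = ACCENTS.get(ch, ch)           # step 2: map accented characters
        if ch in ALLOWED:                  # step 5: keep whitelisted characters
            out.append(ch)
        elif not out or out[-1] != "-":    # 5a/5b: at most one hyphen in a row
            out.append("-")
    return "".join(out).strip("-")         # step 6: trim leading/trailing hyphens

print(slug_from("Göra äldre"))  # gora-aldre
```

Because anything outside the whitelist becomes a hyphen, unknown characters never reach the URL; the trade-off is that characters missing from the mapping are lost rather than transliterated.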

Solution 4

You could add a new field to the Products table that contains a URL-safe, unique name for each product. This could probably be generated automatically at first (substituting each non-safe character with its closest safe equivalent, e.g. gora-aldre) and then fine-tuned as needed.

Since the replacement of non-safe characters is not (always) reversible, it isn't entirely feasible to do this kind of thing on the fly.

Alternatively, you build the URL thusly:

http://example.com/products/1234/safe-string

Here, safe-string is created on the fly, replacing unsafe characters as needed, and the number 1234 is the product key. You use the key to look up the product; the 'safe-string' is there mainly for the user and search engines.
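
As a rough Python sketch of this URL scheme (the helper names slugify, product_path and parse_product_path, and the /products/ prefix handling, are assumptions made for illustration, not part of any framework):

```python
import re
import unicodedata

def slugify(text):
    """Hypothetical helper: fold to ASCII, lowercase, and hyphenate."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")

def product_path(product_id, name):
    """Build a path of the form /products/<key>/<safe-string>."""
    return "/products/{}/{}".format(product_id, slugify(name))

# Only the numeric key matters for the lookup; the slug part is optional.
PRODUCT_PATH = re.compile(r"^/products/(\d+)(?:/([^/]*))?/?$")

def parse_product_path(path):
    """Return (product_id, slug), or None if the path does not match."""
    match = PRODUCT_PATH.match(path)
    if match is None:
        return None
    return int(match.group(1)), match.group(2) or ""

print(product_path(1234, "Göra äldre"))                 # /products/1234/gora-aldre
print(parse_product_path("/products/1234/gora-aldre"))  # (1234, 'gora-aldre')
```

Since the slug is never used for the lookup, it does not need to be reversible or unique; only the key does.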

Solution 5

Two things to keep in mind:

  1. URL rewriting generally does not have a positive effect on search engines (and frequently a negative one) -- so you should only do it if you know of a measurable positive effect on user satisfaction (and accordingly: make your URLs useful for the users).

  2. If you do decide to do URL rewriting, you must have the technical details down perfectly. For instance, you should never have more than one unique URL showing the same content. Make sure you use UTF-8 for the encoding of non-ASCII content, use escaped links within your content, and generally test on various browsers to make sure things work as planned. If any of this is foreign to you, then I would strongly recommend not doing URL rewriting for the moment.
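
To illustrate what "UTF-8 encoding" and "escaped links" mean in practice, here is a small Python sketch using the standard library's urllib.parse. The function name escape_path_segment is made up for this example.

```python
from urllib.parse import quote, unquote

def escape_path_segment(segment):
    """Percent-encode one path segment as UTF-8 (safe="" also escapes "/")."""
    return quote(segment, safe="", encoding="utf-8")

escaped = escape_path_segment("Göra äldre")
print(escaped)           # G%C3%B6ra%20%C3%A4ldre
print(unquote(escaped))  # Göra äldre
```

The escaped form is what should appear in the href of your links, even if browsers display the decoded characters in the address bar.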

FWIW Some of the search engine side issues are covered at http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html

Author: Anthony

Updated on September 18, 2022

Comments

  • Anthony
    Anthony 4 months

    Say I have this webpage: http://ww.xyz.com/Product.aspx?CategoryId=1

    If the name of CategoryId=1 is "Dogs" I would like to convert the URL into something like this: http://ww.xyz.com/Products/Dogs

    The problem is if the category name contains foreign (or invalid for a url) characters. If the name of CategoryId=2 is "Göra äldre", what should be the new URL?

    Logically it should be: http://ww.xyz.com/Products/Göra äldre but it will not work.

    Firstly because of the space (which I can easily replace by a dash for example) but what about the foreign characters? In Asp.net I could use the URLEncode function which would give something like this: http://ww.xyz.com/Products/G%c3%b6ra+%c3%a4ldre but I can't really say it's better than the original URL (http://ww.xyz.com/Product.aspx?CategoryId=2).

    Ideally I would like to generate this one: http://ww.xyz.com/Products/Gora-aldre. But how can I do this automatically (i.e. convert foreign characters to 'safe' URL characters)?

  • Anthony
    Anthony over 12 years
    The Products table is just an example and the actual string will not necessarily come from our database so I need a way to do this automatically. For example on this site the url is generated on the fly based on the question asked. How do they automatically remove non-safe characters if those non-safe characters are used in the question itself?
  • Kris
    Kris over 12 years
    @Anthony You create and save the url safe version when the 'question' is created or you go with my alternative (see edit)
  • Anthony
    Anthony over 12 years
    Thanks but how do you do this exactly : 'safe-string is created on the fly replacing unsafe characters as needed'?
  • David Z
    David Z over 12 years
    In my experience Python makes this far easier than most other languages, because of the unicodedata module.