How do you remove invalid characters when creating a friendly URL (i.e. how do you create a slug)?
Solution 1
I've come up with the following two extension methods (ASP.NET / C#):
public static string RemoveAccent(this string txt)
{
    // Encoding to a Cyrillic code page relies on best-fit fallback to map accented letters to plain ones
    byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt);
    return System.Text.Encoding.ASCII.GetString(bytes);
}
public static string Slugify(this string phrase)
{
    string str = phrase.RemoveAccent().ToLower();
    str = System.Text.RegularExpressions.Regex.Replace(str, @"[^a-z0-9\s-]", ""); // Remove all invalid chars
    str = System.Text.RegularExpressions.Regex.Replace(str, @"\s+", " ").Trim(); // Convert multiple spaces into one space
    str = System.Text.RegularExpressions.Regex.Replace(str, @"\s", "-"); // Replace spaces with dashes
    return str;
}
Solution 2
It depends on the language you are using and the technique you want to use. Take a look at this snippet of JavaScript from the Django source; it does exactly what you need. You can easily port it to the language of your choice, I guess.
This is the Python snippet used in the Django slugify function; it's a lot shorter:
import re
import unicodedata

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    # Adapted for Python 3: decode back to str after dropping non-ASCII bytes.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return re.sub(r'[-\s]+', '-', value)
I think every language has a port of this, since it's a common problem. Just Google for slugify + your language.
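To make the normalization step less magical, here is a minimal sketch of what NFKD decomposition does to accented letters. The helper name strip_accents is made up for illustration and is not part of Django; it filters combining marks directly instead of encoding to ASCII, which is a slightly different route to the same idea.

```python
import unicodedata

def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    # NFKD splits a letter such as "ö" into "o" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks have Unicode category "Mn"; base letters are kept.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Göra äldre"))  # Gora aldre
```

Letters that have no decomposition, such as ø or ß, pass through this helper unchanged, so they would still need an explicit mapping (or would simply be dropped by an ASCII encode).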
Solution 3
The best method IMO is to whitelist characters rather than trying to look for invalid characters. However, accented characters like é are fairly common (and your URL will be odd without them) so you could convert these first.
In PHP you can use the strtr function, but you should be able to modify this for your needs on ASP.NET:
// $str is the input string; strtr() maps bytes, so this form assumes a single-byte encoding
strtr(
    $str,
    'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŔŕ',
    'aaaaaaaceeeeiiiidnoooooouuuuybsaaaaaaaceeeeiiiidnoooooouuuuybyrr'
);
Now here's your process:
- [optional] Convert the string to lowercase (usually recommended for URLs).
- [optional] Convert the accented characters using the above mapping.
- Run through your input string character by character. It may be faster to do the first two steps per character instead of on the whole string, depending on what built-in functions you have.
- If the character is in the range a-z or 0-9, add it to your new string; otherwise:
a) If you already have a hyphen on the end of your new string, ignore the character.
b) If not, add a hyphen to the end of the string.
- When you get to the end, remove any leading or trailing hyphens and you're done!
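The process above could be sketched in Python roughly as follows. The names ACCENT_MAP, ALLOWED and slugify_whitelist are assumptions for this example, and the accent table is deliberately short; extend it to cover whatever characters you expect in your input.

```python
import string

# Hypothetical, deliberately small mapping; both strings must have equal length.
ACCENT_MAP = str.maketrans(
    "àáâãäåçèéêëìíîïñòóôõöøùúûüýÿ",
    "aaaaaaceeeeiiiinoooooouuuuyy",
)
ALLOWED = set(string.ascii_lowercase + string.digits)

def slugify_whitelist(text):
    """Keep a-z and 0-9; collapse every other run of characters into one hyphen."""
    text = text.lower().translate(ACCENT_MAP)  # the two optional steps
    out = []
    for ch in text:
        if ch in ALLOWED:
            out.append(ch)
        elif not out or out[-1] != "-":
            # Not whitelisted and no hyphen at the end yet: add one.
            out.append("-")
        # Otherwise a hyphen is already at the end, so the character is ignored.
    return "".join(out).strip("-")  # remove leading/trailing hyphens

print(slugify_whitelist("Göra äldre"))  # gora-aldre
```

Because anything outside the whitelist becomes a hyphen, unexpected input can never leak an unsafe character into the URL; the worst case is a run of text collapsing to a single hyphen.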
Solution 4
You could add a new field to the Products table that contains a URL-safe and unique name for each product. This could probably be automatically generated initially (substituting non-safe characters with the closest safe equivalent - gora-aldre?) and then fine-tuned as needed.
Since the replacement of non-safe characters is not (always) reversible, it isn't entirely feasible to do this kind of thing on the fly.
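A small sketch of why the mapping is not reversible, and of one possible way to keep stored names unique. The functions basic_slug and unique_slug and the numeric-suffix scheme are assumptions for illustration; in practice the set of taken names would be a lookup against the Products table.

```python
import re
import unicodedata

def basic_slug(name):
    """Lowercase ASCII slug; accents are dropped, other characters become hyphens."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")

def unique_slug(name, taken):
    """Return a slug not yet in `taken`, appending -2, -3, ... on collision."""
    base = basic_slug(name) or "item"  # fallback if nothing survives
    candidate, n = base, 2
    while candidate in taken:
        candidate = f"{base}-{n}"
        n += 1
    taken.add(candidate)
    return candidate

taken = set()
print(unique_slug("Göra äldre", taken))  # gora-aldre
print(unique_slug("Gora aldre", taken))  # gora-aldre-2
```

Two different names produce the same base slug here, which is exactly why the slug alone cannot be turned back into the original name and why it helps to store it once it has been generated.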
Alternatively, you build the URL thusly:
http://example.com/products/1234/safe-string
Where safe-string is created on the fly, replacing unsafe characters as needed. The number 1234 is the product key. You use the key to look up the product; the 'safe-string' is there more for the user and search engines.
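As a rough sketch of this URL scheme, assuming the path layout from the example URL above: the helper names and the regular expression are made up for illustration, and only the numeric key is used for the lookup.

```python
import re

def product_url(product_id, slug):
    """Build the public URL; the slug is decorative, the id does the work."""
    return f"http://example.com/products/{product_id}/{slug}"

# The key is mandatory; the trailing slug segment is optional and ignored.
PRODUCT_PATH = re.compile(r"^/products/(\d+)(?:/[^/]*)?/?$")

def product_id_from_path(path):
    """Extract the product key from a request path, or None if it doesn't match."""
    match = PRODUCT_PATH.match(path)
    return int(match.group(1)) if match else None

print(product_url(1234, "safe-string"))
print(product_id_from_path("/products/1234/safe-string"))  # 1234
```

Since the slug is never parsed, it does not need to be reversible or unique, which sidesteps the problem of generating it on the fly.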
Solution 5
Two things to keep in mind:
URL rewriting generally does not have a positive effect on search engines (and frequently a negative one) -- so you should only do it if you know of a measurable positive effect on user satisfaction (and accordingly: make your URLs useful for the users).
If you do decide to do URL rewriting, you must have the technical details down perfectly. For instance, you should never have more than one URL showing the same content. Make sure you use UTF-8 for the encoding of non-ASCII content, use escaped links within your content, and generally test on various browsers to make sure things work as planned. If any of this is foreign to you, then I would strongly recommend not doing URL rewriting for the moment.
FWIW Some of the search engine side issues are covered at http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html
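A minimal sketch of two of the technical details mentioned above: escaping links as UTF-8 and keeping a single URL per piece of content. The function names and the /products/<id>/<slug> layout are assumptions, and answering with a permanent redirect is one common approach rather than something this answer prescribes.

```python
from urllib.parse import quote

def canonical_path(product_id, canonical_slug):
    """Build the one path that should serve this content."""
    # quote() percent-encodes non-ASCII characters as UTF-8 by default.
    return f"/products/{product_id}/{quote(canonical_slug)}"

def resolve(requested_path, product_id, canonical_slug):
    """Return (status, path): 200 if already canonical, else 301 to the canonical path."""
    target = canonical_path(product_id, canonical_slug)
    if requested_path == target:
        return 200, target
    return 301, target

print(quote("Göra äldre"))  # G%C3%B6ra%20%C3%A4ldre
print(resolve("/products/1234/old-name", 1234, "gora-aldre"))
```

With a check like this, an outdated or mistyped slug still reaches the right page, but only one URL ever serves the content.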
Anthony
Updated on September 18, 2022
Comments
-
Anthony 4 months
Say I have this webpage:
http://ww.xyz.com/Product.aspx?CategoryId=1
If the name of CategoryId=1 is "Dogs" I would like to convert the URL into something like this:
http://ww.xyz.com/Products/Dogs
The problem is if the category name contains foreign (or invalid for a url) characters. If the name of CategoryId=2 is "Göra äldre", what should be the new URL?
Logically it should be:
http://ww.xyz.com/Products/Göra äldre
but it will not work. Firstly because of the space (which I can easily replace with a dash, for example), but what about the foreign characters? In ASP.NET I could use the URLEncode function, which would give something like this:
http://ww.xyz.com/Products/G%c3%b6ra+%c3%a4ldre
but I can't really say it's better than the original URL (http://ww.xyz.com/Product.aspx?CategoryId=2). Ideally I would like to generate the URL below, but how can I do this automatically (i.e. converting foreign characters to 'safe' URL characters)?
http://ww.xyz.com/Products/Gora-aldre
-
Anthony over 12 years
The Products table is just an example and the actual string will not necessarily come from our database, so I need a way to do this automatically. For example, on this site the URL is generated on the fly based on the question asked. How do they automatically remove non-safe characters if those non-safe characters are used in the question itself?
-
Kris over 12 years
@Anthony You create and save the URL-safe version when the 'question' is created, or you go with my alternative (see edit).
-
Anthony over 12 years
Thanks, but how do you do this exactly: 'safe-string is created on the fly replacing unsafe characters as needed'?
-
David Z over 12 years
In my experience Python makes this far easier than most other languages, because of the unicodedata module.