How does Stack Overflow generate its SEO-friendly URLs?

regex language-agnostic seo friendly-url slug

42,951

Solution 1

Here's how we do it. Note that there are probably more edge conditions than you realize at first glance.

This is the second version, unrolled for 5x more performance (and yes, I benchmarked it). I figured I'd optimize it because this function can be called hundreds of times per page.

/// <summary>
/// Produces optional, URL-friendly version of a title, "like-this-one". 
/// hand-tuned for speed, reflects performance refactoring contributed
/// by John Gietzen (user otac0n) 
/// </summary>
public static string URLFriendly(string title)
{
    if (title == null) return "";

    const int maxlen = 80;
    int len = title.Length;
    bool prevdash = false;
    var sb = new StringBuilder(len);
    char c;

    for (int i = 0; i < len; i++)
    {
        c = title[i];
        if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
        {
            sb.Append(c);
            prevdash = false;
        }
        else if (c >= 'A' && c <= 'Z')
        {
            // tricky way to convert to lowercase
            sb.Append((char)(c | 32));
            prevdash = false;
        }
        else if (c == ' ' || c == ',' || c == '.' || c == '/' || 
            c == '\\' || c == '-' || c == '_' || c == '=')
        {
            if (!prevdash && sb.Length > 0)
            {
                sb.Append('-');
                prevdash = true;
            }
        }
        else if ((int)c >= 128)
        {
            int prevlen = sb.Length;
            sb.Append(RemapInternationalCharToAscii(c));
            if (prevlen != sb.Length) prevdash = false;
        }
        if (i == maxlen) break;
    }

    if (prevdash)
        return sb.ToString().Substring(0, sb.Length - 1);
    else
        return sb.ToString();
}

To see the previous version of the code this replaced (but is functionally equivalent to, and 5x faster), view revision history of this post (click the date link).

Also, the RemapInternationalCharToAscii method source code can be found here.

Solution 2

Here is my version of Jeff's code. I've made the following changes:

The hyphens were appended in such a way that one could be added, and then need removing as it was the last character in the string. That is, we never want “my-slug-”. This means an extra string allocation to remove it on this edge case. I’ve worked around this by delay-hyphening. If you compare my code to Jeff’s the logic for this is easy to follow.
His approach is purely lookup based and missed a lot of characters I found in examples while researching on Stack Overflow. To counter this, I first peform a normalisation pass (AKA collation mentioned in Meta Stack Overflow question Non US-ASCII characters dropped from full (profile) URL), and then ignore any characters outside the acceptable ranges. This works most of the time...
... For when it doesn’t I’ve also had to add a lookup table. As mentioned above, some characters don’t map to a low ASCII value when normalised. Rather than drop these I’ve got a manual list of exceptions that is doubtless full of holes, but it is better than nothing. The normalisation code was inspired by Jon Hanna’s great post in Stack Overflow question How can I remove accents on a string?.

The case conversion is now also optional.

public static class Slug
{
    public static string Create(bool toLower, params string[] values)
    {
        return Create(toLower, String.Join("-", values));
    }

    /// <summary>
    /// Creates a slug.
    /// References:
    /// http://www.unicode.org/reports/tr15/tr15-34.html
    /// https://meta.stackexchange.com/questions/7435/non-us-ascii-characters-dropped-from-full-profile-url/7696#7696
    /// https://stackoverflow.com/questions/25259/how-do-you-include-a-webpage-title-as-part-of-a-webpage-url/25486#25486
    /// https://stackoverflow.com/questions/3769457/how-can-i-remove-accents-on-a-string
    /// </summary>
    /// <param name="toLower"></param>
    /// <param name="normalised"></param>
    /// <returns></returns>
    public static string Create(bool toLower, string value)
    {
        if (value == null)
            return "";

        var normalised = value.Normalize(NormalizationForm.FormKD);

        const int maxlen = 80;
        int len = normalised.Length;
        bool prevDash = false;
        var sb = new StringBuilder(len);
        char c;

        for (int i = 0; i < len; i++)
        {
            c = normalised[i];
            if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            {
                if (prevDash)
                {
                    sb.Append('-');
                    prevDash = false;
                }
                sb.Append(c);
            }
            else if (c >= 'A' && c <= 'Z')
            {
                if (prevDash)
                {
                    sb.Append('-');
                    prevDash = false;
                }
                // Tricky way to convert to lowercase
                if (toLower)
                    sb.Append((char)(c | 32));
                else
                    sb.Append(c);
            }
            else if (c == ' ' || c == ',' || c == '.' || c == '/' || c == '\\' || c == '-' || c == '_' || c == '=')
            {
                if (!prevDash && sb.Length > 0)
                {
                    prevDash = true;
                }
            }
            else
            {
                string swap = ConvertEdgeCases(c, toLower);

                if (swap != null)
                {
                    if (prevDash)
                    {
                        sb.Append('-');
                        prevDash = false;
                    }
                    sb.Append(swap);
                }
            }

            if (sb.Length == maxlen)
                break;
        }
        return sb.ToString();
    }

    static string ConvertEdgeCases(char c, bool toLower)
    {
        string swap = null;
        switch (c)
        {
            case 'ı':
                swap = "i";
                break;
            case 'ł':
                swap = "l";
                break;
            case 'Ł':
                swap = toLower ? "l" : "L";
                break;
            case 'đ':
                swap = "d";
                break;
            case 'ß':
                swap = "ss";
                break;
            case 'ø':
                swap = "o";
                break;
            case 'Þ':
                swap = "th";
                break;
        }
        return swap;
    }
}

For more details, the unit tests, and an explanation of why Facebook's URL scheme is a little smarter than Stack Overflows, I've got an expanded version of this on my blog.

Solution 3

You will want to setup a custom route to point the URL to the controller that will handle it. Since you are using Ruby on Rails, here is an introduction in using their routing engine.

In Ruby, you will need a regular expression like you already know and here is the regular expression to use:

def permalink_for(str)
    str.gsub(/[^\w\/]|[!\(\)\.]+/, ' ').strip.downcase.gsub(/\ +/, '-')
end

Solution 4

You can also use this JavaScript function for in-form generation of the slug's (this one is based on/copied from Django):

function makeSlug(urlString, filter) {
    // Changes, e.g., "Petty theft" to "petty_theft".
    // Remove all these words from the string before URLifying

    if(filter) {
        removelist = ["a", "an", "as", "at", "before", "but", "by", "for", "from",
        "is", "in", "into", "like", "of", "off", "on", "onto", "per",
        "since", "than", "the", "this", "that", "to", "up", "via", "het", "de", "een", "en",
        "with"];
    }
    else {
        removelist = [];
    }
    s = urlString;
    r = new RegExp('\\b(' + removelist.join('|') + ')\\b', 'gi');
    s = s.replace(r, '');
    s = s.replace(/[^-\w\s]/g, ''); // Remove unneeded characters
    s = s.replace(/^\s+|\s+$/g, ''); // Trim leading/trailing spaces
    s = s.replace(/[-\s]+/g, '-'); // Convert spaces to hyphens
    s = s.toLowerCase(); // Convert to lowercase
    return s; // Trim to first num_chars characters
}

Solution 5

For good measure, here's the PHP function in WordPress that does it... I'd think that WordPress is one of the more popular platforms that uses fancy links.

    function sanitize_title_with_dashes($title) {
            $title = strip_tags($title);
            // Preserve escaped octets.
            $title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
            // Remove percent signs that are not part of an octet.
            $title = str_replace('%', '', $title);
            // Restore octets.
            $title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
            $title = remove_accents($title);
            if (seems_utf8($title)) {
                    if (function_exists('mb_strtolower')) {
                            $title = mb_strtolower($title, 'UTF-8');
                    }
                    $title = utf8_uri_encode($title, 200);
            }
            $title = strtolower($title);
            $title = preg_replace('/&.+?;/', '', $title); // kill entities
            $title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
            $title = preg_replace('/\s+/', '-', $title);
            $title = preg_replace('|-+|', '-', $title);
            $title = trim($title, '-');
            return $title;
    }

This function as well as some of the supporting functions can be found in wp-includes/formatting.php.

View more solutions

42,951

wusher

If you already know what recursion is, remember the answer. Otherwise, find someone who is closer to me than you are; then ask them what it is.

Updated on September 11, 2020

Comments

wusher over 3 years
What is a good complete regular expression or some other process that would take the title:

How do you change a title to be part of the URL like Stack Overflow?

and turn it into
```
how-do-you-change-a-title-to-be-part-of-the-url-like-stack-overflow
```
that is used in the SEO-friendly URLs on Stack Overflow?

The development environment I am using is Ruby on Rails, but if there are some other platform-specific solutions (.NET, PHP, Django), I would love to see those too.

I am sure I (or another reader) will come across the same problem on a different platform down the line.

I am using custom routes, and I mainly want to know how to alter the string to all special characters are removed, it's all lowercase, and all whitespace is replaced.
- darkAsPitch almost 16 years
  
  What about funny characters? What are you going to do about those? Umlauts? Punctuation? These need to be considered. Basically, I would use a white-list approach, as opposed to the black-list approaches above: Describe which characters you will allow, which characters you will convert (to what?) and then change the rest to something meaningfull (""). I doubt you can do this in one regex... Why not just loop through the characters?
- casperOne over 12 years
  
  Should be migrated to meta; as the question and answer both specifically deal with SO implementation, and the accepted answer is from @JeffAtwood.
- Paŭlo Ebermann over 12 years
  
  @casperOne Do you think Jeff is not allowed some non-meta reputation? The question is about "how can one do something like this", not specifically "how is this done here".
- casperOne over 12 years
  
  @PaŭloEbermann: It's not about Jeff getting some non-meta reputation (how much reputation he has is really not my concern); the question body specifically referenced StackOverflow's implementation hence the rationale for it being on meta.
sjas almost 12 years

I really don't know what to say about this piece of information.
WhyNotHugo almost 12 years

While this works perfectly, I somehow feel it isn't very efficient.
Nikola Loncar almost 10 years

This is not full answer. You are missing functions like : remove_accents,seems_utf8...
NickG over 9 years

That's a really good example of when NOT to use a switch case statement.
Muhammad Rehan Saeed about 9 years

According to Wikipedia, Mozilla 1.4, Netscape 7.1, Opera 7.11 were among the first applications to support IDNA. A browser plugin is available for Internet Explorer 6 to provide IDN support. Internet Explorer 7.0 and Windows Vista's URL APIs provide native support for IDN. Sounds like removing UTF-8 characters is a waste of time. Long live UTF-8!!!
mickro almost 7 years

to complete @The How-To Geek answer you still can git clone git://core.git.wordpress.org/ and find the wp-includes/formatting.php file into
Skary over 2 years

I don't really see the pint of a dictionary if you have to scann it entirely.. why not just a list? if that list of structured data (i see you use dictionary like a tuple). if the list is also a constant, why not define it as follow outside of the method? IMHO in such scenario is much more efficent a constant dictionary as yours but with key and value switched : 'à'-> 'a' , ... , 'ą'->'a', ..., 'ç' -> 'c', ... ,'ž' -> 'z'