Tactics for using PHP in a high-load site

Solution 1

No two sites are alike. You really need to get a tool like JMeter and benchmark to see where your problem points will be. You can spend a lot of time guessing and improving, but you won't see real results until you measure and compare your changes.

For example, for many years, the MySQL query cache was the solution to all of our performance problems. If your site was slow, MySQL experts suggested turning the query cache on. It turns out that if you have a high write load, the cache is actually crippling. If you turned it on without testing, you'd never know.

And don't forget that you are never done scaling. A site that handles 10 req/s will need changes to support 1000 req/s. And if you're lucky enough to need to support 10,000 req/s, your architecture will probably look completely different as well.

Databases

  • Don't use MySQLi -- PDO is the 'modern' OO database access layer. The most important feature to use is placeholders in your queries. It's smart enough to use server-side prepares and other optimizations for you as well (see the sketch after this list).
  • You probably don't want to break your database up at this point. If you do find that one database isn't cutting it, there are several techniques to scale up, depending on your app. Replicating to additional servers typically works well if you have more reads than writes. Sharding is a technique to split your data over many machines.
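
A minimal sketch of the placeholder style mentioned above; the DSN, credentials and table are illustrative:

    <?php
    // Connect via PDO and make errors throw instead of failing silently.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Placeholders keep quoting/escaping in the driver and let it use
    // server-side prepares where supported.
    $authorId = 42; // example value
    $stmt = $pdo->prepare('SELECT id, title FROM articles WHERE author_id = ?');
    $stmt->execute(array($authorId));
    $articles = $stmt->fetchAll(PDO::FETCH_ASSOC);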

Caching

  • You probably don't want to cache in your database. The database is typically your bottleneck, so adding more I/O to it is usually a bad thing. There are several PHP caches out there that accomplish similar things, like APC and Zend.
  • Measure your system with caching on and off. I bet you'll find cases where the cache is heavier than serving the pages straight.
  • If it takes a long time to build your comments and article data from the db, integrate memcache into your system. You can cache the query results and store them in a memcached instance. It's important to remember that retrieving the data from memcache must be faster than assembling it from the database to see any benefit.
  • If your articles aren't dynamic, or only have simple dynamic changes after they're generated, consider writing out HTML or PHP to the disk. You could have an index.php page that looks on disk for the article; if it's there, it streams it to the client. If it isn't, it generates the article, writes it to the disk and sends it to the client. Deleting files from the disk causes pages to be regenerated, so if a comment is added to an article, delete the cached copy -- it will be rebuilt on the next request (see the sketch below).
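
A minimal sketch of that disk-cache front controller; the cache path and the renderArticle() helper are hypothetical:

    <?php
    // Hypothetical front controller for the disk-cache scheme above.
    $id = isset($_GET['id']) ? (int) $_GET['id'] : 0;
    $cacheFile = __DIR__ . "/cache/article-$id.html";

    if (is_file($cacheFile)) {
        readfile($cacheFile); // cache hit: stream straight from disk
        exit;
    }

    $html = renderArticle($id); // hypothetical: builds the page from the DB
    file_put_contents($cacheFile, $html, LOCK_EX);
    echo $html;

    // Invalidation: when a comment changes, unlink() the article's cache
    // file and the next request regenerates it.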

Solution 2

I'm a lead developer on a site with over 15M users. We have had very few scaling problems because we planned for it EARLY and scaled thoughtfully. Here are some of the strategies I can suggest from my experience.

SCHEMA First off, denormalize your schemas. This means that rather than having multiple relational tables, you should instead opt to have one big table. In general, joins are a waste of precious DB resources because doing multiple prepares and collation burns disk I/O. Avoid them when you can.

The trade-off here is that you will be storing/pulling redundant data, but this is acceptable because storage and intra-cage bandwidth are very cheap (bigger disks), whereas multiple prepare I/Os are orders of magnitude more expensive (more servers).

INDEXING Make sure that your queries utilize at least one index. Beware though, that indexes will cost you if you write or update frequently. There are some experimental tricks to avoid this.

You can try adding additional columns that aren't indexed and run parallel to your indexed columns. Then you can have an offline process that writes the non-indexed columns over the indexed columns in batches. This way, you can better control when MySQL will need to recompute the index.
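
One way to read that trick, as a hedged sketch; the table and column names are made up. Hot writes touch only the unindexed shadow column (no index maintenance), and an offline job folds the values into the indexed column in one batch:

    <?php
    // Assumes a $pdo connection as elsewhere. Runs offline, e.g. from cron,
    // so the index is updated on our schedule rather than on every write.
    $pdo->exec('UPDATE articles
                   SET views = views_shadow
                 WHERE views <> views_shadow');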

Avoid computed queries like the plague. If you must compute a query, try to do it once at write time.

CACHING I highly recommend Memcached. It has been proven by the biggest players on the PHP stack (Facebook) and is very flexible. There are two approaches: one is caching in your DB layer, the other is caching in your business logic layer.

The DB layer option would require caching the result of queries retrieved from the DB. You can hash your SQL query using md5() and use that as a lookup key before going to the database. The upside to this is that it is pretty easy to implement. The downside (depending on implementation) is that you lose flexibility because you're treating all caching the same with regard to cache expiration.
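
A minimal sketch of that md5()-keyed DB-layer cache, assuming the pecl/memcached extension and a flat TTL (the very inflexibility noted above):

    <?php
    function cachedQuery(PDO $pdo, Memcached $mc, $sql, array $params, $ttl = 300)
    {
        // One key per distinct query + parameters.
        $key = 'q:' . md5($sql . '|' . serialize($params));

        $rows = $mc->get($key);
        if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
            return $rows; // cache hit
        }

        $stmt = $pdo->prepare($sql);
        $stmt->execute($params);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        $mc->set($key, $rows, $ttl); // cache miss: store for next time
        return $rows;
    }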

In the shop I work in, we use business layer caching, which means each concrete class in our system controls its own caching schema and cache timeouts. This has worked pretty well for us, but be aware that items retrieved from the DB may not match the items in cache, so you will have to update the cache and the DB together.

DATA SHARDING Replication only gets you so far. Sooner than you expect, your writes will become a bottleneck. To compensate, make sure to support data sharding as early as possible. You will likely want to shoot yourself later if you don't.

It is pretty simple to implement. Basically, you want to separate the key authority from the data storage. Use a global DB to store a mapping between primary keys and cluster ids. You query this mapping to get a cluster, and then query the cluster to get the data. You can cache the hell out of this lookup, which makes it a negligible operation.
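
A hedged sketch of that key-authority lookup; the key_map table and column names are invented:

    <?php
    // The global DB maps a primary key to the cluster holding its data; the
    // mapping rarely changes, so cache it aggressively in memcached.
    function getClusterId(PDO $globalDb, Memcached $mc, $userId)
    {
        $key = "cluster:$userId";
        $clusterId = $mc->get($key);
        if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
            return $clusterId;
        }

        $stmt = $globalDb->prepare('SELECT cluster_id FROM key_map WHERE user_id = ?');
        $stmt->execute(array($userId));
        $clusterId = (int) $stmt->fetchColumn();

        $mc->set($key, $clusterId); // no TTL: mappings are near-immutable
        return $clusterId;
    }

The caller then connects to that cluster and queries it for the actual data.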

The downside to this is that it may be difficult to piece together data from multiple shards. But, you can engineer your way around that as well.

OFFLINE PROCESSING Don't make the user wait for your backend if they don't have to. Build a job queue and move any processing that you can offline, handling it separately from the user's request.
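
The simplest possible sketch of such a queue, backed by a plain DB table (the jobs table and payload shape are made up; production setups usually use a real queue such as Gearman or beanstalkd):

    <?php
    // Assumes a $pdo connection and a $userId from the current request.
    // In the request handler: enqueue and return to the user immediately.
    $stmt = $pdo->prepare('INSERT INTO jobs (type, payload, created_at)
                           VALUES (?, ?, NOW())');
    $stmt->execute(array('resize_avatar', json_encode(array('user_id' => $userId))));

    // A separate worker process (cron or daemon) claims rows from the jobs
    // table and does the slow work outside the request/response cycle.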

Solution 3

I've worked on a few sites that get millions of hits per month, backed by PHP & MySQL. Here are some basics:

  1. Cache, cache, cache. Caching is one of the simplest and most effective ways to reduce load on your webserver and database. Cache page content, queries, expensive computation, anything that is I/O bound. Memcache is dead simple and effective.
  2. Use multiple servers once you are maxed out. You can have multiple web servers and multiple database servers (with replication).
  3. Reduce the overall number of requests to your web servers. This entails caching JS, CSS and images using Expires headers (see the sketch after this list). You can also move your static content to a CDN, which will speed up your users' experience.
  4. Measure & benchmark. Run Nagios on your production machines and load test on your dev/qa server. You need to know when your server will catch on fire so you can prevent it.
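
A small sketch of such Expires headers, for the case where PHP itself serves the asset (static files are more often handled in the web server config):

    <?php
    // Tell browsers and proxies to cache this response for a year.
    $maxAge = 365 * 24 * 3600; // seconds
    header('Cache-Control: public, max-age=' . $maxAge);
    header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');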

I'd recommend reading Building Scalable Websites; it was written by one of the Flickr engineers and is a great reference.

Check out my blog post about scalability too, it has a lot of links to presentations about scaling with multiple languages and platforms: http://www.ryandoherty.net/2008/07/13/unicorns-and-scalability/

Solution 4

Re: PDO / MySQLi / MySQLND

@gary

You cannot just say "don't use MySQLi", as they have different goals. PDO is almost an abstraction layer (although it is not actually one) and is designed to make it easy to use multiple database products, whereas MySQLi is specific to MySQL connections. It is wrong to say that PDO is the modern access layer in the context of comparing it to MySQLi, because your statement implies that the progression has been mysql -> mysqli -> PDO, which is not the case.

The choice between MySQLi and PDO is simple - if you need to support multiple database products then you use PDO. If you're just using MySQL then you can choose between PDO and MySQLi.

So why would you choose MySQLi over PDO? See below...

@ross

You are correct that MySQLnd is the newest MySQL core language-level library; however, it is not a replacement for MySQLi. MySQLi (as with PDO) remains the way you would interact with MySQL through your PHP code. Both of these use libmysql as the C client behind the PHP code. The problem is that libmysql is outside of the core PHP engine, and that is where mysqlnd comes in: it is a Native Driver which makes use of the core PHP internals to maximise efficiency, specifically where memory usage is concerned.

MySQLnd is being developed by MySQL themselves and has recently landed in the PHP 5.3 branch, which is in RC testing, ready for a release later this year. You will then be able to use MySQLnd with MySQLi... but not with PDO. This will give MySQLi a performance boost in many areas (not all) and will make it the best choice for MySQL interaction if you do not need the abstraction-like capabilities of PDO.

That said, MySQLnd is now available in PHP 5.3 for PDO too, so you can get the performance enhancements from ND in PDO; however, PDO is still a generic database layer, so it is unlikely to benefit as much from ND's enhancements as MySQLi can.

Some useful benchmarks can be found here although they are from 2006. You also need to be aware of things like this option.

There are a lot of considerations that need to be taken into account when deciding between MySQLi and PDO. In reality it is not going to matter until you get to ridiculously high request numbers, and in that case it makes more sense to be using an extension that has been specifically designed for MySQL rather than one which abstracts things and happens to provide a MySQL driver.

It is not a simple matter of which is best because each has advantages and disadvantages. You need to read the links I've provided and come up with your own decision, then test it and find out. I have used PDO in past projects and it is a good extension but my choice for pure performance would be MySQLi with the new MySQLND option compiled (when PHP 5.3 is released).

Solution 5

General

  • Do not try to optimize before you start to see real world load. You might guess right, but if you don't, you've wasted your time.
  • Use JMeter, Xdebug or another tool to benchmark the site.
  • If load starts to be an issue, either object or data caching will likely be involved, so generally read up on caching options (memcached, MySQL caching options)

Code

  • Profile your code so that you know where the bottleneck is, and whether it's in code or the database

Databases

  • Use MySQLi if portability to other databases is not vital, PDO otherwise
  • If benchmarks reveal the database is the issue, check the queries before you start caching. Use EXPLAIN to see where your queries are slowing down.
  • After the queries are optimized and the database is cached in some way, you may want to use multiple databases. Either replicating to multiple servers or sharding (splitting the data over multiple databases/servers) may be appropriate, depending on the data, the queries, and the kind of read/write behavior.

Caching

  • Plenty of writing has been done on caching code, objects, and data. Look up articles on APC, Zend Optimizer, memcached, QuickCache, JPCache. Do some of this before you really need to, and you'll be less concerned about starting off unoptimized.
  • APC and Zend Optimizer are opcode caches; they speed up PHP code by avoiding reparsing and recompilation of code. Generally simple to install, and worth doing early.
  • Memcached is a generic cache that you can use to cache queries, PHP functions or objects, or entire pages. Code must be specifically written to use it, which can be an involved process if there are no central points to handle creation, update and deletion of cached objects (see the wrapper sketch after this list).
  • QuickCache and JPCache are file caches, otherwise similar to Memcached. The basic concept is simple, but also requires code and is easier with central points of creation, update and deletion.
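
A minimal sketch of such a central point: one get-or-compute helper through which all cached reads go, so creation and expiry live in one place (loadArticle() is a hypothetical loader):

    <?php
    function cacheGetOrCompute(Memcached $mc, $key, $ttl, $compute)
    {
        $value = $mc->get($key);
        if ($mc->getResultCode() === Memcached::RES_SUCCESS) {
            return $value;
        }
        $value = call_user_func($compute); // miss: build the value once
        $mc->set($key, $value, $ttl);
        return $value;
    }

    // Usage: wrap any expensive call site.
    $article = cacheGetOrCompute($mc, "article:$id", 600, function () use ($pdo, $id) {
        return loadArticle($pdo, $id); // hypothetical loader
    });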

Miscellaneous

  • Consider alternative web servers for high load. Servers like lighttpd and nginx can handle large amounts of traffic in much less memory than Apache, if you can sacrifice Apache's power and flexibility (or if you just don't need those things, which often you don't).
  • Remember that hardware is surprisingly cheap these days, so be sure to cost out the effort to optimize a large block of code versus "let's buy a monster server."
  • Consider adding the "MySQL" and "scaling" tags to this question

Comments

  • Ross
    Ross over 4 years

    Before you answer this I have never developed anything popular enough to attain high server loads. Treat me as (sigh) an alien that has just landed on the planet, albeit one that knows PHP and a few optimisation techniques.


    I'm developing a tool in PHP that could attain quite a lot of users, if it works out right. However, while I'm fully capable of developing the program, I'm pretty much clueless when it comes to making something that can deal with huge traffic. So here are a few questions on it (feel free to turn this question into a resource thread as well).

    Databases

    At the moment I plan to use the MySQLi features in PHP5. However, how should I set up the databases in relation to users and content? Do I actually need multiple databases? At the moment everything's jumbled into one database - although I've been considering spreading user data to one, actual content to another and finally core site content (template masters etc.) to another. My reasoning behind this is that sending queries to different databases will ease up the load on them, since one database currently takes all three sources of load. Also, would this still be effective if they were all on the same server?

    Caching

    I have a template system that is used to build the pages and swap out variables. Master templates are stored in the database, and each time a template is called its cached copy (an HTML document) is used. At the moment I have two types of variable in these templates - a static var and a dynamic var. Static vars are usually things like page names, the name of the site - things that don't change often; dynamic vars are things that change on each page load.

    My question on this:

    Say I have comments on different articles. Which is the better solution: store the simple comment template and render the comments (from a DB call) each time the page is loaded, or store a cached copy of the comments page as an HTML page, re-caching it each time a comment is added/edited/deleted?

    Finally

    Does anyone have any tips/pointers for running a high-load site on PHP? I'm pretty sure it's a workable language to use - Facebook and Yahoo! give it great precedence - but are there any experiences I should watch out for?

    • Ross
      Ross over 12 years
      3.5 years later and I can't even remember what I was working on, I'd like to know what I thought was so cool too :)
    • Rimu Atkinson
      Rimu Atkinson over 11 years
      Let this be a lesson to you about premature optimization :)
  • troelskn
    troelskn over 15 years
    @writing to disk. You could even ditch the index.php and let Apache do the work for you, so that index.php is only called if the path doesn't exist. You'd use mod_rewrite for this.
  • user2012801
    user2012801 over 15 years
    I switched from PDO to mysqli and regular queries started to execute exactly 2 times faster.
  • Sylverdrag
    Sylverdrag about 15 years
    +1 because it looks like an interesting idea. I don't know how well it works performance-wise
  • Alix Axel
    Alix Axel almost 15 years
    -1, PDO is significantly slower than the MySQLi or even the MySQL extension.
  • Daren Schwenke
    Daren Schwenke over 14 years
    PDO was much slower than mysqli and didn't work right for nested queries for me. Mysqli also supports server side prepares and bound parameters just like PDO.
  • Xeoncross
    Xeoncross over 14 years
    +1 because this is an interesting idea. Those callbacks could call my caching class for me!
  • symcbean
    symcbean over 14 years
    I can't believe this was accepted as an answer. It's not very good.
  • Talvi Watia
    Talvi Watia almost 14 years
    about caching: images, CSS, HTML and JS will help; turn off cookies on images too!
  • Evan Plaice
    Evan Plaice over 13 years
    +1 There's a lot of good info here. I've been researching more on this topic lately and your answer falls in line with everything I've read. Memcache, caching, CDN for static content, reducing requests; all good stuff. I would also add: generate hashes on static content files (if you're behind a CDN/cache) server-side, so the updated files have a unique signature in the cache. Also, combine static source files (CSS, JavaScript) on the fly (and cache them with filename hashes) to cut down on requests. Also, generate thumbs dynamically (and store them in the cache).
  • Evan Plaice
    Evan Plaice over 13 years
    Google has created an Apache module called mod_pagespeed that can handle all of the file concatenation, minification, file renaming to include hashes, etc. for all static content. It should only add a little processing overhead to the server initially, until the caches (and CDN(s)) are populated with most of the content. Also, for security, it's generally a bad idea to put tables that are publicly accessible (users) in the same database as tables that handle the back-end (in case one of the tables were to be hacked).
  • Evan Plaice
    Evan Plaice over 13 years
    +1 Hands down, this should be the accepted answer. It's interesting that everything I've ever read about building databases always says "normalize all the data as much as possible" without mentioning the performance hit of doing joins. I've always intuitively felt that joins (especially multiple) added a lot of overhead, but haven't heard anyone say it explicitly until now. I wish I better understood what you were talking about with controlling when MySQL computes the indexes; it sounds like a very interesting hack.
  • Evan Plaice
    Evan Plaice over 13 years
    Data sharding is essential for databases that grow too big. Google (the company not the search engine) has a lot of interesting things to say about implementing sharding schemes. Offline processing is also huge when it comes down to limiting the number of database writes (and limiting the number of table index recalculations). I've seen lots of blogs (and I think even Stack Overflow) use this technique for their user-generated comment/feedback systems.
  • thesmart
    thesmart over 13 years
    Thank you for the comments. It is amazing that some argue for profiling middle-tier code when the VAST amount of execution time is spent in either data I/O or client-server I/O. An uber-complicated optimization saving 20% off the execution time of a PHP process that takes 40ms is pointless compared to a simple 5% saving off a 1s database query.
  • Stann
    Stann over 13 years
    @serg: care to post some tests to confirm this ?, because I seriously doubt that simply switching from PDO to mysqli would give you such a speed boost.
  • Alister Bulman
    Alister Bulman almost 13 years
    RequiredFullQuote: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
  • cHao
    cHao about 11 years
    RequiredReallyFullQuote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."