Ramblings of a web guy

Is Yahoo!ed a word?

Posted in Linux, MySQL, PHP, Programming by Brian Moon on December 22nd, 2006

Everyone has heard of being slashdotted or maybe dugg. But have you ever been Yahoo!ed?

Phones started beeping, mayhem ensued. The first thing we looked at was the database. Is some MyISAM table locked? Is there a hung log processor running? The database was busy, but it looked odd. The web servers were going nuts.

As we soon discovered, we (dealnews.com) were mentioned in an article on Yahoo!. At 5PM Eastern, that article became the featured article on the Yahoo! front page. It was there for an hour. We went from our already high Christmas traffic of about 80 req/s for pages and 200 req/s for images to 130 req/s for pages and 500 req/s for images. We survived with a little tinkering. We had been working on a proxy system and this sounded like as good a time as any to try it out. Thanks to the F5 BIG-IP load balancers, we could send all the traffic from Yahoo! to the proxy system. That allowed us to handle the traffic. Just after 6PM, Yahoo! changed the featured article and things returned to normal.

Until 9PM. It seems the earlier posting by Yahoo! must not have gone out to all their users, because at 9PM the connections came back with a vengeance. We started hitting bottleneck after bottleneck. We would raise one limit and another bottleneck would appear. The site was doing ok during this time. Some things like images were loading slowly. That was a simple underestimation on our part: our two image servers were set to only 250 MaxClients. Their load was nothing. We upped that and images flowed freely once again. Next we realized that all our memcached daemons were maxed out on connections. So, again, we up that and restart them. That’s fixed now. Oh, now that we are not waiting on memcached, the Apache/PHP servers are hitting their MaxClients. We check the load and the servers are not stressed. So, up those limits go. The proxy servers were not doing well using a pool of memcached servers. So, we set them to use just one server each. This means several copies of the same cache, but better access to the data for each server. After all that, we were handling the Yahoo! load.
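
A rough way to see why a connection limit can bottleneck a server whose load is "nothing" is Little's law: the average number of busy workers is about the arrival rate times how long each request holds a worker. The sketch below is only an illustration. The 200 ms hold time is a made-up number (it would include any time a worker sits on an idle keep-alive connection), and the 3000 req/s is the peak image rate from this post split evenly across two image servers, which is also an assumption.

```python
import math


def workers_needed(requests_per_second: float, ms_held_per_request: float) -> int:
    """Little's law: average concurrency = arrival rate * time each request holds a worker."""
    return math.ceil(requests_per_second * ms_held_per_request / 1000)


# Assumed inputs, not measurements: peak image traffic split over two servers,
# and a hypothetical 200 ms during which each request ties up one worker.
per_server_rate = 3000 / 2
hold_time_ms = 200
max_clients = 250

needed = workers_needed(per_server_rate, hold_time_ms)
print(f"workers needed: {needed}, MaxClients: {max_clients}, enough: {needed <= max_clients}")
```

Under those assumed numbers the server needs more workers than MaxClients allows, even though each worker is mostly waiting rather than burning CPU.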

In the end, it was 300 req/s for pages and 3000 req/s for images. It lasted for over 2 hours. The funny thing is, we have been talking all week about how to increase our capacity before next Christmas. Given our content, this is our busy time. Our traffic has doubled each December for the last 3 years. At one point, during the Yahoo! rush, the incoming traffic was 10MB/s. A year and a half ago, that was the size of our whole pipe with our provider. Luckily we increased that a while back.

The silver lining is that I got to see this traffic first hand for over 2 solid hours. This will help us to design our systems to handle this load and then some all the time in the future. In some ways it was a blessing.

Digg? Slashdot? They can bring traffic for sure. We have been on both several times. But wow, just getting in the third paragraph of an article that is one page deep from the Yahoo! front page can bring you to your knees if you are not ready. But, in this business, I will do it again tomorrow. Bring it on.

Update:  Yahoo! put the article on their front page again on the 26th.  Both our head sys admin and I were off.  No phones went off.  We handled 400 req/s for the front pages and 1500 req/s for images.  This lasted for 3 hours.  Granted, some things were not working.  You could not change your default settings for the front page for example.  But, all in all, the site performed quite well.

Browser KeepAlive Secrets

Posted in Firefox, HTML, PHP, Programming by Brian Moon on December 5th, 2006

So, at dealnews, we are getting ready to launch a super secret thing (redesign beta preview) that requires us to use some cookie tricks.  What we decided to do was to give our users a link to a page that would set a cookie.  Then we configured our F5 BIG-IP load balancers to direct those users with the cookie set to a different pool (back end ip/port pairs).  It's not an original idea.  Yahoo! was doing something similar with their recent front page beta.  In fact, that is where I got the idea.
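
The routing decision itself is simple. Here is a minimal Python sketch of the idea, not the actual BIG-IP configuration. The cookie name and pool names are made up for illustration.

```python
from http.cookies import SimpleCookie

BETA_COOKIE = "beta_preview"  # hypothetical cookie name, not the real one


def choose_pool(cookie_header: str) -> str:
    """Pick a back end pool based on whether the beta cookie is present."""
    cookies = SimpleCookie()
    cookies.load(cookie_header)  # parses a raw "Cookie:" header value
    return "beta_pool" if BETA_COOKIE in cookies else "default_pool"


print(choose_pool("session=abc; beta_preview=1"))
print(choose_pool("session=abc"))
```

A request carrying the cookie goes to the redesign servers, and everything else goes to the regular pool.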

Well, it worked great in testing with mod_rewrite (buying a $40k device for testing is not in the budget right now) on my local machine and on the test servers.  We had no problems.  However, when we turned it all on in production using the BIG-IP we got some unexpected results.  We could go to the URL to set our cookie and our site would change.  On the redesigned page, there is another link to switch you back.  It simply deleted the cookie and redirected you.  Since the cookie was gone, you would be back to the old design, right?  WRONG!  You were stuck.  But, if you did not click on any links on the site for about, oh, 15 seconds, you would get back to the old design.  I should say at this point that Safari was the only browser that did not do this.  IE, Mozilla and Opera all had this problem.

Hmm, 15 seconds.  That is the default KeepAliveTimeout in Apache.   I took a chance and disabled keep alive in Firefox (about:config, search for keep, set to false).  BAM!  It all worked like a charm.  It seemed that IE, FF and Opera all keep your keep alive connection open even after the page is done loading.  And because the BIG-IP determined which pool you were connected to at connection time, you stayed connected to the new pool rather than switching back.  And, as long as you kept clicking around on our site, you would keep that connection open.
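
To make the behavior concrete, here is a toy Python model of it. It assumes, as described above, that the pool is picked once when the connection opens and that later requests on the same connection are never re-checked. The class and function names are invented for this sketch.

```python
class Connection:
    """A connection whose pool is chosen once, at connect time."""

    def __init__(self, has_cookie: bool):
        self.pool = "beta" if has_cookie else "default"

    def request(self, has_cookie: bool) -> str:
        # The cookie on later requests is ignored: the pool is already pinned.
        return self.pool


def browse(cookie_states, keep_alive: bool):
    """Return which pool served each click, given the cookie state at each click."""
    served = []
    conn = None
    for has_cookie in cookie_states:
        if conn is None or not keep_alive:
            conn = Connection(has_cookie)  # a new connection re-evaluates the cookie
        served.append(conn.request(has_cookie))
    return served


# Two clicks with the cookie set, then two clicks after it was deleted.
clicks = [True, True, False, False]
print(browse(clicks, keep_alive=True))   # ['beta', 'beta', 'beta', 'beta']
print(browse(clicks, keep_alive=False))  # ['beta', 'beta', 'default', 'default']
```

With keep alive on, the deleted cookie never gets a chance to matter until the connection times out. With it off, every click opens a new connection and the switch back happens right away.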

As for a solution, we decided to let Apache do the work for us.  We didn’t want to tell the BIG-IP to start disconnecting users on every request.  Instead, we used a Location directive and SetEnv to set the nokeepalive environment variable only when users access the page that sets/unsets the cookie.  Now Apache sends the Connection: close header and the browsers comply.  You can see an immediate difference too.  Firefox for example has a noticeable pause while it closes the connection and makes a new one.  I am going to dig around in the BIG-IP manual some more to see if there is anything we can do to make this work at the load balancer layer.  But, I don’t really want my load balancers spending CPU cycles on something that will not be an issue once this redesign is launched.
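
For reference, the Apache side looks roughly like the fragment below. Treat it as a sketch: the /beta/switch path is a placeholder, not our real URL, and it assumes mod_env is loaded so that SetEnv is available.

```apache
# Placeholder path for the page that sets/unsets the cookie
<Location /beta/switch>
    # Disable keep alive for responses from this location only,
    # so the browser has to open a new connection afterwards
    SetEnv nokeepalive 1
</Location>
```

Every other URL on the site keeps its normal keep alive behavior, so only the switch page pays the cost of a new connection.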