So, I wrote about the beginning of our wild database issues. Since then, I have been fighting a cold, coaching little league football, and trying to help get our backup solutions working in top shape. That does not leave much time for blogging.
Never again will we have ONLY a cold backup of anything. We had been moving nightly full database dumps and hourly backups of critical tables over to that box all day long. Well, when the filesystem fails on both the primary database server and your cold backup server, you question everything. A day after my marathon drive to fix the backup server and get it up and running, the backup MySQL server died again with RAID errors. I guess that was the problem all along. In the end, we had to put a whole new RAID subsystem in our backup database server. So, my coworker headed over to the data center to pull the all-nighter and get the original, main database server up and running. The filesystem was completely shot. ReiserFS failed us miserably. It is no longer to be used at dealnews.
Well, today at 6:12PM, the main database server stopped responding again. ARGH!! Input/Output errors. Based on last week's experience, that means RAID. We rebooted it. It reported memory or battery errors on the RAID card. So, I called Dell. Our warranty on these servers includes 4-hour, onsite service. They are important. While on the phone with Dell, I ran the Dell diagnostic tool on the box. During the diagnostic test, the box shut down. Luckily, the Dell service tech had heard enough. He ordered a whole new RAID subsystem for this one as well.
There is one cool thing about the PERC4 (a.k.a. LSI MegaRAID) RAID cards in these boxes. They write the RAID configuration to the drives as well as to the card. So, when a new, blank RAID card is installed, it finds the RAID config on the drives and boots the box up. Neato. I am sure all the latest cards do it. It was just nice to see it work.
So, the box came up, but this time we had InnoDB corruption. XFS did a fine job of keeping the filesystem intact. So, we had to go to backups. But this time we had a live replicated database that we could just dump and restore. We should have had it all along, but in the past (i.e. before widespread InnoDB) we were gun shy about replication. We had large MyISAM tables that would constantly get corrupted on the master or slave and would halt replication on a weekly basis. It was just not worth the hassle. But we have used it for over a year now on our front end database servers with an all-InnoDB data set. As of now, only two tables in our main database are not InnoDB. And I am trying to drop the need for a Full-Text index on those right now.
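For the curious, the dump-and-restore from a live replica looks roughly like this. This is a sketch, not our exact commands; the host names, user names, and database name are made up. The key detail is `--single-transaction`, which gives a consistent snapshot of InnoDB tables without locking them (it does not help MyISAM tables, which is one more reason to convert everything).

```shell
# Dump from the replica. --single-transaction takes a consistent InnoDB
# snapshot without blocking reads/writes; --master-data=2 writes the binlog
# coordinates into the dump as a comment, so the restored copy can be
# pointed back at replication later if needed.
# (replica-db, backup_user, and maindb are hypothetical names.)
mysqldump --single-transaction --master-data=2 \
    -h replica-db -u backup_user -p maindb > maindb.sql

# Restore onto the rebuilt primary (again, hypothetical names):
mysql -h primary-db -u admin -p maindb < maindb.sql
```

Because the replica keeps serving while the dump runs, this beats a cold backup on both freshness and downtime.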
So, here is to hoping our database problems are behind us. We have replaced almost everything in one box except the chassis. The other has had all internal parts replaced but the motherboard. Kudos to Dell's service. The tech was done with the repair in under 4 hours. Glad to have that service. I recommend it to anyone who needs it.