You’re thinking too low level. Who cares if the disk fails? The entire shard is set up for high availability. Each server is redundant with 1-2 other boxes (depending on the number of replicas). If you have automated master promotion you’ll never notice any downtime. All the disks in a server can fail and a slave will be promoted to the new master.
Monitoring then catches the failed server, and you have operations repair it and put it back into production as a new slave.
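The promotion flow Kevin describes can be sketched in a few lines. This is a hypothetical illustration, not anyone’s actual tooling: `Server` and `promote_if_master_down` are stand-ins for whatever monitoring and promotion system (or manual procedure) a real deployment would use.

```python
class Server:
    """Hypothetical stand-in for a database box in the shard."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.role = "slave"

def promote_if_master_down(master, slaves):
    """If the master fails its health check, promote the first healthy slave."""
    if master.healthy:
        return master
    for s in slaves:
        if s.healthy:
            s.role = "master"   # in real life: repoint replication, update config
            return s
    raise RuntimeError("no healthy slave to promote")

# Simulate the failure Kevin describes: every disk in the master dies.
master = Server("db1")
master.role = "master"
slaves = [Server("db2"), Server("db3")]

master.healthy = False                          # health check now fails
new_master = promote_if_master_down(master, slaves)
print(new_master.name)                          # db2 takes over
```

The point of the sketch is that the failover itself is cheap; the expensive part, as the rest of this post argues, is what happens to the dead box afterward.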
Someone has to think low level. The key phrase in there is “you have operations repair it and put it back into production as a new slave.” That tells me all I need to know. Kevin later states that his company does not, in fact, operate its own equipment; it uses a provider for all of its hosting.
At this point, I think this is a philosophy argument, not a real-world one. Sure, if I am Google or Yahoo I can do this. But for the vast majority of web sites running out there, having 4 data centers and “operations” at your beck and call is not a reality. For real people, having a server go down is a pain in the ass. Why would I want to spend a full day of labor rebuilding a server because a $200 part broke or just got corrupted? It takes 10 minutes to start a rebuild and maybe another 10 minutes to install a new drive if the rebuild fails.
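For the curious, on Linux software RAID that "10 minutes to start a rebuild" really is just a handful of mdadm commands. The array and device names below are placeholders for illustration; a hardware RAID controller would have its own equivalent steps.

```shell
# Single-disk failure in a software RAID-1 array (/dev/md0, members sda1/sdb1
# are placeholders). First, see which member the kernel has marked failed:
cat /proc/mdstat

# Flag and remove the dead member:
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1

# ...physically swap in the replacement $200 drive, partition it to match,
# then re-add it. The resync starts automatically:
mdadm --manage /dev/md0 --add /dev/sdb1

# Watch the rebuild progress:
cat /proc/mdstat
```

The array stays online and serving the whole time; the only labor is the drive swap.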
His other argument is about performance. Sure, it’s debatable whether RAID is faster or slower; it probably depends on the application. If your RAID is a bottleneck for your application, then you need to address that. For us, it’s far from the bottleneck, so why bother with the downtime of having one (of our 30, not 1000) servers down?
BTW, would you rather admin 30 servers or 1000? I think 30.
I should add that we only use RAID on servers that are used for data storage. Losing data sucks. For web servers we don’t use RAID; they do fit the model that Kevin describes. We have a lot of them, and if one goes down, it’s ok. Maybe Kevin’s application can fit all its data on one web node. I don’t know. I just know RAID is right for us, and I don’t see a future where I won’t want it on our servers. We are even using RAID on our MySQL Cluster servers. Why? Because I don’t want to have to wait a day to get a storage node back up and running over a $200 part.