Replication slaves lock up when master disk is full

So, one of our master server’s disks filled up with replication logs last night. We had a permission issue due to our data center move, an oversight on my part. Anyhow, I got to the server and purged the old logs, and the master started responding again with no problems. The slaves, however, did not. Neither of the two slaves connected to that master would respond to a SLAVE STOP, nor to a mysqladmin shutdown. I had to kill -9 the daemons to get them to stop. Once I restarted each daemon and started the slave, all was fine.

I am hoping someone knows what is going on or can tell me what to send in a bug report. I don’t want to just put the story above in a bug report. On its own it is kind of lame and useless.

Update: This may be my problem: Bug #31024. I certainly did not give the slave a long time to stop.
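
In hindsight, the step I skipped was waiting a while before reaching for kill -9. Here is a minimal sketch of that wait-then-escalate pattern in plain shell. The function name stop_with_grace and the grace period are made up for illustration and are not anything MySQL ships, and a slave stuck the way mine was may well still need the final kill.

```shell
#!/bin/sh
# Hypothetical helper (not part of MySQL): ask a process to exit,
# wait up to a grace period, and only then force-kill it.
# Usage: stop_with_grace PID SECONDS
# Returns 0 if the process exited on its own, 1 if KILL was needed.
stop_with_grace() {
    pid=$1
    grace=$2

    # Polite request first; if the signal cannot be sent, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    waited=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            echo "pid $pid still running after ${grace}s, sending KILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Something like `stop_with_grace "$(cat /path/to/mysqld.pid)" 120` would be the idea, where the pid file path is a placeholder for wherever your install keeps it.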


4 Responses to Replication slaves lock up when master disk is full

  1. burtonator says:

    This is a really sad and pathetic bug.

    I’ve been working on a list of common distributed system bugs and this is going on the list.

    You can fix it on your end though.

    Look at the connect_timeout variable.

    It defaults to 3600 seconds but I reduce it down to 30 seconds.

    When you run SLAVE STOP it won’t break the timeout but it will stop after 30 seconds.

    Other bugs I’ve seen on this topic:

    * infinite DNS caching
    * DNS caching within the app
    * Infinite or LONG read timeouts.

    … etc.

  2. doughboy says:

    No burtonator, that is not it.

    # mysqladmin var | fgrep connect_timeout
    | connect_timeout | 5 |

    I gave the slaves at least a minute.

    If you follow the bug report I linked in my update, it is explained there in more detail. Also, a patch has been committed to fix this in future releases. Open Source scores a point there. My issue is confirmed, I can see the patch, and I know the fix is coming.

  3. burtonator says:

    OK… I read the bug report. Maybe I’m missing something.

    It’s a connect-related issue, and the connect timeout would control how long you sit blocked on IO before you can move forward.

    Did you set connect_timeout before or after the connect attempt?

    This bug just seems like a fix to kill the connect.

    I should look at the patch…

  4. doughboy says:

    See this bug for slave’s connect timeout.
