Question 1 :
When and how much can MySQL replication improve the performance of my system?
MySQL replication is most beneficial for a system with frequent reads and not so frequent writes. In theory, by using a one master/many slaves setup you can scale by adding more slaves until you either run out of network bandwidth, or your update load grows to the point that the master cannot handle it.
In order to determine how many slaves you can get before the added benefits begin to level out, and how much you can improve performance of your site, you need to know your query patterns, and empirically (by benchmarking) determine the relationship between the throughput on reads (reads per second, or max_reads) and on writes max_writes) on a typical master and a typical slave. The example below will show you a rather simplified calculation of what you can get with replication for our imagined system.
Let's say our system load consists of 10% writes and 90% reads, and we have determined that max_reads = 1200 - 2 * max_writes, or in other words, our system can do 1200 reads per second with no writes, our average write is twice as slow as average read, and the relationship is linear. Let us suppose that our master and slave are of the same capacity, and we have N slaves and 1 master. Then we have for each server (master or slave):
reads = 1200 - 2 * writes (from bencmarks)
reads = 9* writes / (N + 1) (reads split, but writes go to all servers)
9*writes/(N+1) + 2 * writes = 1200
writes = 1200/(2 + 9/(N+1)
So if N = 0, which means we have no replication, our system can handle 1200/11, about 109 writes per second (which means we will have 9 times as many reads due to the nature of our application).
If N = 1, we can get up to 184 writes per second.
If N = 8, we get up to 400.
If N = 17, 480 writes.
Eventually as N approaches infinity (and our budget negative infinity), we can get very close to 600 writes per second, increasing system throughput about 5.5 times. However, with only 8 servers, we increased it almost 4 times already.
Note that our computations assumed infinite network bandwidth, and neglected several other factors that could turn out to be signficant on your system. In many cases, you may not be able to make a computation similar to the one above that will accurately predict what will happen on your system if you add N replication slaves. However, answering the following questions should help you decided whether and how much, if at all, the replication will improve the performance of your system:
What is the read/write ratio on your system?
How much more write load can one server handle if you reduce the reads?
How many slaves do you have bandwidth for on your network?
Question 2 :
How can I use replication to provide redundancy/high availability?
With the currently available features, you would have to set up a master and a slave (or several slaves), and write a script that will monitor the master to see if it is up, and instruct your applications and the slaves of the master change in case of failure. Some suggestions:
To tell a slave to change the master use the CHANGE MASTER TO command.
A good way to keep your applications informed where the master is by having a dynamic DNS entry for the master. With bind you can use nsupdate to dynamically update your DNS.
You should run your slaves with the log-bin option and without log-slave-updates. This way the slave will be ready to become a master as soon as you issue STOP SLAVE; RESET MASTER, and CHANGE MASTER TO on the other slaves. It will also help you catch spurious updates that may happen because of misconfiguration of the slave (ideally, you want to configure access rights so that no client can update the slave, except for the slave thread) combined with the bugs in your client programs (they should never update the slave directly).
We are currently working on intergrating an automatic master election system into MySQL, but until it is ready, you will have to create your own monitoring tools.
Question 3 :
MySQL - Troubleshooting Replication
If you have followed the instructions, and your replication setup is not working, first elliminate the user error factor by checking the following:
Is the master logging to the binary log? Check with SHOW MASTER STATUS. If it is, Position will be non-zero. If not, verify that you have given the master log-bin option and have set server-id.
Is the slave running? Check with SHOW SLAVE STATUS. The answer is found in Slave_running column. If not, verify slave options and check the error log for messages.
If the slave is running, did it establish connection with the master? Do SHOW PROCESSLIST, find the thread with system user value in User column and none in the Host column, and check the State column. If it says connecting to master, verify the privileges for the replication user on the master, master host name, your DNS setup, whether the master is actually running, whether it is reachable from the slave, and if all that seems ok, read the error logs.
If the slave was running, but then stopped, look at SHOW SLAVE STATUS output andcheck the error logs. It usually happens when some query that succeeded on the master fails on the slave. This should never happen if you have taken a proper snapshot of the master, and never modify the data on the slave outside of the slave thread. If it does, it is a bug, read below on how to report it.
If a query on that succeeded on the master refuses to run on the slave, and a full database resync ( the proper thing to do ) does not seem feasible, try the following:
First see if there is some stray record in the way. Understand how it got there, then delete it and run SLAVE START If the above does not work or does not apply, try to understand if it would be safe to make the update manually ( if needed) and then ignore the next query from the master.
If you have decided you can skip the next query, do SET SQL_SLAVE_SKIP_COUNTER=1; SLAVE START; to skip a query that does not use auto_increment, last_insert_id or timestamp, or SET SQL_SLAVE_SKIP_COUNTER=2; SLAVE START; otherwise If you are sure the slave started out perfectly in sync with the master, and no one has updated the tables involved outside of slave thread, report the bug, so you will not have to do the above tricks again.
Make sure you are not running into an old bug by upgrading to the most recent version.
If all else fails, read the error logs. If they are big, grep -i slave /path/to/your-log.err on the slave. There is no generic pattern to search for on the master, as the only errors it logs are general system errors - if it can, it will send the error to the slave when things go wrong.
When you have determined that there is no user error involved, and replication still either does not work at all or is unstable, it is time to start working on a bug report. We need to get as much info as possible from you to be able to track down the bug. Please do spend some time and effort preparing a good bug report. Ideally, we would like to have a test case in the format found in mysql-test/t/rpl* directory of the source tree. If you submit a test case like that, you can expect a patch within a day or two in most cases, although, of course, you mileage may vary depending on a number of factors.
Second best option is a just program with easily configurable connection arguments for the master and the slave that will demonstrate the problem on our systems. You can write one in Perl or in C, depending on which language you know better.
If you have one of the above ways to demonstrate the bug, use mysqlbug to prepare a bug report and send it to firstname.lastname@example.org. If you have a phantom - a problem that does occur but you cannot duplicate "at will":
Verify that there is no user error involved. For example, if you update the slave outside of the slave thread, the data will be out of sync, and you can have unique key violations on updates, in which case the slave thread will stop and wait for you to clean up the tables manually to bring them in sync.
Run slave with log-slave-updates and log-bin - this will keep a log of all updates on the slave.
Save all evidence before reseting the replication. If we have no or only sketchy information, it would take us a while to track down the problem. The evidence you should collect is:
All binary logs on the master
All binary log on the slave
The output of SHOW MASTER STATUS on the master at the time you have discovered the problem
The output of SHOW SLAVE STATUS on the master at the time you have discovered the problem
Error logs on the master and on the slave
Use mysqlbinlog to examine the binary logs. The following should be helpful to find the trouble query, for example:
mysqlbinlog -j pos_from_slave_status /path/to/log_from_slave_status | head
Once you have collected the evidence on the phantom problem, try hard to isolate it into a separate test case first. Then report the problem to email@example.com with as much info as possible.
Question 4 :
Getting Maximum Performance from MySQL
Optimization is a complicated task because it ultimately requires understanding of the whole system. While it may be possible to do some local optimizations with small knowledge of your system/application, the more optimal you want your system to become the more you will have to know about it.
So this chapter will try to explain and give some examples of different ways to optimize MySQL. But remember that there are always some (increasingly harder) additional ways to make the system even faster.
Question 5 :
MySQL - Optimization Overview
The most important part for getting a system fast is of course the basic design. You also need to know what kinds of things your system will be doing, and what your bottlenecks are.
The most common bottlenecks are:
Disk seeks. It takes time for the disk to find a piece of data. With modern disks in 1999, the mean time for this is usually lower than 10ms, so we can in theory do about 1000 seeks a second. This time improves slowly with new disks and is very hard to optimize for a single table. The way to optimize this is to spread the data on more than one disk. Disk reading/writing. When the disk is at the correct position we need to read the data. With modern disks in 1999, one disk delivers something like 10-20Mb/s. This is easier to optimize than seeks because you can read in parallel from multiple disks.
CPU cycles. When we have the data in main memory (or if it already were there) we need to process it to get to our result. Having small tables compared to the memory is the most common limiting factor. But then, with small tables speed is usually not the problem.
Memory bandwidth. When the CPU needs more data than can fit in the CPU cache the main memory bandwidth becomes a bottleneck. This is an uncommon bottleneck for most systems, but one should be aware of it.