All the tests described in this page have been obtained on an Intel Q6600, 2GB RAM machine, using a ST3320620AS hard drive. Unless where it's noted, NCQ was disabled for the drive and default parameters were used for the schedulers.
The picture above shows the aggregate throughput of 16, 32 and 64 jobs issuing synchronous, sequential, O_DIRECT, 256KiB block size reads on 64MiB files. The slightly higher throughput obtained by BFQ is not because of the scheduling algorithm, but is a consequence of the default settings of the two schedulers. BFQ does not assign fixed-size budgets, but tries to adapt them to the behavior of the application; in this case the tasks are getting always the maximum budget, which is (with the default settings) higher than the CFQ time-based equivalent. This means that there are less switches between different tasks and it is the only reason for the throughput difference.
In this case (uniform weights, uniform budget/slice lenghts) the order in which BFQ assigns its budgets and the order used by CFQ for its timeslices is very similar, it is quite close to a Round Robin among the tasks.
Here we are dealing with greedy readers, so the latency is hardly an issue (a dd-like application is not sensible to the latency of each single request). Nonetheless, for completeness, we have to note that the higher throughput of BFQ is paid in terms of higher completion latencies, as shown in the picture below.
Of course changing the parameters one can decrease the latencies of BFQ or increase the throughput of CFQ; BFQ's default parameters, with autotuning enabled, were chosen in order to have a full-sized budget, consumed at the observed disk peak rate, last no more than the CFQ default slice (100ms). Obviously not all the budgets are consumed that fast, so slices can last from a little bit less than 100ms to timeout_sync, which, by default, is 125ms. In other words, the default BFQ parameters map the CFQ defaults to the service domain slightly favoring the aggregate throughput over latency in this kind of workload.
To obtain the same behavior as CFQ in this test (and in general when dealing with synchronous sequential readers all having the same ioprio) one would have to manually set a timeout_sync equal to CFQ's slice_sync and a max_budget higher than the maximum amount of sectors that the disk can transfer in slice_sync msecs.
In general these are the guidelines to follow when tuning:
- timeout_sync: decreasing it should decrease the maximum budget that can be assigned to a task by the autotuning mechanism. Smaller budgets mean lower throughput and lower latencies.
- max_budget: manually set a value for the maximum budget. BFQ will still adapt the budgets to the task's behavior, but they will not exceed the user-supplied value.
Aggregate Throughput, Random Access
The picture above shows the aggregate throughput of 16, 32 and 64 jobs issuing synchronous, random, O_DIRECT, 256KiB block size reads on 64MiB files. Even this case is perfectly simmetric, but given the workload, the longer slices assigned from BFQ do not produce any significant difference.
The latencies are determined by the sync_timeout parameter, since all the tasks are not fast enough to consume their budget in time, and the behavior of the scheduler switches to a sort of time-domain fairness among the seeky tasks.
In the picture below we can see the same effect as in the previous case, with the round-robin schedule of the tasks and the slice duration in BFQ that reaches higher values than with CFQ. We do not consider this extra delay with respect to CFQ a problem, as, even in this case, the applications are not likely to be sensible to the completion latency of the requests.
To evaluate the fairness of the two schedulers we used three sequential readers with different priorities, accessing files at the two opposite ends of the disk, to maximise the seek times when switching between different processes. Even in this case the sequential readers issue synchronous O_DIRECT requests of 256KiB. The priorities are chosen to have a 1:2 weight ratio between the two low prio tasks, that read from 64MiB files, and the high prio one, that reads from a 128MiB file.
There is no documented relation between the ioprio values and the weight assigned to tasks by CFQ, so we had to look at the code to select two values givind the desired ratio. We chose 3:6 as it was the most balanced one WRT the default priority value. Please note that the other configurations would have given different throughputs, as there is no way in CFQ (without changing the slice_sync parameter) to decouple the weight ratio from the length of the slices assigned.
Since the sequential readers are all ``well-behaved'' and complete their budgets fast enough, BFQ can be fair in the service domain. CFQ's time-domain fairness is paid in the service domain as shown in the figure. Tasks 0 and 1 are the low prio ones, Task 2 is the high prio one.
NOTE: both the approaches have pro and cons; we just wanted to show that for applications where the service domain fairness is important, it can be achieved without incurring in significant penalties (at least in the workloads we tried).
The original BFQ paper contained an experimental evaluation of a streaming server setup. While the source code for the original experiment is avaliable on Paolo's page, here we have used a synthetic fio setup that tries to reproduce similar access patterns; the configuration file can be downloaded here.
There are 32 intermittent readers, with a thinktime of 40ms, issuing synchronous reads of 64KiB blocks from the initial region of the disk, while in the background there are 10 greedy readers, 5 in the internal region of the disk, 5 in the external one. The 32 on/off readers start after a small delay to avoid artifacts due to the concurrent creations of all the processes.
The picture below shows the completion latencies of the requests issued by the 32 on/off readers (the fictitious streams). BFQ shows an initial peak and then always remains under the values obtained by CFQ. The initial peak is caused by the fact that BFQ is still adapting itself to the requests being issued. As we said before, the default parameters for BFQ are chosen to match the throughput of CFQ, and this case shows that the feedback mechanism can compensate for the higher latencies that can result from that choice; needless to say, the peak can be lowered decreasing the value of timeout_sync.
The service of the streams terminates earlier with BFQ than with CFQ because the latter tends to privilege the greedy readers, reducing the bandwidth left to the on/off ones. Of course this would not have the same effect with a real streaming application, where the intermittent readers would have been periodic. This example shows a pattern which is likely to be seen in latency-sensitive applications.
The figure below shows an example of a tuned setup, with the on/off reader being assigned an ioprio of 3 instead of the default of 4 (both on BFQ and on CFQ), and with a timeout_sync of 100ms, the same as the CFQ slice_sync. The initial peak is still there, but it is comparable with the highest CFQ peak, and the average latency remains lower. CFQ continues to favor the backgound traffic, as the higher completion time for the on/off workload shows.
Last updated November 10th, 2008