In May the FORs leave you alone we replaced manual loops with calls to algorithms. For this we did not use the algorithms provided by the standard library but used boost.Range instead. In this post I’ll give you the rationale for that decision and introduce the transformed adaptor into the linear regression example.
Why boost ranges?
There are four main reasons for using boost.Range instead of standard algorithms.
- boost.Range eliminates syntactic overhead (which is obvious)
- boost.Range separates traversal from access category (which I don’t cover in this post)
- boost.Range separates algorithms from adaptors (for which this post will show you an example)
- Iterators are rather low level and unsafe. You can accidentally write endless loops or write over the boundaries of your container (Checked iterators help easing the pain, but do not prevent such errors by design).
Introducing the transformed adaptor
To see adaptors in action, consider again the linear regression example:
#include <boost/range/algorithm.hpp> #include <boost/range/numeric.hpp> #include <boost/phoenix/phoenix.hpp> #include <vector> #include <iostream> int main() { std::vector<double> X = {20.0, 16.0, 15.0, 16.0, 13.0, 10.0}; std::vector<double> Y = { 0.0, 3.0, 7.0, 4.0, 6.0, 10.0}; //arithmetic mean auto const mean_x = boost::accumulate(X, 0.0) / boost::size(X); auto const mean_y = boost::accumulate(Y, 0.0) / boost::size(Y); //difference from mean using boost::phoenix::arg_names::arg1; boost::transform(X, std::begin(X), arg1 - mean_x); boost::transform(Y, std::begin(Y), arg1 - mean_y); //residuals auto const variance = boost::inner_product(X, X, 0.0); auto const covariance = boost::inner_product(X, Y, 0.0); //slope and intercept of straight line auto const slope = covariance / variance; auto const intercept = mean_y - slope * mean_x; std::cout << "The slope of the regression line is " << slope << '\n' << "and the y-intercept is " << y_intercept << ".\n"; return 0; }
In the above example I think these two lines are the most shady:
boost::transform(X, std::begin(X), arg1 - mean_x); boost::transform(Y, std::begin(Y), arg1 - mean_y);
First there is the annoyance of using ranges and iterators in the same function call, using two different levels of abstraction at the same time.
Second we change our X and Y input vectors. I do not like to change them, but for now it is ok.
Finally the transform algorithm is not a very cohesive piece of functionality, is it? The std::transform overload for binary functions even takes five(!) parameters. Passing the identity function as its functor argument reveals its true nature: It is another copy algorithm! Yet it does something which copy does not. Let us use the transformed adaptor to separate these two responsibilities within our code.
using boost::adaptors::transformed; auto const mean_deviation_x = X | transformed(arg1 - mean_x); auto const mean_deviation_y = Y | transformed(arg1 - mean_y); boost::copy(mean_deviation_x, boost::begin(X)); boost::copy(mean_deviation_y, boost::begin(Y));
What is a transformed range?
If applied to a range using the pipe (“|“) operator, an adaptor creates an adapted range. mean_deviation_x the adapted range created by the transformed adaptor holds a reference to the original range X and a copy of the functor used for the transformation. The elements of mean_deviation_x are calculated lazily on access.
Where is my copy?
Notice how copy does not actually copy anything? It writes over the original range. It also uses iterators. So let’s get rid of this anachronistic tribute to the standard library:
using boost::adaptors::transformed; auto const mean_deviation_x = X | transformed(arg1 - mean_x); auto const mean_deviation_y = Y | transformed(arg1 - mean_y); boost::overwrite(mean_deviation_x, X); boost::overwrite(mean_deviation_y, Y);
This expresses much better what we are actually doing. You will find overwrite in the boost/range/algorithm_ext.hpp header.
Turning our linear regression algorithm into a function
Chances are when you implement something like linear regression, you want it to use in more than one place. Let’s create a function! There are only two lines standing in our way:
boost::overwrite(mean_deviation_x, X); boost::overwrite(mean_deviation_y, Y);
If we want our function to be able to work on constant input data, we can no longer manipulate the data contained in X. We could get around this by using boost::copy_range to allocate a new vector:
auto const cached_mean_deviation_x = boost::copy_range<std::vector<double>>(mean_deviation_x);
Yet there is a better way. Simply remove the two calls to overwrite and directly use the adapted ranges instead. Here is the complete example after the refactoring.
#include <boost/range/algorithm.hpp> #include <boost/range/numeric.hpp> #include <boost/phoenix/phoenix.hpp> #include <boost/range/adaptor/transformed.hpp> #include <iostream> struct straight_line{ double slope; double y_intercept; }; template< typename Forward_Range1, typename Forward_Range2> straight_line linear_regression(Forward_Range1 const& X, Forward_Range2 const& Y) { //arithmetic mean auto const mean_x = boost::accumulate(X, 0.0) / boost::size(X); auto const mean_y = boost::accumulate(Y, 0.0) / boost::size(Y); //deviation from mean using boost::phoenix::arg_names::arg1; using boost::adaptors::transformed; auto const mean_deviation_x = X | transformed(arg1 - mean_x); auto const mean_deviation_y = Y | transformed(arg1 - mean_y); //residuals auto const variance = boost::inner_product(mean_deviation_x, mean_deviation_x, 0.0); auto const covariance = boost::inner_product(mean_deviation_x, mean_deviation_y, 0.0); //slope and intercept of straight line auto const slope = covariance / variance; auto const intercept = mean_y - slope * mean_x; return {slope, intercept}; } int main() { auto const X = {20.0, 16.0, 15.0, 16.0, 13.0, 10.0}; auto const Y = { 0.0, 3.0, 7.0, 4.0, 6.0, 10.0}; auto const fit = linear_regression(X, Y); std::cout << "The slope of the regression line is " << fit.slope << '\n' << "and the y-intercept is " << fit.y_intercept << ".\n"; }
Nice, isn’t it?
Compiling the above example
Some people, when shown code like that above, have trouble believing that this is valid C++. It simply contradicts too much how they think code has to look like in a low level language like C++. Sadly, some compilers still agree with their views. For this to compile you need a compiler supporting uniform initialization syntax, decltype and the most current version of boost (that is version 1.52.0). You can, however, lower the requirements significantly by applying a small change. If your compiler does not support the new decltype keyword or you use a boost version older than 1.52.0, you can replace the phoenix lambda
arg1 - mean_x
with
std::bind2nd(std::minus<double>(),mean_x)
It is not as beautiful, but it should work well on pre-C++11 compilers.
Epilog
Readabilty is a major argument for using ranges, algorithms and adaptors. Yet, it is tough to convince sceptics using this argument. In the end readability can not be defined without the reader, who brings his own set of experiences. Some things (loops in particular) might simply be considered more readable because people are used to them.
Can you shed more light on what is actually going on inside regarding performance. How often in this function is a complete data traverse performed? Is there a copy involved somewhere (ram usage)? Could it be implemented ‘stream based’ (no ram usage) by fetching the (single pass) data n-times (n is the number of minimal traverses)?
Do I have to look in the implementation to find out these critical performance issues or is there a better way to find out (documentation?)?
Hello Sebastian,
to answer your question let me introduce a distinction between forward ranges and single pass ranges. Forward ranges can be iterated only in one direction, but more than once. Single pass ranges can also only be iterated in forward direction, but you can do that only once. As the type parameters named ‘Forward_Range1′ and ‘Forward_Range2′ tell you, this implementation iterates several times over the ‘X’ and ‘Y’ input data.
Will the data be hold in the RAM? The ‘linear_regression’ function peresented in this example is agnostic to the answer of that question. It actually depends on the implementation of your forward range. However, no additional copy of the data is made. Stream based ranges are typically single pass and would thus not meet the requirements of this implementation of linear_regression. If you need an implementation which does only require single pass traversal I would try to achieve it using boost.Accumulators. (http://www.boost.org/doc/libs/1_52_0/doc/html/accumulators.html).
That actually is a nice idea for a blog post.
Ps. Except for copy_range I do not think any algorithm will actually allocate new copies of data containers.
I am not really comfortable with the consequences of extensive use of the auto type specifier and think it is used a little too much in this example.
The obscure part of this function is that it is the two “0.0” literals in the boost::accumulate calls that increases the precision for the calculations (from integer to double). It would be clearer to explicitly state that the result of the boost::accumulate is a double (unless the types of the ranges are long doubles of course).
Hello Linus,
again, you have a good point. I do not like the implicit cast from int to double either. I changed the sample code to use double literals. I applied this change even in the first example where a vector of double is declared explicitly. Concerning the use of auto: Michael and I still (respectfully) disagree where and when it should be used. We are thinking of writing a post together on this subject, each taking a side in this dispute. The very essence of my rationale for using auto is that types are for compilers, not humans.