MShadow中的表达式模板

博客分享

 0  204

优雅殿下 2022-02-27 20:55:53

悬赏：0 积分收藏

MShadow中的表达式模板

表达式模板是Eigen、GSL和boost.uBLAS等高性能C++矩阵库的核心技术。本文基于MXNet给出的教程文档来阐述MXNet所依赖的高性能矩阵库MShadow背后的原理。

编写高效的机器学习代码

我们先来思考一个问题：如何才能编写出高效的机器学习代码？假设DNN模型按照下面的代码进行权重更新，其中weight和grad都是长度为n的vector：

weight = -eta * (grad + lambda * weight)

既然我们选择C++来实现矩阵计算，那么性能肯定是最先要考虑的因素。在C++编程中，一个重要的原则就是——预先分配好所需的内存，不要在运行时申请分配临时内存。因此，我们可能会实现一个名为UpdateWeight的函数，其形参grad和weight均为指针：

void UpdateWeight(const float *grad, float eta, float lambda,                  int n, float *weight) {  for (int i = 0; i < n; ++i) {    weight[i] =  -eta * (grad[i] + lambda * weight[i]);  }}

指针grad和weight所指向的内存空间都是预先分配好的，函数在运行期只需要执行计算。显然，上面的代码非常简单直观，但是，当我们反复编写它们时可能会很烦（指编写循环处理每个元素）。因此问题是，我们能否像下面这样编写代码，同时又能获得上述代码的性能呢？答案是肯定的，但是方法可能并没有那么直观。

void UpdateWeight(const Vec& grad, float eta, float lambda, Vec& weight) {  weight = -eta * (grad + lambda * weight);}

运算符重载

运算符重载是一种非常容易想到的解决方案。通过重载相应的运算符，我们可以将元素处理的细节隐藏在运算符中，简单地调用运算符就可以实现相应的操作。

// Naive solution for vector operation overloadingstruct Vec {  int len;  float* dptr;  Vec(int len) : len(len) {    dptr = new float[len];  }  Vec(const Vec& src) : len(src.len) {    dptr = new float[len];    memcpy(dptr, src.dptr, sizeof(float)*len );  }  ~Vec(void) {    delete [] dptr;  }};inline Vec operator+(const Vec &lhs, const Vec &rhs) {  Vec res(lhs.len);  for (int i = 0; i < lhs.len; ++i) {    res.dptr[i] = lhs.dptr[i] + rhs.dptr[i];  }  return res;}

然而，这种方法并不高效，原因是每次调用运算符时都会有内存空间的申请和释放。另一种更高效的方法是仅重载运算符+=和-=，他们无需临时内存分配即可实现，但这又限制了我们可以调用的运算符的数量，得不偿失。下一小节，我们将介绍如何利用表达式模板实现延迟计算。

延迟计算

在调用operator+时，因为我们不知道运算符的结果要赋值给哪个变量，所以需要申请一块临时内存空间把结果保存下来。否则，如果我们能提前直到运算结果要存放在哪个变量中，那么就可以直接将结果存储到相应的内存空间。下面的代码说明了这一情况：

// Example Lazy evaluation code// for simplicity, we use struct and make all members public#include <cstdio>struct Vec;// expression structure holds the expressionstruct BinaryAddExp {  const Vec &lhs;  const Vec &rhs;  BinaryAddExp(const Vec &lhs, const Vec &rhs)  : lhs(lhs), rhs(rhs) {}};// no constructor and destructor to allocate and de-allocate memory,//  allocation done by userstruct Vec {  int len;  float* dptr;  Vec(void) {}  Vec(float *dptr, int len)      : len(len), dptr(dptr) {}  // here is where evaluation happens  inline Vec &operator=(const BinaryAddExp &src) {    for (int i = 0; i < len; ++i) {      dptr[i] = src.lhs.dptr[i] + src.rhs.dptr[i];    }    return *this;  }};// no evaluation happens hereinline BinaryAddExp operator+(const Vec &lhs, const Vec &rhs) {  return BinaryAddExp(lhs, rhs);}const int n = 3;int main(void) {  float sa[n] = {1, 2, 3};  float sb[n] = {2, 3, 4};  float sc[n] = {3, 4, 5};  Vec A(sa, n), B(sb, n), C(sc, n);  // run expression  A = B + C;  for (int i = 0; i < n; ++i) {    printf("%d:%f==%f+%f\\n", i, A.dptr[i], B.dptr[i], C.dptr[i]);  }  return 0;}

在这段代码中，运算符operator+不进行实际的计算，只返回一个表达式结构BinaryAddExp，它里面保存了进行向量加法的两个操作数。当重载operator=时，我们就可以知道向量加法的目标变量以及对应的两个操作数，因此，在这种情况下，不需要任何（运行时）内存分配就可以执行计算操作！类似的，我们可以定义一个DotExp并在operator=处进行惰性求值，然后在内部调用BLAS实现矩阵乘法。

更复杂的表达式以及表达式模板

使用延迟计算能够避免运行期的临时内存分配。但是，上一小节的代码仍然面临以下两个问题：

只能编写形如A=B+C的表达式，不能编写类似A=B+C+D等更加复杂的表达式
如果想添加更多的表达式，就需要编写更多的operator=来执行相应的计算

上述问题的解决方法就是使用模板编程。我们将BinaryAddExp实现成一个模板类，它保存的两个操作数都是模板，这样就能够实现任意长度的加法表达式，具体代码如下。

// Example code, expression template, and more length equations// for simplicity, we use struct and make all members public#include <cstdio>// this is expression, all expressions must inheritate it,//  and put their type in subtypetemplate<typename SubType>struct Exp {  // returns const reference of the actual type of this expression  inline const SubType& self(void) const {    return *static_cast<const SubType*>(this);  }};// binary add expression// note how it is inheritates from Exp// and put its own type into the template argumenttemplate<typename TLhs, typename TRhs>struct BinaryAddExp: public Exp<BinaryAddExp<TLhs, TRhs> > {  const TLhs &lhs;  const TRhs &rhs;  BinaryAddExp(const TLhs& lhs, const TRhs& rhs)      : lhs(lhs), rhs(rhs) {}  // evaluation function, evaluate this expression at position i  inline float Eval(int i) const {    return lhs.Eval(i) + rhs.Eval(i);  }};// no constructor and destructor to allocate// and de-allocate memory, allocation done by userstruct Vec: public Exp<Vec> {  int len;  float* dptr;  Vec(void) {}  Vec(float *dptr, int len)      :len(len), dptr(dptr) {}  // here is where evaluation happens  template<typename EType>  inline Vec& operator= (const Exp<EType>& src_) {    const EType &src = src_.self();    for (int i = 0; i < len; ++i) {      dptr[i] = src.Eval(i);    }    return *this;  }  // evaluation function, evaluate this expression at position i  inline float Eval(int i) const {    return dptr[i];  }};// template add, works for any expressionstemplate<typename TLhs, typename TRhs>inline BinaryAddExp<TLhs, TRhs>operator+(const Exp<TLhs> &lhs, const Exp<TRhs> &rhs) {  return BinaryAddExp<TLhs, TRhs>(lhs.self(), rhs.self());}const int n = 3;int main(void) {  float sa[n] = {1, 2, 3};  float sb[n] = {2, 3, 4};  float sc[n] = {3, 4, 5};  Vec A(sa, n), B(sb, n), C(sc, n);  // run expression, this expression is longer:)  A = B + C + C;  for (int i = 0; i < n; ++i) {    printf("%d:%f == %f + %f + %f\\n", i,           A.dptr[i], B.dptr[i],           C.dptr[i], C.dptr[i]);  }  return 0;}

代码的关键思想是模板Exp<SubType>将其派生类SubType的类型作为模板参数，因此它可以通过self()将其自身转换为SubType，这种模式被称为奇异递归模板模式，简称CRTP。BinaryAddExp现在是一个模板类，可以将多个表达式复合在一起。真正的计算操作是通过函数Eval完成的，该函数在BinaryAddExp中以递归方式实现，operator=中的函数调用src.Eval(i)会被编译成B.dptr[i] + C.dptr[i] + C.dptr[i]。

灵活性

前面的示例让我们领略到模板编程的强大功能，而这最后一个示例则与MShadow的实现更为接近，它允许用户自定义二元操作符。

// Example code, expression template// with binary operator definition and extension// for simplicity, we use struct and make all members public#include <cstdio>// this is expression, all expressions must inheritate it,// and put their type in subtypetemplate<typename SubType>struct Exp{  // returns const reference of the actual type of this expression  inline const SubType& self(void) const {    return *static_cast<const SubType*>(this);  }};// binary operatorsstruct mul{  inline static float Map(float a, float b) {    return a * b;  }};// binary add expression// note how it is inheritates from Exp// and put its own type into the template argumenttemplate<typename OP, typename TLhs, typename TRhs>struct BinaryMapExp: public Exp<BinaryMapExp<OP, TLhs, TRhs> >{  const TLhs& lhs;  const TRhs& rhs;  BinaryMapExp(const TLhs& lhs, const TRhs& rhs)      :lhs(lhs), rhs(rhs) {}  // evaluation function, evaluate this expression at position i  inline float Eval(int i) const {    return OP::Map(lhs.Eval(i), rhs.Eval(i));  }};// no constructor and destructor to allocate and de-allocate memory// allocation done by userstruct Vec: public Exp<Vec>{  int len;  float* dptr;  Vec(void) {}  Vec(float *dptr, int len)      : len(len), dptr(dptr) {}  // here is where evaluation happens  template<typename EType>  inline Vec& operator=(const Exp<EType>& src_) {    const EType &src = src_.self();    for (int i = 0; i < len; ++i) {      dptr[i] = src.Eval(i);    }    return *this;  }  // evaluation function, evaluate this expression at position i  inline float Eval(int i) const {    return dptr[i];  }};// template binary operation, works for any expressionstemplate<typename OP, typename TLhs, typename TRhs>inline BinaryMapExp<OP, TLhs, TRhs>F(const Exp<TLhs>& lhs, const Exp<TRhs>& rhs) {  return BinaryMapExp<OP, TLhs, TRhs>(lhs.self(), rhs.self());}template<typename TLhs, typename TRhs>inline BinaryMapExp<mul, TLhs, TRhs>operator*(const Exp<TLhs>& lhs, const Exp<TRhs>& rhs) {  return F<mul>(lhs, rhs);}// user defined operationstruct maximum{  inline static float Map(float a, float b) {    return a > b ? a : b;  }};const int n = 3;int main(void) {  float sa[n] = {1, 2, 3};  float sb[n] = {2, 3, 4};  float sc[n] = {3, 4, 5};  Vec A(sa, n), B(sb, n), C(sc, n);  // run expression, this expression is longer:)  A = B * F<maximum>(C, B);  for (int i = 0; i < n; ++i) {    printf("%d:%f == %f * max(%f, %f)\\n",           i, A.dptr[i], B.dptr[i], C.dptr[i], B.dptr[i]);  }  return 0;}

这段代码与上一小节代码的主要区别是模板类BinaryMapExp可以接受任意类型的二元操作符，要求是该操作符必须要实现一个Map函数。个人理解，第62行实现的那个F函数主要是为了编写代码方便，如果没有它，那么第69行就要写成BinaryMapExp<mul, TLhs, TRhs>(lhs.self(), rhs.self());，写起来就比较麻烦。其他的地方基本上与前一小节的代码差不多，稍微一看就能明白。

小结

综上所述，表达式模板基本工作原理包括以下几点：

延迟计算，允许我们提前知道运算符和目标变量
组合模板与递归计算，允许我们执行任意element-wise操作的复合表达式
由于模板和内联的存在，表达式模板能够像编写for循环那样高效的实现element-wise的计算