This week felt like Christmas and my birthday rolled into one! On Monday Google open sourced their deep learning framework, and on Friday Microsoft open sourced their distributed machine learning framework. At first I thought they were competing projects!
After skimming through both projects, I can say they follow different approaches to the problem of distributed machine learning. Microsoft's DMTK (Distributed Machine Learning Toolkit) is an approach to distributing large, high-dimensional data for model training. At first glance, it uses special sampling techniques to create and distribute training data throughout the cluster. The intention of DMTK is to provide a framework on which distributed algorithms can be built, such as the word embedding algorithm included in the project. Right now they do not provide any samples or tutorials.
Google's TensorFlow is designed as a deep learning framework. Deep learning is a relatively new technology which has delivered very good results in handwriting, speech and image recognition. A few years back, deep learning crushed the state-of-the-art techniques on the MNIST handwriting test. Today deep learning is used in speech recognition technologies like Siri, Cortana & Co. Ask your phone, it really works 😉
The tutorials and examples on tensorflow.org are very well written and very insightful! The architecture of TensorFlow is smart: a computation is described as a graph of operations, and data is then applied to that graph. Apache Spark and Apache Flink use a similar concept for data processing. Right now TensorFlow only supports execution on a single computer. Google plans to release a distributed version of TensorFlow that operates in clusters.
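To make the "operations graph" idea a bit more concrete, here is a minimal sketch in plain Python. It is not TensorFlow's API; the names `Node`, `Placeholder`, `add`, `mul` and `run` are made up for illustration. The point it tries to show is only that the graph is built first, without computing anything, and data is applied to it afterwards.

```python
# Minimal sketch of a deferred "operations graph".
# All names here are hypothetical and NOT part of TensorFlow's API.


class Node:
    """One operation in the graph; constructing it performs no computation."""

    def __init__(self, op, inputs=()):
        self.op = op                  # function applied when the graph is run
        self.inputs = tuple(inputs)   # upstream nodes this operation depends on


class Placeholder(Node):
    """A named slot that only receives data when the graph is run."""

    def __init__(self, name):
        super().__init__(op=None)
        self.name = name


def add(a, b):
    return Node(lambda x, y: x + y, (a, b))


def mul(a, b):
    return Node(lambda x, y: x * y, (a, b))


def run(node, feed):
    """Evaluate `node` by recursively evaluating its inputs with the fed data."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    args = [run(inp, feed) for inp in node.inputs]
    return node.op(*args)


# Build the graph once: y = (a + b) * b
a = Placeholder("a")
b = Placeholder("b")
y = mul(add(a, b), b)

# Apply different data to the same graph.
result_1 = run(y, {"a": 2, "b": 3})
result_2 = run(y, {"a": 1.5, "b": 2.0})
```

Because the graph exists as a data structure before any data flows through it, a framework is free to inspect it and decide where each operation should run. That is presumably what makes the step towards GPUs or a cluster possible.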
In my personal opinion, TensorFlow looks a lot like Google's Dataflow with a focus on deep learning. Dataflow competes with Spark. We know that Spark has issues with memory, and because it runs on the JVM it has some CPU overhead. TensorFlow, on the other hand, is written in C++, which should give it better memory usage and CPU utilization (plus it supports GPUs). Where Spark uses mini-batches, TensorFlow uses tensors to transfer and process data. The difference is that tensors only support integers, floating point numbers and strings (the primary data types for machine learning). My personal hypothesis is that TensorFlow was open sourced to define the standard for deep learning (the hottest topic in machine learning right now). If the standard is accepted, then Google's Cloud provides the computation resources for production deployments of TensorFlow. Microsoft does not want to leave the market completely to Google, so they answered by open sourcing their platform. 😉