Using machine learning to solve hard problems and build profitable businesses is almost mainstream now. This rise has been accompanied by several toolkits, frameworks and libraries that make developers’ jobs easier. Data-driven businesses usually run into one of two problems:
- Lack of data
- Too much data
In the first case, there are tools and approaches, often tedious, to scrape and gather data. In the latter case, a data surge brings its own set of problems, ranging from feature engineering to storage to computational overkill.
Developers from Apache, Nvidia and the deep learning research community have tried to ease the burden of sprawling AI pipelines by building libraries that kick off multiple computations in a single line.
Here are a few libraries that come in handy when dealing with large-scale AI projects:
Ray Tune
Developed in the labs of Berkeley AI, Tune was built to address the shortcomings of ad-hoc experiment execution tools by leveraging the Ray Actor API and adding failure handling.
Tune uses a master-worker architecture to centralize decision-making and communicates with its distributed workers using the Ray Actor API.
Ray provides an API that enables classes and objects to be used in parallel and distributed settings.
Tune uses a Trainable class interface to define an actor class specifically for training models. This interface exposes methods such as _train, _stop, _save, and _restore, which allow Tune to monitor intermediate training metrics and kill low-performing trials. (Later Ray releases renamed these methods to step, cleanup, save_checkpoint, and load_checkpoint.)
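The pattern described above can be sketched in plain Python. This sketch does not use Ray or Tune: the class QuadraticTrainable, the function run_trials, and the "drop the worse half after each round" rule are hypothetical stand-ins, with methods named after the _train, _save, and _restore hooks. It assumes a scheduler that polls every trial for an intermediate metric and stops the low performers.

```python
class QuadraticTrainable:
    """Hypothetical stand-in for a Tune-style trainable (not Ray's class)."""

    def __init__(self, config):
        self.lr = config["lr"]
        self.x = 10.0          # the single parameter being optimised
        self.iteration = 0

    def _train(self):
        # One training step: gradient descent on f(x) = x^2.
        self.x -= self.lr * 2 * self.x
        self.iteration += 1
        return {"loss": self.x ** 2, "iteration": self.iteration}

    def _save(self):
        # Return everything needed to resume this trial later.
        return {"x": self.x, "iteration": self.iteration}

    def _restore(self, checkpoint):
        self.x = checkpoint["x"]
        self.iteration = checkpoint["iteration"]


def run_trials(configs, rounds=3, steps_per_round=5):
    """Train all trials, dropping the worse half after every round."""
    trials = {name: QuadraticTrainable(cfg) for name, cfg in configs.items()}
    results = {}
    for _ in range(rounds):
        for name, trial in trials.items():
            for _ in range(steps_per_round):
                results[name] = trial._train()
        if len(trials) > 1:
            ranked = sorted(trials, key=lambda n: results[n]["loss"])
            keep = ranked[: max(1, len(ranked) // 2)]
            trials = {n: trials[n] for n in keep}
    best = min(trials, key=lambda n: results[n]["loss"])
    return best, results


best, results = run_trials(
    {"slow": {"lr": 0.01}, "medium": {"lr": 0.1}, "fast": {"lr": 0.4}}
)
```

In this toy run the trials with the smaller learning rates are stopped after the first round, so only the surviving trial keeps consuming compute. That is the saving a real scheduler aims for at much larger scale.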
Dask For ML
Dask can address the problems of long training times and large datasets. Dask-ML makes it easy to use normal Dask workflows to prepare and set up data; it then deploys XGBoost or TensorFlow alongside Dask and hands the data over.
In all cases Dask-ML endeavours to provide a single unified interface around the familiar NumPy, Pandas, and Scikit-Learn APIs. Users familiar with Scikit-Learn should feel at home with Dask-ML.
Dask-ML also provides counterparts to scikit-learn’s hyperparameter search tools, such as GridSearchCV and RandomizedSearchCV.
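To make the idea of a parallel hyperparameter search concrete, here is a minimal sketch using only the Python standard library. It is not Dask-ML code: expand_grid, toy_score, and grid_search are hypothetical names, a thread pool stands in for a cluster of workers, and the scoring function is an invented placeholder for cross-validated accuracy.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Yield every combination of parameters, as a grid search would."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))


def toy_score(params):
    # Hypothetical score standing in for cross-validated accuracy;
    # it peaks at C=1.0, gamma=0.1.
    return -((params["C"] - 1.0) ** 2) - ((params["gamma"] - 0.1) ** 2)


def grid_search(param_grid, score_fn, max_workers=4):
    """Score all candidates in parallel and return the best one."""
    candidates = list(expand_grid(param_grid))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


best_params, best_score = grid_search(
    {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}, toy_score
)
```

Each candidate is scored independently of the others, which is why this kind of search spreads so naturally across many workers.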
PyFlink
PyFlink is the Python API for Apache Flink, a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Precise control of time and state enables Flink’s runtime to run any kind of application on unbounded streams.
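A rough illustration of what "stateful computation over an unbounded stream" means, written as a plain Python generator rather than with the PyFlink API. The function name tumbling_window_counts is hypothetical, and the sketch assumes events arrive in timestamp order, so it ignores the out-of-order handling a real stream processor has to provide.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size):
    """Count events per key in fixed, non-overlapping event-time windows.

    `events` is any iterable (possibly endless) of (timestamp, key) pairs,
    assumed to arrive in timestamp order. A window's result is emitted as
    soon as an event belonging to a later window is seen.
    """
    counts = defaultdict(int)   # the operator's state: key -> running count
    current_window = None
    for timestamp, key in events:
        window = timestamp // window_size
        if current_window is not None and window != current_window:
            yield current_window * window_size, dict(counts)
            counts.clear()
        current_window = window
        counts[key] += 1
    # Only reached for bounded input: flush the last open window.
    if current_window is not None:
        yield current_window * window_size, dict(counts)


events = [(0, "a"), (3, "b"), (7, "a"), (10, "a"), (12, "b"), (25, "a")]
windows = list(tumbling_window_counts(events, window_size=10))
```

The same generator works on a bounded list and on an endless source, because it only ever holds the state of the window that is currently open.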
Kafka-python
Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
kafka-python is a Python client for the Apache Kafka distributed stream processing system. It is designed to function much like the official Java client, with a sprinkling of Pythonic interfaces (e.g., consumer iterators).
>>> pip install kafka-python
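The consumer-iterator style mentioned above means a consumer can be used directly in a for loop, with each record exposing its payload as bytes in a value attribute. The sketch below shows a small helper written against that shape. The helper decode_json_messages, the FakeMessage type, the topic name "events", and the broker address are all assumptions for illustration, and the in-memory list stands in for a real consumer so the example runs without a broker.

```python
import json
from collections import namedtuple
from itertools import islice


def decode_json_messages(consumer, limit=None):
    """Yield decoded JSON payloads from an iterable of Kafka-style messages.

    Each message only needs a `.value` attribute holding UTF-8 JSON bytes.
    `limit` caps how many messages are read; None means read until the
    iterable ends (a live consumer may never end).
    """
    for message in islice(consumer, limit):
        yield json.loads(message.value.decode("utf-8"))


# In-memory stand-in so the sketch runs without a broker (hypothetical type).
FakeMessage = namedtuple("FakeMessage", ["topic", "value"])
fake_consumer = [
    FakeMessage("events", b'{"user": "a", "clicks": 3}'),
    FakeMessage("events", b'{"user": "b", "clicks": 5}'),
]
total_clicks = sum(m["clicks"] for m in decode_json_messages(fake_consumer))

# With a running broker, the same helper could take a real consumer, e.g.:
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
#   for payload in decode_json_messages(consumer):
#       ...
```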
ScaleGraph
Mining graphs to discover hidden knowledge requires particular middleware and software libraries that can harness the full potential of large-scale computing infrastructures such as supercomputers.
The goal of ScaleGraph is to provide large-scale graph analysis algorithms for graph analysts and an efficient distributed computing framework for algorithm developers.
Apache MXNet
MXNet is a deep learning framework designed for both efficiency and flexibility. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scaling effectively to multiple GPUs and multiple machines.
MXNet provides a comprehensive and flexible Python API to serve developers with different levels of experience and wide ranging requirements.
>>> pip install mxnet
cuBLAS
The NVIDIA cuBLAS library is a fast GPU-accelerated implementation of the standard Basic Linear Algebra Subprograms (BLAS). Using cuBLAS APIs, users can speed up their applications by deploying compute-intensive operations to a single GPU, or scale up and distribute work across multi-GPU configurations efficiently.
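The compute-intensive operations in question are BLAS routines, the best known being general matrix multiply (GEMM), which computes C = alpha * A * B + beta * C. The pure-Python reference below only shows what is being computed; it is not a cuBLAS call, and the name gemm_reference is hypothetical. cuBLAS exposes the same operation on the GPU through routines such as cublasSgemm for single precision.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("a's column count must match b's row count")
    if len(c) != rows or any(len(row) != cols for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


result = gemm_reference(
    2, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 3, [[1, 1], [1, 1]]
)
```

The triple loop does rows * cols * inner multiply-adds, which is why this operation dominates the cost of many numerical workloads and is a natural target for GPU acceleration.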
TensorRT
TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.
rstoolbox
rstoolbox is a Python library for large-scale analysis of computational protein design data and structural bioinformatics.
It is aimed at the analysis and management of big populations of protein or nucleotide decoys.
>>> pip install rstoolbox