This project has retired. For details please refer to its Attic page.

Lens API for Machine Learning (Lens-ML)

Please Note - This is an experimental feature - API is subject to change

Introduction

The Lens-ML component allows users to call machine learning libraries as UDFs in Lens queries. At present, we integrate the MLLib library provided in Apache Spark.

With this feature, users can build machine learning models backed by Hive tables, and later call these models via Hive UDFs. Actual model building is done in Spark. Lens takes care of initializing and invoking the Spark code (Lens acts as a Spark shell).

Lens-ML Server API

The Lens-ML component provides a set of REST endpoints for listing, creating and validating machine learning models and algorithms. Some of them are described below:

  • /ml/algorithms - Gives a list of machine learning algorithms which are supported by Lens.
  • /ml/train - Creates a model using an algorithm and backed by a Hive table. This also takes parameters such as the list of columns which act as feature variables, the name of the label column (in the case of supervised learning) and algorithm-specific tuning parameters. This call creates a Spark RDD backed by the Hive table using HCatInputFormat, and runs the corresponding trainer in Spark. The eventual model object returned by the Spark environment is associated with a unique ID, and is persisted to HDFS.
  • /ml/test - Evaluates a model created in the /train call against a Hive table, and stores the results in another Hive table. This call takes the model ID and a Hive table which is used for evaluation purposes. It is assumed that the evaluation table contains all columns (feature and label columns) which were used to generate the model. It creates a Lens query against the evaluation table and applies the model UDF to each record in the table. The output of the query is associated with a test report ID, and stored as a partition (with the same ID) in an output table. This call returns the report ID to the user.
  • /ml/models - Retrieves model metadata for an algorithm and by model ID.
  • /ml/reports - Retrieves model evaluation reports by algorithm or report ID.
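
As a rough illustration of how a client might address these endpoints, the following Python sketch builds the URL and form body for a training call. The base URL and every form field name (algorithm, table, feature, label) are assumptions made for this example. The text above does not spell out the actual parameter names, so they should be checked against the server's resource definitions.

```python
from urllib.parse import urlencode, urljoin

# Hypothetical base URL of a Lens server; adjust to your deployment.
LENS_BASE_URL = "http://localhost:9999/lensapi/"


def ml_url(endpoint, base_url=LENS_BASE_URL):
    """Return the absolute URL of an /ml/<endpoint> resource."""
    return urljoin(base_url, "ml/" + endpoint.lstrip("/"))


def build_train_request(table, algorithm, features, label=None, tuning=None):
    """Assemble the URL and form-encoded body for a training call.

    The field names used here are illustrative assumptions,
    not the documented parameter names of the Lens-ML API.
    """
    fields = [("algorithm", algorithm), ("table", table)]
    # One repeated field per feature column.
    fields += [("feature", column) for column in features]
    # The label column only applies to supervised learning.
    if label is not None:
        fields.append(("label", label))
    # Algorithm-specific tuning parameters, in a stable order.
    fields += sorted((tuning or {}).items())
    return ml_url("train"), urlencode(fields)
```

The returned pair can be handed to any HTTP client. Keeping URL construction separate from the actual network call makes the request easy to inspect before it is sent.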

Lens-ML Client API

We also provide a client library to call the server-side REST API. The LensML interface can be used to invoke the server-side REST APIs. The same interface can also be used by other Lens server modules to invoke these APIs in the Lens server process itself. To access an instance of the Lens-ML service in the Lens server, users can use the ServiceProvider interface.

Model UDF

Each train call creates an ML model which can be identified by a unique ID. This model can be invoked in any Hive query using the predict UDF. This UDF takes the model ID and the feature and label column names. Once a model is generated, it can be invoked in any Lens query (at the time of writing, only queries which run on the Hive backend are supported) without having to invoke Spark again.
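
To make the shape of such a query concrete, here is a minimal Python sketch that composes a query string calling the predict UDF. The argument order (model ID first, then feature columns, then the label column) simply follows the description above and is an assumption, as are the table, column and model ID values. The exact UDF signature should be confirmed before use.

```python
import re

# Conservative patterns so that only plain names end up in the query text.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_MODEL_ID = re.compile(r"[A-Za-z0-9_.-]+")


def _check(pattern, value, kind):
    """Return value unchanged if it matches pattern, else raise ValueError."""
    if not pattern.fullmatch(value):
        raise ValueError(f"invalid {kind}: {value!r}")
    return value


def build_predict_query(table, model_id, features, label):
    """Compose a query applying the predict UDF to every row of a table.

    The argument order passed to predict() is an assumption for illustration.
    """
    columns = [_check(_IDENTIFIER, c, "column name") for c in list(features) + [label]]
    model_literal = "'" + _check(_MODEL_ID, model_id, "model ID") + "'"
    args = ", ".join([model_literal] + columns)
    source = _check(_IDENTIFIER, table, "table name")
    return f"SELECT predict({args}) FROM {source}"
```

Validating names before interpolating them keeps malformed or hostile input out of the generated query.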

Deployment

  • Enable the ML service by adding it to the list of services brought up by lens-server (lens.server.servicenames).
  • Enable the ML resource by adding it to the list of resources brought up by lens-server (lens.server.ws.resourcenames).
  • Lens-ML specific configuration:
      • ML Driver - lens.ml.drivers - Currently only org.apache.lens.ml.spark.SparkMLDriver is available.
      • Spark master setting - lens.ml.sparkdriver.spark.master - Specifies the mode in which Spark will be invoked. This can take any value accepted by the Spark master config variable in Apache Spark (for local, standalone, and YARN modes).

    An example config is provided in lens-ml-lib/resources/test/lens-site.xml
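
The settings listed above could be rendered into a Hadoop-style site file along the following lines. This is a sketch only: the property names come from the list above, but the values given for the service and resource lists (including the name "ml") and the "local" master are placeholders assumed for illustration. The example config mentioned above is the place to find the real values.

```python
import xml.etree.ElementTree as ET

# Property names are taken from the deployment notes; values marked
# "assumed" are placeholders and must be replaced with real ones.
ML_PROPERTIES = {
    "lens.server.servicenames": "session,query,metastore,ml",    # assumed value
    "lens.server.ws.resourcenames": "session,query,metastore,ml",  # assumed value
    "lens.ml.drivers": "org.apache.lens.ml.spark.SparkMLDriver",
    "lens.ml.sparkdriver.spark.master": "local",                 # assumed value
}


def render_site_xml(properties):
    """Render a name/value mapping as <configuration><property>... XML text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(ML_PROPERTIES))
```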

    Spark dependencies need to be provided at run time when starting the Lens server. Care should be taken to make sure that the Spark version matches the Hadoop and Hive versions used to build Apache Lens. Users may have to rebuild Spark for their version of Hadoop. The SPARK_HOME environment variable should be set so that it points to the Spark installation.
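
A small pre-flight check along these lines can catch a missing or wrong SPARK_HOME before the server is started. It is a hypothetical helper, not part of Lens, and it only verifies that the variable is set and names a directory. It does not check version compatibility with Hadoop or Hive.

```python
import os


def check_spark_home(env=None):
    """Return None if SPARK_HOME looks usable, else a short error message."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return "SPARK_HOME is not set"
    if not os.path.isdir(spark_home):
        return f"SPARK_HOME does not point to a directory: {spark_home}"
    return None


if __name__ == "__main__":
    problem = check_spark_home()
    print(problem or "SPARK_HOME looks fine")
```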