Kinetica and TensorFlow Mechanics
Kinetica is an insight engine with a GPU-accelerated database that lets you analyze data in real time, as the database engine processes the data in GPU memory. TensorFlow is a library for machine learning and deep learning that can also leverage the power of GPUs. Bringing these two technologies together lets you use the real-time data in your Kinetica database in machine learning applications.
In this blog post we’ll describe the mechanics of making Kinetica work with TensorFlow, using a multilayer neural network trained on the popular MNIST image classification data. We’ll use the Kinetica API and User Defined Functions (UDFs). All required source code (Python) to make the example work can be downloaded here.
Introduction
The problem we’ll solve is to learn a model that represents the patterns of handwritten digits from labeled data, where for each image we know which digit it actually shows. In machine learning terminology this is called training. We’ll then apply this model to new, unlabeled data in order to classify (“predict”), for each input row, which digit it most likely shows. This is what we call inference. We’ll break the process into four parts:
- Ingestion – creating the tables and filling the database for this demo
- Training – learning the model
- Inference – applying the model
- Evaluation – how accurate are our classifications?
Ingestion
The tables we need
For our end-to-end process we need four tables to be set up:
- The table Mnist_training_input will contain all images of handwritten digits, including their labels, that we’ll use to train the model.
- In Mnist_inference_input we’ll set aside a smaller fraction of our original data. The model will later be applied to this data to make the classifications. This table also contains the known labels, so we’ll be able to evaluate how accurate the model is.
- We will output the classification results into the table Mnist_inference_result.
- The model itself will be stored in Mnist_train_output.
Note that we store the model-ID along with the model and the classifications. This potentially makes it easy to train many (thousands of) models and to combine their output in an ensemble.
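To make the ensemble idea concrete, here is a minimal sketch of one common way to combine classifications from several models: a simple majority vote per input row. This is an illustration only and not part of the downloadable example; the model IDs, the function name majority_vote and the sample predictions are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions_by_model):
    """Combine per-model digit predictions into one prediction per row.

    predictions_by_model maps a (hypothetical) model ID to a list of
    predicted digits, one per input row, in the same row order for
    every model.
    """
    # zip(*...) regroups the data so that each tuple holds all votes for one row.
    votes_per_row = zip(*predictions_by_model.values())
    # For each row, keep the digit that received the most votes.
    return [Counter(votes).most_common(1)[0][0] for votes in votes_per_row]


# Hypothetical output of three models for four input rows.
predictions = {
    "model_a": [7, 2, 1, 0],
    "model_b": [7, 2, 7, 0],
    "model_c": [9, 2, 1, 6],
}
combined = majority_vote(predictions)
print(combined)
```

In a real setup the per-model predictions would be read from the classification table, grouped by the stored model-ID.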
Defining types
All table types are defined in the file DatabaseDefinitions.py. For example, the type for the table that stores the model has 5 columns, with the model column in binary format. The script Ingestion.py uses these type definitions to create the tables. This is all pure Kinetica API, no TensorFlow involved yet.
Running the ingestion
To run the table setup and ingestion we simply run the Ingestion.py script. This will store the images in binary format into the respective tables. Note that the label is stored in one-hot format as well as in readable format for convenience.
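The two storage formats mentioned above can be sketched in a few lines of plain Python. This is not the code from Ingestion.py; it assumes an image is a flat list of 784 grayscale values (28 x 28 pixels) in the range 0 to 255 stored as one byte per pixel, and the helper names are hypothetical.

```python
NUM_CLASSES = 10  # digits 0 through 9


def one_hot(digit, num_classes=NUM_CLASSES):
    """Encode a digit as a vector with a single 1.0 at the digit's index."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit out of range: %r" % (digit,))
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


def from_one_hot(vector):
    """Recover the readable label: the index of the largest entry."""
    return max(range(len(vector)), key=vector.__getitem__)


def image_to_bytes(pixels):
    """Pack grayscale pixel values (0-255) into a binary blob, one byte each."""
    return bytes(pixels)


def bytes_to_image(blob):
    """Unpack a binary blob back into a list of pixel values."""
    return list(blob)


# A fake, all-black 28 x 28 image with one bright pixel, labeled as a 3.
image = [0] * 784
image[100] = 255
blob = image_to_bytes(image)
label_vector = one_hot(3)
print(len(blob), label_vector, from_one_hot(label_vector))
```

Keeping the readable label next to the one-hot vector costs one extra column, but it makes queries and visual checks in the UI much easier.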
In the Kinetica UI the result should look like the below screenshot:
Training
Defining a UDF
The UDF for model training is defined in TrainUDF.py. This is pure TensorFlow code, except for the places where we read the data and where we store the model. Those steps are handled by a helper class in KineticaIO.py, a utility that facilitates such tasks. The KineticaIO utility is called from the main() method of TrainUDF.py in only two lines of code (one for reading the data, one for storing the model).
Note that the main method starts with
proc_data = ProcData()
and ends with
proc_data.complete()
This tells Kinetica that the code in between is UDF code. The proc_data handle also gives you access to tables and columns. That is not required in this UDF, but we’ll see an example later in the inference step.
Registering and executing a UDF
Training.py handles the mechanics of registering and executing the UDF. Note that in this example the UDF is registered in “nondistributed” mode. We’ll use distributed mode later in the inference step. You could also distribute the training step, but then you would have to take care of combining the resulting models, which we skip here for simplicity.
Running the training
All we need to do to trigger the training is to run Training.py. The result should be a model stored in the Mnist_train_output table:
Inference
InferenceUDF.py
Similar to the training step, we put the TensorFlow code for applying the model into a UDF, between the creation of the proc_data handle and the call to proc_data.complete(). This time we also use the handle to access the table names of the input and output data (which are “Mnist_inference_input” and “Mnist_inference_output” in this example). You can see how the different columns of the output table are accessed and filled with data, such as the actual classifications, the model-ID and the known labels.
Running the inference
We just need to trigger Inference.py. The inference is configured to run in distributed mode. The result should be the “Mnist_inference_output” table filled with the classifications, along with the true labels, so we’ll be able to evaluate the result.
Evaluation
KiSQL
The KiSQL interface in the Kinetica UI can be used for quick and easy evaluation. For example, the query below shows for each digit how many times it was classified correctly vs. incorrectly:
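The same per-digit tally can also be sketched outside the database in plain Python. This is an illustrative stand-in, not the KiSQL query from this post; it assumes the results are available as (true label, predicted label) pairs, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def per_digit_counts(rows):
    """Count correct and incorrect classifications for each true digit.

    rows is an iterable of (true_label, predicted_label) pairs.
    """
    counts = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for true_label, predicted in rows:
        outcome = "correct" if true_label == predicted else "incorrect"
        counts[true_label][outcome] += 1
    # Return a plain dict ordered by digit for readable output.
    return dict(sorted(counts.items()))


# Hypothetical inference results: (true label, predicted label).
sample_rows = [(8, 8), (8, 3), (9, 9), (9, 4), (9, 9), (1, 1)]
tally = per_digit_counts(sample_rows)
for digit, outcome in tally.items():
    print(digit, outcome["correct"], outcome["incorrect"])
```

In SQL terms this corresponds to grouping by the true label and counting the rows where the prediction matches and where it does not.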
Reveal
For visual evaluation, we can use Reveal, the visualization framework within Kinetica. For example, a Sankey chart is a good tool to display how correct and incorrect classifications are distributed. From the visualization below, we can quickly see that most digits are classified correctly, but that with our current model parameters the digits 8 and 9 are the main troublemakers:
We hope you enjoyed this post. If you want to get started in your local environment you can download the Kinetica trial version here:
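The data behind such a Sankey chart is essentially a list of flows from each true digit to each predicted digit. As a rough sketch (not how Reveal computes it internally), the flows and the most error-prone digits could be derived like this; the function names and sample rows are hypothetical.

```python
from collections import Counter


def classification_flows(rows):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter((true_label, predicted) for true_label, predicted in rows)


def error_counts_by_digit(flows):
    """Sum the misclassified rows per true digit."""
    errors = Counter()
    for (true_label, predicted), count in flows.items():
        if true_label != predicted:
            errors[true_label] += count
    return errors


# Hypothetical inference results: (true label, predicted label).
results = [(8, 8), (8, 3), (9, 4), (9, 9), (9, 7), (1, 1)]
flows = classification_flows(results)
errors = error_counts_by_digit(flows)
# Digits with the most misclassifications come first.
print(errors.most_common())
```

Each entry in flows maps to one band in the chart, with the count as the band width.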
https://archive.kinetica.com/trial/
Two more useful pages to look into are the Kinetica API documentation (Python) in general:
https://archive.kinetica.com/docs/api/python/index.html
and the documentation on writing custom functions (UDFs) in particular:
https://archive.kinetica.com/docs/udf/python/writing.html
Stay tuned for further data science related posts!