Solutions to IT problems

Solutions I found when learning new IT stuff


Machine Learning for Beginners


Introduction

This article is a very brief introduction to machine learning. It does not contain any mathematics or explanations of how machine learning works. In fact, I am a complete novice myself regarding the mathematical background of machine learning. I will show a basic machine learning example using KNIME and the Iris data set. This data set contains measurements of 3 types (classes) of Iris flowers. With this data you can create a model and then determine the type of a flower just by measuring it. Be aware that this data set is very clean and simple. In real-life examples, the data would need to be cleaned before it can be used for machine learning. This data preparation usually takes most of the work, roughly 80-95% of it.

I will use an example of supervised machine learning. In supervised machine learning you pass a part of your data, the training set, to a machine learning algorithm, which then uses this data to build a model. With this model you can then predict to which class an unknown Iris flower belongs. This is called classification. To test how good the model is, the other part of the data set, the validation set, is passed into a predictor. The predictor uses the model to determine the class of each flower. Then the predicted and the actual class of each flower are compared. Simply put, the more matches there are, the better the model is.
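
For readers who like to see the idea in code, here is a minimal Python sketch of the same train, predict and compare cycle. It is not what KNIME does internally. The rows are made-up values in the style of the Iris data set, the function names `split`, `learn` and `predict` are my own, and the "model" is a simple nearest-centroid classifier standing in for the decision tree used later in this article.

```python
import random
from collections import defaultdict

# Illustrative rows: (petal length in cm, petal width in cm, class label).
# The values are made up in the style of the Iris data set, not read from the file.
ROWS = [
    (1.4, 0.2, "setosa"), (1.3, 0.2, "setosa"), (1.5, 0.2, "setosa"),
    (1.7, 0.4, "setosa"), (1.4, 0.3, "setosa"), (1.6, 0.2, "setosa"),
    (4.7, 1.4, "versicolor"), (4.5, 1.5, "versicolor"), (4.0, 1.3, "versicolor"),
    (4.6, 1.5, "versicolor"), (3.9, 1.4, "versicolor"), (4.2, 1.3, "versicolor"),
    (6.0, 2.5, "virginica"), (5.9, 2.1, "virginica"), (5.6, 1.8, "virginica"),
    (5.8, 2.2, "virginica"), (6.1, 1.9, "virginica"), (5.5, 2.1, "virginica"),
]


def split(rows, train_fraction, seed):
    """Shuffle each class separately, then cut it into training and validation rows."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[-1]].append(row)
    rng = random.Random(seed)  # fixed seed: the same split on every run
    train, validation = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        validation.extend(members[cut:])
    return train, validation


def learn(train):
    """Build the 'model': the mean measurements (centroid) of each class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for length, width, label in train:
        entry = sums[label]
        entry[0] += length
        entry[1] += width
        entry[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(model, length, width):
    """Return the class whose centroid is closest to the measured flower."""
    def squared_distance(label):
        mean_length, mean_width = model[label]
        return (mean_length - length) ** 2 + (mean_width - width) ** 2
    return min(model, key=squared_distance)


train, validation = split(ROWS, 0.5, seed=42)
model = learn(train)

# Compare the predicted class with the actual class of every validation row.
matches = sum(predict(model, l, w) == label for l, w, label in validation)
accuracy = matches / len(validation)
print(f"{matches} of {len(validation)} validation rows predicted correctly")
```

The model only ever sees the training rows. The validation rows are held back so that the comparison at the end says something about flowers the model has not seen before.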

The KNIME workflow created in this article is very simple and can be built in less than 5 minutes if you already know KNIME. If you have never used KNIME before, please take some time to familiarize yourself with the application. Because you are interested in machine learning, I assume you are pretty good at working with computers and will quickly understand how KNIME works, either by yourself or by reading a tutorial.

Prerequisites

Please download and install KNIME.

KNIME [naim] is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes), including those of the KNIME community and its extensive partner network.

With KNIME you can do data preparation and machine learning in a graphical workbench. No programming skills are required at all. In KNIME you have so-called nodes. A node can read, manipulate, visualize or write data. Nodes can be connected together to build a workflow. A KNIME workflow usually has a reader node that reads in the data, then several data manipulation nodes and finally a node that exports or visualizes the results.

Building the KNIME workflow

  1. Create new workflow

    create KNIME workflow

    A pop-up will appear on which you can give the workflow a name.

  2. Read the Iris data set text file

    Add Reader

    After adding the reader node you will need to configure it. Double-click on it and the configuration dialog will open. In the valid URL field enter the path to the Iris data set file or browse to it. Note that the file is shipped with the KNIME installation and is in the KNIME directory in the folder IrisDataset.

    configure reader

    After the reader is configured it will turn yellow and is ready for execution. Right-click on the node and select “Execute”.

    execute reader

    If the reader node executed successfully it should turn green and you can see the data in the output port by right-clicking on the node and selecting “File Table”.

    executed reader

  3. Prepare the data

    To tell the machine learning algorithm about all discrete values in the class column, they need to be determined using the Domain Calculator node. Please add it to the workflow and then connect the reader node to it.

    connecting nodes

    The node will be auto-configured. You can directly execute it.

    domain calculator

    In the next step we will partition the data into the training set and the validation set. The training set is used to create the model and the validation set is used to test how good the model is. Add the Partitioning node to the workflow, connect it with the Domain Calculator and then configure it.

    Partitioning configuration

    Stratified Sampling will evenly distribute the different classes to the training and validation sets. The relative amount can be adjusted, but this will affect the model. Also, if you leave “Use random seed” unchecked, the training and validation sets will differ with each run, and hence so will the accuracy of the model. To keep a fixed training set, please check it. After configuring the node, execute it.
    Partitioning executed

  4. Build and use the model

    We will use a decision tree with the standard configuration for building the model. Add the Decision Tree Learner node to the workflow and connect the top port of the Partitioning node to it. Then execute the learner.

    executed learner

    Now add the Decision Tree Predictor node and connect the lower port of the Partitioning node to it. Then connect the model output port of the learner to the predictor.

    executed predictor

  5. Validate the model

    Now we need to check how well our model can predict the correct flower class. To do so, add the Scorer node and connect the predictor node to it. Then configure the Scorer node.

    configure scorer

    Execute the scorer. Then right-click on it and open the accuracy statistics. This will display information on the performance of your model.

    accuracy statistics

    The question is: what is a good model? Simply put, you will want a high accuracy. For this model, that is enough. In other, more complex scenarios you might need to especially avoid either false positives or false negatives. A model with lower accuracy but no false negatives can then be better than a model with higher accuracy that produces false negatives. Context matters a lot. An example of this would be an HIV test. False positives in small numbers are not that bad because you can redo the test, and the second result will almost certainly be negative. A false negative, however, is unacceptable: you won’t redo the test and, even worse, you might then infect someone else.
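
The trade-off between accuracy and false negatives can be made concrete with a few lines of Python. This is only a sketch. The ten test results are hypothetical numbers I made up for illustration, and `confusion_counts` and `accuracy_of` are my own helper names, not part of KNIME or any library.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy_of(counts):
    """Share of all predictions that match the actual class."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical outcome for 10 tested persons: 2 infected, 8 healthy.
actual = ["positive"] * 2 + ["negative"] * 8

# Cautious model: finds both infections but raises 3 false alarms.
cautious = ["positive"] * 2 + ["positive"] * 3 + ["negative"] * 5

# Optimistic model: no false alarms but misses one infection.
optimistic = ["positive", "negative"] + ["negative"] * 8

for name, predicted in (("cautious", cautious), ("optimistic", optimistic)):
    counts = confusion_counts(actual, predicted)
    print(name, counts, f"accuracy={accuracy_of(counts):.0%}")
```

In this made-up example the optimistic model has the higher accuracy, yet it is the one that misses an infection. Which of the two counts as the better model depends on which kind of error is more costly in your context.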

Download KNIME workflow

After creating the workflow you can play with the settings in the learner node or change the relative amount in the Partitioning node and see how it affects the accuracy of the model. KNIME also has different algorithms for creating models. You could alternatively try a neural network or a support vector machine. For this data all of the algorithms work pretty well. You can also install the Weka extension for KNIME and get access to a wide range of machine learning functionality from Weka.

neural network

If you have a chemistry background you might also be interested in the KNIME Labs Decision Tree Ensemble extension. The Tree Ensemble Learner it contains can use fingerprints (bit vectors) for creating models. This is especially cool because KNIME is chemistry-aware. It can read SD files, and you can generate the fingerprints directly in KNIME by using, for example, the RDKit extension.

I hope this article helped you get started with machine learning. Cheers.

Written by kienerj

May 8, 2014 at 14:44

Posted in Tools
