Solutions to IT problems

Solutions I found when learning new IT stuff

Archive for the ‘Tools’ Category

Machine Learning for Beginners


Introduction

This article is a very brief introduction to machine learning. It does not contain any mathematics or explanations of how machine learning works. In fact, I am a complete novice myself regarding the mathematical background of machine learning. I will show a basic machine learning example using KNIME and the Iris data set. This data set contains measurements of 3 types (classes) of Iris flowers. With this data you can create a model and then determine the type of a flower just by measuring it. Be aware that this data set is very clean and simple. In real-life examples, the data would need to be cleaned before it can be used for machine learning. This data preparation usually takes most of the work, something like 80-95% of the total effort.

I will use an example of supervised machine learning. In supervised machine learning you pass one part of your data, the training set, to a machine learning algorithm, which uses it to build a model. With this model I can then predict to which class an unknown Iris flower belongs. This is called classification. To test how good the model is, the other part of the data set, the validation set, is passed into a predictor. The predictor uses the model to determine the class of each flower. Then the predicted and the actual class of each flower are compared. Simply put, the more matches there are, the better the model is.
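
The train, predict and compare cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements and the class labels "A" and "B" are made up (they are not rows from the Iris data set), and a simple nearest-neighbour rule stands in for the decision tree that is built with KNIME later in this article.

```python
import math

# Made-up (length, width) measurements with hypothetical class labels.
# These are NOT rows from the real Iris data set.
TRAINING = [
    ((1.4, 0.2), "A"),
    ((1.5, 0.3), "A"),
    ((4.5, 1.5), "B"),
    ((4.7, 1.4), "B"),
]
VALIDATION = [
    ((1.3, 0.2), "A"),
    ((4.6, 1.3), "B"),
    ((3.2, 1.0), "A"),  # lies closer to the "B" examples, so it is misclassified
]


def predict(features, training):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(training, key=lambda row: math.dist(row[0], features))
    return nearest[1]


def accuracy(validation, training):
    """Fraction of validation rows whose predicted class matches the actual class."""
    matches = sum(
        1 for features, label in validation
        if predict(features, training) == label
    )
    return matches / len(validation)


print(accuracy(VALIDATION, TRAINING))  # 2 of the 3 validation rows match
```

The "model" here is nothing more than the stored training rows. A real learner such as a decision tree builds a more compact representation, but the way it is validated is the same.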

The KNIME workflow created in this article is very simple and can be built in less than 5 minutes if you already know KNIME. If you have never used KNIME before, please take some time to familiarize yourself with the application. Because you are interested in machine learning, I assume you are pretty good at working with computers and will quickly understand how KNIME works, either by yourself or by reading a tutorial.

Prerequisites

Please download and install KNIME.

KNIME [naim] is a user-friendly graphical workbench for the entire analysis process: data access, data transformation, initial investigation, powerful predictive analytics, visualisation and reporting. The open integration platform provides over 1000 modules (nodes), including those of the KNIME community and its extensive partner network.

With KNIME you can do data preparation and machine learning in a graphical workbench. No programming skills are required at all. In KNIME you have so-called nodes. A node can read, manipulate, visualize or write data. Nodes can be connected together to build a workflow. A KNIME workflow usually has a reader node that reads in the data, then several data manipulation nodes and finally a node that exports or visualizes the results.

Building the KNIME workflow

  1. Create new workflow

    create KNIME workflow

    A pop-up will appear on which you can give the workflow a name.

  2. Read the Iris data set text file

    Add Reader

    After adding the reader node you will need to configure it. Double-click on it and the configuration dialog will open. In the valid URL field enter the path to the Iris data set file or browse to it. Note that the file is shipped with the KNIME installation and is in the KNIME directory in the folder IrisDataset.

    configure reader

    After the reader is configured it will turn yellow and is ready for execution. Right-click on the node and select “Execute”.

    execute reader

    If the reader node executed successfully it should turn green and you can see the data in the output port by right-clicking on the node and selecting “File Table”.

    executed reader

  3. Prepare the data

    To tell the machine learning algorithm about all discrete values in the class column, they need to be determined using the Domain Calculator node. Please add it to the workflow and then connect the reader node to it.

    connecting nodes

    The node will be auto-configured. You can directly execute it.

    domain calculator

    In the next step we will partition the data into the training set and the validation set. The training set is used to create the model and the validation set is used to test how good the model is. Add the Partitioning node to the workflow, connect it with the Domain Calculator and then configure it.

    Partitioning configuration

    Stratified sampling keeps the proportion of each class the same in the training and the validation set. The relative amount can be adjusted, but this will affect the model. Also, if you leave “Use random seed” unchecked, the training and validation sets will differ with each run, and hence so will the accuracy of the model. To keep a fixed training set, please check it. After configuring the node, execute it.
    Partitioning executed
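
For readers who wonder what stratified sampling with a fixed seed amounts to, here is a minimal Python sketch. It is not KNIME's implementation; the function name `stratified_split`, the row layout and the seed value are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction, seed):
    """Split rows so that every class contributes the same fraction to training.

    label_of is a function that returns the class of a row. A fixed seed
    makes the split reproducible from run to run.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    training, validation = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)  # random order, but repeatable for the same seed
        cut = round(len(members) * train_fraction)
        training.extend(members[:cut])
        validation.extend(members[cut:])
    return training, validation


# Hypothetical rows: (row id, class label), 10 rows per class.
rows = [(i, label) for label in ("a", "b", "c") for i in range(10)]
train, valid = stratified_split(rows, lambda row: row[1], 0.8, seed=42)
print(len(train), len(valid))  # 24 6
```

Calling the function again with the same seed returns the same two sets, which is the effect of checking “Use random seed” in the Partitioning node.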

  4. Build and use the model

    We will use a decision tree for building the model. We will use the standard configuration. Add the Decision Tree Learner node to the workflow and connect the top port of the Partitioning node to it. Then execute the learner.

    executed learner

    Now add the Decision Tree Predictor node and connect the lower port of the Partitioning node to it. Then connect the model output port of the learner to the predictor and execute the predictor.

    executed predictor

  5. Validate the model

    Now we need to check how well our model predicts the correct flower class. To do so, add the Scorer node and connect the predictor node to it. Then configure the Scorer node.

    configure scorer

    Execute the scorer. Then right-click on it and open the accuracy statistics. This will display information on the performance of your model.

    accuracy statistics

    The question is: What is a good model? Simply put, you will want to get a high accuracy. For this model, that is enough. In other, more complex scenarios you might need to specifically avoid either false positives or false negatives. That means a model with lower accuracy but no false negatives can be better than a model with higher accuracy that does produce false negatives. Context matters a lot. An example of this would be an HIV test. False positives in small numbers are not bad, because you can just redo the test and it will almost certainly be negative the second time. However, a false negative is unacceptable. You won’t redo the test and, even worse, you will then probably infect someone else.
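
The trade-off between accuracy and false negatives can be made concrete with a little arithmetic. The following Python sketch uses made-up labels for two hypothetical models, not results from any real test; the function name `summarize` is an assumption for illustration.

```python
def summarize(actual, predicted, positive="positive"):
    """Count the outcome types for a two-class problem and compute the accuracy."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    false_neg = sum(1 for a, p in pairs if a == positive and p != positive)
    false_pos = sum(1 for a, p in pairs if a != positive and p == positive)
    true_neg = len(pairs) - true_pos - false_neg - false_pos
    return {
        "accuracy": (true_pos + true_neg) / len(pairs),
        "false_negatives": false_neg,
        "false_positives": false_pos,
    }


# Made-up example: 2 truly positive and 8 truly negative cases.
ACTUAL = ["positive"] * 2 + ["negative"] * 8
# Hypothetical model A misses one positive case but raises no false alarm.
MODEL_A = ["positive", "negative"] + ["negative"] * 8
# Hypothetical model B finds both positives but raises 3 false alarms.
MODEL_B = ["positive"] * 2 + ["positive"] * 3 + ["negative"] * 5

print(summarize(ACTUAL, MODEL_A))  # accuracy 0.9, 1 false negative
print(summarize(ACTUAL, MODEL_B))  # accuracy 0.7, 0 false negatives
```

Model A has the higher accuracy, yet in a setting like the HIV test model B would be the safer choice because it misses no positive case.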

Download KNIME workflow

After creating the workflow you can play with the settings in the learner node, or change the relative amount in the Partitioning node, and see how it affects the accuracy of the model. KNIME also has different algorithms for creating models. You could alternatively try a neural network or a support vector machine. For this data all of these algorithms work pretty well. You can also install the Weka extension for KNIME and get access to tons of machine learning functionality from Weka.

If you have a chemistry background you might also be interested in the KNIME Labs Decision Tree Ensemble extension. The Tree Ensemble Learner it contains can use fingerprints (bit vectors) for creating models. This is especially cool because KNIME is chemistry-aware. It can read SD files, and you can generate the fingerprints directly in KNIME by using, for example, the RDKit extension.

I hope this article helped you get started with machine learning. Cheers.


Written by kienerj

May 8, 2014 at 14:44

Posted in Tools


Using Maven and Mercurial with Netbeans IDE


Introduction

For me, programming is mainly a hobby. I do write the occasional script to simplify and speed up repetitive tasks at work, or create a simple PHP web page for data entry. However, I have never worked professionally as a programmer or developed in a team. Hence this post is intended for readers who are fairly new to developing and maintaining applications, for example someone who is starting their first open-source project, or developers who have recently switched to Java and have never used these tools before or in combination. New Maven and/or Mercurial users should read some tutorials about them first, as I’m not going into any detail about what these tools are meant for. I’m also rather a newbie in using Maven and Mercurial, and I would appreciate comments that point out errors or misconceptions in this post.

Getting the tools

Netbeans can be downloaded here and Maven is already bundled with it. For Mercurial on Windows I highly recommend TortoiseHg; otherwise you can download Mercurial from here. You issue Mercurial commands on the command line with “hg”, for example hg commit. If you are wondering why “hg”, I guess you slept a little too often during chemistry classes. Hg is the element symbol of mercury in the periodic table. After downloading, please install Netbeans and Mercurial.

Configure Netbeans

To configure Maven, start Netbeans and go to Tools -> Options. Then click on Miscellaneous and select the Maven tab. There is no need to change anything here for it to work properly. However, for anyone working with Windows roaming profiles I highly suggest manually setting the folder for the local repository, because by default it will be created somewhere in the user profile.

To configure Mercurial, switch from the Maven tab to the Versioning tab. Select Mercurial from the list of versioning systems and set the user name, i.e. the name used for commits, and the path to the Mercurial executable. Leave the other settings at their defaults, then click OK.

Setting up a new Maven Project tracked by Mercurial

In Netbeans go to File -> New Project… A new dialog opens. Select Maven in the Categories list, then select the type of the desired project in the Projects list and follow the further instructions, which will differ depending on your selection.

After the project is created, right-click on it and select Versioning -> Initialize Mercurial Project. Then right-click on the project again and select Mercurial -> Properties. The default-pull and default-push properties define the central location with which you synchronize your Mercurial repository. For Mercurial I can highly recommend bitbucket. It offers free hosting of private and public repositories, plus you can also host git projects using the same account. In case you use bitbucket, set the default-pull and default-push properties to https://bitbucket.org/<your user name>/<project name on bitbucket>.

In the Mercurial menu you can execute common Mercurial actions like committing or reverting the project.

Configure Maven to work with Mercurial

This configuration has to be done for each project separately. It will allow you to use Maven to create a new release of your project and create the corresponding release tag in Mercurial. In the project created above, go to Project Files and open pom.xml. Add the following snippet to the pom file and adjust it to your settings. It doesn’t matter where you add it, but it must of course be within the “project” element. In case you work alone on the project I suggest using this syntax:

<scm>
    <connection>scm:hg:file:///<full local path to project folder></connection>
    <developerConnection>scm:hg:file:///<full local path to project folder></developerConnection>
</scm>


Then right-click on the project and go to Properties -> Actions.  Click on Add Custom… and enter “Release”.  In the Execute Goals Text Box enter

buildnumber:hgchangeset release:clean release:prepare release:perform -Dgoals=install


First the buildnumber plugin will put the current Mercurial Revision number into the variable ${changeSet}. You can use this variable in other Maven plugins like the maven jar plugin and hence put the revision number in the manifest file. To do so add the following to your pom.xml:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifestEntries>
                        <Mercurial-Revision>${changeSet}</Mercurial-Revision>
                    </manifestEntries>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>


Also follow this post for further information. Note that with the above configuration you will get an empty entry in the manifest if you run the Netbeans Build action by right-clicking on the project and selecting Build. It will only be populated if you run the Release action. This is OK, because you can build a project even if there are local uncommitted changes, in which case the build cannot have a meaningful revision number.

Then release:clean will remove any traces left by previous failed runs of release:prepare. Then release:prepare will do exactly that, prepare the project for being released, while release:perform will do the actual releasing. release:perform only works if you previously ran release:prepare. See the Maven release plugin documentation for more information. Note that release:prepare will only work if you have no local changes, i.e. you must commit all changes prior to releasing, and if there are no unknown files in the whole directory tree of the project. You either need to track a file or exclude it by adding it to the .hgignore file.
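
To find out beforehand which files would get in the way, you can look at the output of hg status, where each line starts with a one-letter status code (for example M for modified and ? for not tracked). The following Python sketch sorts such output into the two groups mentioned above. It is only a pre-flight helper under stated assumptions, not what the release plugin does internally; the function name `blocking_files` and the sample file names are made up.

```python
def blocking_files(status_output):
    """Split the text printed by `hg status` into uncommitted and unknown files.

    Every non-empty line has the form "<code> <path>". The code "?" marks a
    file that is neither tracked nor ignored; any other code shown by a plain
    `hg status` (M, A, R, !) is treated here as an uncommitted change.
    """
    uncommitted, unknown = [], []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        code, path = line[0], line[2:]
        if code == "?":
            unknown.append(path)
        else:
            uncommitted.append(path)
    return uncommitted, unknown


# Hypothetical output; in a real project the text could be obtained with
# subprocess.run(["hg", "status"], capture_output=True, text=True).stdout
SAMPLE = "M pom.xml\nA src/main/java/App.java\n? notes.txt\n"

uncommitted, unknown = blocking_files(SAMPLE)
print("commit first:", uncommitted)
print("track or add to .hgignore:", unknown)
```

If both lists are empty, the working directory is clean in the sense described above.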

release:perform needs to have the goals parameter set to install (-Dgoals=install) because by default it expects distribution management to be configured. With distribution management you can automatically deploy a new release to a remote Maven repository. This implies that you have such a repository, but I guess if you are reading this post you don’t have one. See the Maven pom reference for information about Distribution Management. If you have a remote Maven repository, then configure Distribution Management and remove the goals parameter.

Considerations for Advanced Users

If there are possibly multiple users working on the same project I suggest to set the connections for the maven scm plugin to the remote repository, in this example on bitbucket:

<scm>
    <connection>scm:hg:https://bitbucket.org/<your user name>/<project name on bitbucket></connection>
    <developerConnection>scm:hg:https://bitbucket.org/<your user name>/<project name on bitbucket></developerConnection>
    <url>http://bitbucket.org/<your user name>/<project name on bitbucket>/src</url>
</scm>


The pom.xml should be tracked by Mercurial too, and if you use local file paths, they won’t match for other developers. This of course could lead to tons of useless changes to the pom. However, there are several downsides to using the remote repository, starting with having to supply login credentials. This can be automated by securely putting the credentials in the Maven settings file. There are more downsides, though: the release goals push to the remote repository before the release is complete, at a point where it can still fail, and release:perform does a checkout from the remote, which can be troublesome with large repositories. Please read this blog post for more detailed information.

Another approach can be found here. Configure the Maven release plugin to check out from the local repository and to perform no push:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-release-plugin</artifactId>
    <configuration>
        <localCheckout>true</localCheckout>
        <pushChanges>false</pushChanges>
    </configuration>
</plugin>


As you can see, all of these solutions lead to a certain overhead. With the last one you will need to push to the remote repository manually, or create a script that does all the steps, which will require the Maven exec plugin.

Written by kienerj

November 3, 2011 at 15:01

Posted in Java, Programming, Tools


Connect to Ubuntu from Windows using SSH


Intro

If you have read my previous post you know I had some trouble getting files from Windows to Ubuntu. This was caused by the fact that internet access was blocked by a proxy and I could not use apt-get to install the OpenSSH server. Well, since this works now, let’s start.

Install OpenSSH Server

In Ubuntu run the following command:
sudo apt-get install openssh-server

For further information and configuration please see this. We will continue with the default configuration.

Install WinSCP and PuTTY

  1. Download WinSCP and PuTTY.
  2. Install WinSCP
  3. Move the downloaded putty.exe to your desired folder.
  4. Copy that folder path.
  5. Run WinSCP and on the left menu click on Preferences
  6. Click on the Preferences button
  7. Go to Integration -> Applications
  8. Paste the Path to putty.exe in the according text field and click ok.

Connect to Ubuntu

In the WinSCP left-side menu go to Session, select SFTP as the file protocol, and enter the Ubuntu server’s IP address as the host name together with the user name and password of a valid Ubuntu user. Then click on “Login”. You should now be connected to the Ubuntu server and be able to easily transfer files between the 2 computers.

If you want to execute terminal commands, I suggest using PuTTY, either directly by running putty.exe or from within WinSCP via Commands -> Open in PuTTY.

Increased Security

For better security you may choose to use Public key authentication. Follow this guide to set it up and then this guide to simplify future usage.

WinSCP and sudo

Sometimes it can be handy to edit config files through WinSCP and hence with your preferred editor. To enable file transfer and editing of files with root privileges, do the following in Ubuntu:

  1. Connect to the Ubuntu terminal with PuTTY.
  2. cd /etc/sudoers.d
  3. Create a new file: sudo vi NoPWForSftp (you may choose a different file name)
  4. Add yourUserName ALL=NOPASSWD: /usr/lib/openssh/sftp-server to this file.
  5. Save the file and exit from vi (:w and then :q).
  6. Change the file permissions to 0440 (r--r-----):
  7. sudo chmod o-r NoPWForSftp removes the read privilege for “other” (see Ubuntu Help for further info)
  8. sudo chmod ugo-w NoPWForSftp removes the write privilege for everyone
  9. Check the privileges: ls -l
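
If you wonder how the two chmod calls end up at 0440, the arithmetic can be checked with a few lines of Python. This is only a worked example and does not touch any file. It assumes that the freshly created file starts out with mode 0644 (rw-r--r--), which depends on your umask.

```python
import stat

# Assumption: a newly created file has mode 0644 (rw-r--r--).
DEFAULT_MODE = 0o644


def remove_bits(mode, bits):
    """Clear the given permission bits, like the minus sign in chmod does."""
    return mode & ~bits


# chmod o-r: remove read for "other"
mode = remove_bits(DEFAULT_MODE, stat.S_IROTH)
# chmod ugo-w: remove write for user, group and other
mode = remove_bits(mode, stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)

print(oct(mode))                            # 0o440
print(stat.filemode(stat.S_IFREG | mode))   # -r--r-----
```

The second line printed is the same permission string that ls -l shows in its first column.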

Now switch to Windows and in WinSCP do the following:

  1. Check “Advanced Options”
  2. If you have a stored session edit it by clicking on “Edit” at the right side of the WinSCP Window
  3. In Environment->SFTP in SFTP server text box enter sudo /usr/lib/openssh/sftp-server
  4. Click “Save” and then try to log in


You should now be able to edit files in Windows in your favorite editor. I suggest being cautious with this and always creating a copy of a file before editing it.

Written by kienerj

October 27, 2011 at 10:34

Posted in Linux, Operating Systems, Tools, Windows
