Machine Learning

What do you need to start doing machine learning?

Machine learning is usually divided into three categories:

  • supervised. Here the learning process builds connections between known input variables and known output values. Together, the input variables and their output values form the labelled dataset. This type of machine learning includes algorithms such as decision trees, regression analysis, neural networks and others.
  • unsupervised. Here the data is unlabelled: the variables are not paired with known output values. The algorithm has to find structure in the data and create the labels itself. One of the unsupervised algorithms is k-means clustering.
  • reinforcement. This category of machine learning continually improves its model by using feedback from previous iterations. Standard reinforcement learning has measurable performance criteria, but the output is graded rather than labelled.
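
To make the unsupervised case more concrete, here is a minimal sketch of k-means clustering on one-dimensional data. The function name k_means_1d, the sample points and the starting centroids are illustrative assumptions; real projects normally rely on a library implementation.

```python
def k_means_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids (simplified k-means sketch)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabelled data and starting centroids.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```

No labels are supplied here: the two groups emerge only from the distances between the points.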

To start doing machine learning you will need a set of input data (variables), a programming language and libraries to process the data, algorithms, and visualisation tools.

An important step in machine learning is data scrubbing: the process of modifying and refining the data, for example by deleting duplicate or incomplete records, so that it is easier to work with. The first step is to select the most important features of the dataset.
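
As a rough illustration of data scrubbing, the sketch below keeps only the selected feature columns and drops incomplete and duplicate rows. The records, the column names and the scrub helper are hypothetical, and the features are chosen by hand rather than by any selection algorithm.

```python
# Hypothetical raw records; None marks a missing value.
raw_rows = [
    {"rooms": 3, "area": 70.0, "colour": "red", "price": 200},
    {"rooms": 2, "area": 50.0, "colour": "blue", "price": 150},
    {"rooms": 2, "area": 50.0, "colour": "blue", "price": 150},    # duplicate
    {"rooms": None, "area": 65.0, "colour": "red", "price": 180},  # incomplete
]


def scrub(rows, features, target):
    """Keep the chosen columns, then drop incomplete and duplicate rows."""
    columns = list(features) + [target]
    seen = set()
    cleaned = []
    for row in rows:
        record = tuple(row.get(name) for name in columns)
        if any(value is None for value in record):
            continue  # skip rows with a missing value in a selected column
        if record in seen:
            continue  # skip rows identical to one already kept
        seen.add(record)
        cleaned.append(dict(zip(columns, record)))
    return cleaned


# Feature selection is done by hand here: "colour" is left out.
cleaned_rows = scrub(raw_rows, features=["rooms", "area"], target="price")
```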

The next step is row compression, where two or more rows are merged into one. Rows can contain both text and numeric data, which makes compression more complicated. One-hot encoding converts text (categorical) values into numerical values, which makes this conversion easier: each category becomes its own column holding 1 or 0. Binning is the process of converting numeric data into categories (bins), which can in turn be one-hot encoded into True/False values. Datasets very often contain missing data. Missing values should be approximated, and there are a few techniques for doing so: use the mode value or use the median value. Alternatively, rows with missing values can be removed from the dataset.
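
The three techniques from this step can be sketched in plain Python as follows. The helper names, the bin edges and the sample values are assumptions made for illustration only.

```python
from statistics import median


def one_hot(values):
    """Turn each text value into a dict of 0/1 columns, one per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def bin_value(value, edges, labels):
    """Map a number to a category; labels must have one more item than edges."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


def fill_missing_with_median(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    middle = median(known)
    return [middle if v is None else v for v in values]


encoded = one_hot(["red", "blue", "red"])
size = bin_value(45, edges=[30, 60], labels=["small", "medium", "large"])
filled = fill_missing_with_median([1, None, 3, 10])
```

For the sample calls, 45 falls into the "medium" bin and the missing value is replaced by the median, 3.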

The next step is to shuffle the data to avoid bias, and split it into training and test data. The test data should be around 20-30% of all data, and the training data the remaining 70-80%. After the data is randomised, the model is fitted to the training data. Here the model works with the input variables and the known output values of the training set. As a result, the model learns the relationship between the input variables and the output data.
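
A minimal sketch of the shuffle-and-split step, using only the standard library. The function name shuffle_and_split, the fixed seed and the 30% test share are illustrative choices.

```python
import random


def shuffle_and_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (training_rows, test_rows)."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Ten hypothetical rows, 30% of which are held back for testing.
train_rows, test_rows = shuffle_and_split(range(10), test_fraction=0.3)
```

The model is then fitted on train_rows only, and test_rows stays untouched until evaluation.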

The next step is to analyse prediction accuracy. A frequently used method is the mean absolute error (MAE): each prediction is compared with the actual value, and the absolute errors are averaged into a single error score.
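
The mean absolute error can be computed in a few lines. The sample values below are made up purely to show the arithmetic.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values: the errors are 1, 0 and 2, so the MAE is 3 / 3 = 1.0.
mae = mean_absolute_error([3, 5, 8], [2, 5, 10])
```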

Another important technique, which can replace a simple split into training and test data, is cross validation. It can be exhaustive (every possible split is tested) or non-exhaustive (only some splits are tested, as in k-fold validation), and it maximises the availability of training data.
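
As one example of non-exhaustive cross validation, the sketch below generates the index splits for k-fold validation: the data is cut into k folds, and each fold serves once as test data while the remaining folds are used for training. The function name k_fold_indices is an assumption, and the rows are expected to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Ten hypothetical samples split into three folds.
folds = list(k_fold_indices(10, 3))
```

Every sample ends up in the test fold exactly once, so all of the data contributes to both training and evaluation.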