What is labeled and unlabeled data

This post answers the question “What is data in machine learning?”. Data is at the core of the learning process in ML.

Data in machine learning can come from any source, be of any size and have any format. Big data can also be handled, using different techniques that will be explained later in this module.

We will use the following terminology when talking about data.

Feature – a column of data in ML that the algorithm uses as input.

Instance – a row of data in the dataset.

Feature vector – the list of feature values that describes a single instance.

Dimension – a set of attributes describing a property of the data. The number of features in a feature vector is called its dimensionality.

Dataset – a collection of instances (rows). Datasets can be of different types.

Training dataset – the dataset that the model is built (trained) on.

Testing dataset – the dataset that is used for testing the model.

Evaluation dataset – the dataset that is used for final verification of the model.

Coverage – a measure that determines the confidence of the predicting model.
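To make the terms above more concrete, here is a minimal sketch in plain Python. The toy dataset, the feature names (height_cm, weight_kg) and the 75/25 split ratio are assumptions made up for this illustration; they do not come from a real data source.

```python
import random

# Hypothetical toy dataset: each row is one instance,
# each column is one feature.
feature_names = ["height_cm", "weight_kg"]
dataset = [
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
    [175.0, 72.0],
    [164.0, 58.0],
    [190.0, 95.0],
    [168.0, 61.0],
    [177.0, 77.0],
]

# A feature vector is the list of feature values of a single instance.
feature_vector = dataset[0]

# The number of features in the vector is its dimensionality.
dimension = len(feature_vector)

# A feature is a column: here, the height of every instance.
heights = [row[0] for row in dataset]

# Shuffle a copy of the dataset, then split it into a training
# dataset and a testing dataset (75/25 is an assumed ratio).
rng = random.Random(42)
shuffled = dataset[:]
rng.shuffle(shuffled)
split = int(len(shuffled) * 0.75)
training_dataset = shuffled[:split]
testing_dataset = shuffled[split:]

print("dimension:", dimension)
print("training instances:", len(training_dataset))
print("testing instances:", len(testing_dataset))
```

In practice an evaluation dataset would be set aside in the same way, as a third part that is used neither for training nor for testing.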

Data in ML can be of two types: labeled and unlabeled. Unlabeled data is raw data as it comes from the source, with nothing attached to it. Labeled data is data that has a special label (tag) assigned to each instance. For example, a set of photos in which every photo is tagged with what it shows can be considered labeled data; the same photos without tags are unlabeled data. Learning models can be applied to both types of data. In many cases, the most precise learning models are obtained by using both labeled and unlabeled data.

Labeled data is used in supervised learning, and unlabeled data is used in unsupervised learning. Semi-supervised learning uses both labeled and unlabeled data.
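The difference between the two types of data can be sketched in a few lines of Python. Everything here is an assumption for illustration: the feature values, the "cat" and "dog" labels, and the use of a 1-nearest-neighbour rule, which was picked only because it is short. The last step shows one possible semi-supervised idea (self-training with pseudo-labels), not the only one.

```python
import math

# Labeled data: every instance comes with a label (hypothetical values).
labeled = [
    ([1.0, 1.2], "cat"),
    ([0.8, 1.0], "cat"),
    ([3.0, 3.1], "dog"),
    ([3.2, 2.9], "dog"),
]

# Unlabeled data: feature vectors only, no labels attached.
unlabeled = [[0.9, 1.1], [3.1, 3.0]]


def predict(features, examples):
    """Return the label of the closest labeled example (1-nearest-neighbour)."""
    nearest = min(examples, key=lambda ex: math.dist(ex[0], features))
    return nearest[1]


# Supervised use: learn from labeled data, predict labels for new instances.
predictions = [predict(x, labeled) for x in unlabeled]

# Semi-supervised idea (self-training sketch): treat the predictions as
# pseudo-labels and add those instances to the labeled set.
pseudo_labeled = list(zip(unlabeled, predictions))
extended = labeled + pseudo_labeled

print(predictions)
print(len(extended), "labeled instances after pseudo-labeling")
```

An unsupervised method would work on the unlabeled list alone, for example by grouping similar feature vectors together without ever seeing a label.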

More educational content is available in the Reddit community r/ElectronicsEasy.