01 - Introduction


  • Description:: ML program, examples, formulations Reference:: T. Mitchell. ML Ch1 - 2

What is ML?

  • Machine Learning is programming computers to improve a performance criteria using example data or past experience
  • ML is the task of producing knowledge from data
  • It is useful to explain something, or human expertise does not exist or in specific cases (ex. image recog)

The core ingredients are: Optimized Algorithms, Big Data and GPUs.

Learning problem

  • learning is the process of improving a knowledge on a task with experience

    • where T are tasks, P is performance measure and E is experience
    • they are fundamentals: without these elements we CANNOT talk about ML
  • the core difference between ML and doing a task, is that in ML we’re interested in improving performance over time

  • a well defined example: checkers!

    • well defined because the fundamentals are defined
      • T: Play Checkers
      • P: % of games won in world tournament
      • E: opportunity to play against self
    • the training experience could be:
      • human evaluated
      • human suggested
      • comp vs human
      • ia vs. ia
    • the key is to choose a target function, for example, given the board, return the best move
    • a target function is the function that the algorithm > is trying to learn and predict
    • an example is LMS: we can define every variable as a linear combination of weights, then compute error and updates weights
    • after training we can use the learned function that we can use to play the game and gain some insights depending on wins or losses
  • a dataset is the union of the experiences

    • the data must be homogeneous

ML Problems

Although many others exist, we are going to see these kinds of ML Problems.

  • supervised learning
    • classification
    • regression
  • unsupervised learning
  • reinforcement learning

The difference is between datasets and inputs.



  • ML

Machine Learning is the task of learning a function or approximate function, given a dataset

  • learning a function

computing an approximated function f^ that returns values as close as possible to f, especially for samples x not present in the training set D

  • Dataset

Set of samples that contains information about f

  • Function

f: X → Y: given a training set D containing information about f

We never get to f! We always aim to calculate an approximated function that is close as much as possible to f!

ML Problems

  • supervised learning
    • the dataset is a set of pairs consisting on sampling the function
    • x and y are two values that we obtain when we sample the function in that point
      • classification and regression
  • unsupervised learning
    • we only have x, not y, we don’t have any information about the output
  • reinforcement learning
    • the dataset is different → we have partial results in it!

Input X ⇒ discrete (product of finite sets) or continuous (infinite sets) Input Y ⇒ regression (product of finite sets) or classification (infinite sets) Special case: X ⇒ A1 x … Am (Ai finite) and Y ⇒ { 0, 1 } (output is just a boolean) (concept learning)


  • it is also called “Pattern Recognition”
  • examples: face recognition, handwriting recog, speech recog, medical diagnosis…


  • approximating!


  • no output available
  • clustering: grouping similar istances
  • examples: customer segmentation, image compression, learning motifs (genoma)…

Concept Learning in depth


c: target fn, X → {0,1} X: instance space x e X: one instance D = {(xi, c(xi) i=1 to n)} → dataset c(x) true if x is in D

H: hypotesis space h e H: approximation of c → one possible solution h(X): estimation of over x (predicted value)


Given a dataset, find the best approssimation h* e H of the target function c:X → {0,1}

  1. define the hypothesis space H
  2. define a performance metric to determine the best approximation
  3. define an appropriate algorithm

h is consistent with D iff h(x) = c(x) for each training example (x, c(x)) in D

Inductive learning hypothesis

“Any h that approx the target fn well over a sufficiently large set of training examples will also approx the target fn well over other unobserved examples.”

“If approx is good for a large set of examples, it will be good for unobserved ones too”

It is an assumption though!

ex. playtennis!

Version space

The version space VS h,d is a subsite of hypotheses from H consistent with all training examples in D. VS h,d ⇒ { h e H | Consistent(h, D) }

List-then-eliminate algorithm

This is a trivial bruteforce algorithm.

  1. VersionSpace → a list containing every hypotesis in H
  2. Foreach training example, remove from VersionSpace any hypothesis h for which h(x) != c(x)
  3. Output the list of hypotheses in VersionSpace


  1. VS H’D (H’ can represent all the subsets of X)
    • cannot predict istances not in D
  2. VS HD (H can not represent all the subsets of X)
    • limited prediction instances not in D
  3. h* e VS HD (h* is one hypothesis chosen within the VS)
    • max prediction power on valeus not in D

If hypotesis is too powerful (any subset can be represented in it) and the search is complete (all the solutions are computed) then the system is not able to classify new istances (no generalization power).

Noisy data

Noisy data refers to data that contains errors or inconsistencies. These errors can come from various sources such as sensor inaccuracies, human mistakes during data entry, or other imperfections in data collection processes. They can significantly impact the performance and accuracy of a model.

Challenges of Noisy Data:

  • Misleading Patterns: Noisy data can introduce patterns that do not actually exist in the underlying data distribution. Machine learning models might learn these patterns, leading to incorrect predictions on new, unseen data.

  • Reduced Accuracy: Noise can obscure the true relationships between variables, making it harder for machine learning algorithms to accurately capture these relationships.

Handling Noisy Data:

  • Data Cleaning: Identifying and correcting errors in the data is crucial. Techniques such as outlier removal, imputation, and filtering can be applied to clean the data.

  • Robust Algorithms: Some machine learning algorithms are more robust to noisy data. For instance, decision trees and random forests are less affected by noise compared to linear models.

Representation vs. Generalization Power


Representation is to the capacity of a model to learn and encapsulate complex patterns from the input data. Deep learning models, such as neural networks, are capable of learning intricate representations from raw data. A rich representation allows the model to understand different features and patterns within the data.

Generalization Power

Generalization is the ability of a machine learning model to perform well on unseen, new data. A model with good generalization can make accurate predictions on data it has never seen before. Achieving good generalization is the primary goal in machine learning, as the ultimate purpose is to build models that perform well in real-world scenarios.


There exists a fundamental trade-off between the model’s representation and its generalization power. A model with very high representational capacity (like a deep neural network with many layers and parameters) can memorize the training data, including its noise and outliers. This may lead to poor generalization because the model has become too specific to the training data and fails to generalize to new, unseen data.

On the other hand, simpler models with lower representational capacity (like linear models) might not capture all the complexities of the data, leading to underfitting. Underfit models lack the capacity to learn important patterns in the data, resulting in poor performance on both the training and unseen data.