About   AI - ML - (A)NN - DNN - DL - LLM

Core terminology
AI = Artificial Intelligence: trying to do (at least as much/as well as) what humans do.
      TED talk by Ilya Sutskever, the (overly?) optimistic view
ML = Machine Learning: a computational model that 'learns' from data to 'predict' outcomes (values) of an unknown function f(x).
ANN = Artificial Neural Network = NN (inspired by neural circuitry in our brain)
      consists of an input layer, hidden layer(s), and an output layer
      Each layer consists of (many) neurons, each connected to all neurons in the next layer.
      The strength of a connection is a weight w_ij, to be "learned" during "training"
          Depth = # of hidden layers, Width = # of neurons in a layer
   
A neuron is a simple function of an input vector x: y = σ( w·x + b ), where σ() is an 'activation function' (sigmoid, ReLU, ...), w a weight vector, b a bias
  So an NN repeatedly composes the activation function with linear combinations of its inputs.
  'Simple', yet profoundly capable: it is a universal approximator!!!
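
  To make the neuron formula concrete, here is a minimal sketch in Python/NumPy (the weights and sizes are arbitrary illustrative numbers, not taken from these notes): a single neuron y = σ(w·x + b) and a small network that composes such layers.

      import numpy as np

      def sigmoid(z):
          # activation function: sigma(z) = 1 / (1 + exp(-z))
          return 1.0 / (1.0 + np.exp(-z))

      def neuron(x, w, b):
          # a single neuron: y = sigma(w . x + b)
          return sigmoid(np.dot(w, x) + b)

      def forward(x, layers):
          # a feedforward NN: repeatedly apply sigma(W x + b), layer by layer
          for W, b in layers:
              x = sigmoid(W @ x + b)
          return x

      # illustrative numbers only: 2 inputs -> 3 hidden neurons -> 1 output
      layers = [(np.array([[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]), np.array([0.1, -0.2, 0.0])),
                (np.array([[1.0, -0.5, 0.25]]), np.array([0.05]))]
      print(neuron(np.array([0.2, 0.9]), np.array([0.4, -0.6]), 0.1))
      print(forward(np.array([0.2, 0.9]), layers))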

DNN = Deep Neural Network: more than one 'hidden' layer, may have millions/billions of parameters(weights)!
DL = Deep Learning: using DNNs
LLM = Large Language Model: text (and image/video) processing; predicts the next 'word' (token)

Important types of neural networks
  CNN = Convolutional NN, good for pattern recognition, image classification
  RNN = Recurrent NN: feedback loops can act as "memory", good for sequence prediction
        * LSTM = Long Short-Term Memory NN
  Transformer: sequence model based on self-attention rather than recurrence (a minimal self-attention sketch follows this list)
  GAN = Generative Adversarial Network: has a generator NN and a discriminator NN
        that compete against each other to improve performance
  GPT = Generative Pre-trained Transformer: LLM (by OpenAI)
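
  As a rough illustration of the 'self-attention' idea behind Transformers (a minimal sketch, not any particular library's API; the matrices are random placeholders): each token's vector is rewritten as a softmax-weighted mixture of all tokens' vectors, with weights coming from query-key dot products.

      import numpy as np

      def softmax(z, axis=-1):
          z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
          e = np.exp(z)
          return e / e.sum(axis=axis, keepdims=True)

      def self_attention(X, Wq, Wk, Wv):
          # X: (n_tokens, d) token vectors; Wq, Wk, Wv: learned projection matrices
          Q, K, V = X @ Wq, X @ Wk, X @ Wv
          scores = Q @ K.T / np.sqrt(K.shape[-1])      # scaled dot-product similarities
          A = softmax(scores, axis=-1)                 # attention weights, each row sums to 1
          return A @ V                                 # each token = weighted mix of all tokens

      # toy example: 4 tokens of dimension 3, random (illustrative) projections
      rng = np.random.default_rng(0)
      X = rng.normal(size=(4, 3))
      Wq, Wk, Wv = (rng.normal(size=(3, 3)) for _ in range(3))
      print(self_attention(X, Wq, Wk, Wv).shape)       # -> (4, 3)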

Main types of ML
  • supervised learning: learn (parameters) from labeled data (x_i, y_i),
        * done by "training": find parameters (weights) that minimize a Loss Function
              e.g. minimize the Least Squares error ( + other terms )
        * done by optimization: Gradient Descent / Stochastic GD / Adam (ADAptive Moment estimation), which requires gradients
        * implemented via backpropagation (chain rule) -- see the sketch after this list
  • unsupervised learning: unlabeled data, discover patterns/features
  • reinforcement learning: a Markov Decision Process; learn a policy that maximizes a "reward function"
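
  A minimal sketch of supervised learning as described above (toy data and sizes, my own choices): fit a 1-hidden-layer NN to labeled pairs (x_i, y_i) by minimizing the least-squares loss with plain gradient descent, where the gradients come from the chain rule (backpropagation).

      import numpy as np

      rng = np.random.default_rng(1)
      x = rng.uniform(-1, 1, size=(200, 1))
      y = np.sin(3 * x)                                  # the "unknown" function f(x) to learn

      W1, b1 = rng.normal(size=(1, 16)), np.zeros(16)    # 1 input -> 16 hidden neurons
      W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)     # 16 hidden -> 1 output
      lr = 0.05                                          # learning rate (step size)

      for step in range(2000):
          # forward pass: h = tanh(x W1 + b1), prediction p = h W2 + b2
          h = np.tanh(x @ W1 + b1)
          p = h @ W2 + b2
          loss = np.mean((p - y) ** 2)                   # least-squares loss

          # backward pass (chain rule = backpropagation): gradient of the loss w.r.t. each parameter
          dp = 2 * (p - y) / len(x)
          dW2, db2 = h.T @ dp, dp.sum(axis=0)
          dh = (dp @ W2.T) * (1 - h ** 2)                # derivative of tanh is 1 - tanh^2
          dW1, db1 = x.T @ dh, dh.sum(axis=0)

          # gradient-descent update (Stochastic GD / Adam would refine this step)
          W1 -= lr * dW1; b1 -= lr * db1
          W2 -= lr * dW2; b2 -= lr * db2

      print("final least-squares loss:", loss)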

    Universal Approximation property
      * Any continuous function of n real variables on a compact set can be approximated
          uniformly by a 1-hidden-layer NN (Cybenko, 1989)
      * Multilayer feedforward networks are universal approximators (Hornik-Stinchcombe-White, 1989):
          they can approximate any Borel measurable function from one finite-dimensional space to another
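      A more explicit (paraphrased) form of the Cybenko-style statement, written in LaTeX with σ a sigmoidal activation; the symbols v_i, w_i, b_i are labels introduced here, not from the notes:

          \forall\, f \in C([0,1]^n),\ \forall\, \varepsilon > 0\ \ \exists\, N,\ v_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n:
          \quad \Big|\, f(x) - \sum_{i=1}^{N} v_i\, \sigma(w_i \cdot x + b_i) \,\Big| < \varepsilon \quad \text{for all } x \in [0,1]^n
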
    Brief history of ML, ANNs, DNNs
    (see also: Timeline of machine learning; History of ANNs)

    1950s: Pioneering machine learning research using simple algorithms
      1951: first neural network that can 'learn' (Minsky-Edmonds)
      1958: Perceptron (Rosenblatt)
    1960s: Bayesian methods introduced for probabilistic inference in ML
      1967: nearest neighbour algorithm for pattern recognition
      1969: ReLU activation function introduced by Fukushima
      1969: Minsky-Papert book "Perceptrons" on limitations of NNs brings the...
    1970s: 'AI winter' caused by pessimism about ML effectiveness
      1970: automatic differentiation (AD) in NNs
      1979: CNNs
    1980s: Rediscovery of backpropagation, resurgence in ML research
      1982: RNNs popularised by Hopfield
      1986: reverse mode of AD for 'learning' => backpropagation
      1989: Reinforcement Learning (Q-learning, Watkins)
    1990s: ML shifts from knowledge-driven approach to data-driven: 'learn' from big data sets
      1991: adversarial NNs
      1992: machines playing backgammon (TD-Gammon, Tesauro)
      1992: Juergen Schmidhuber proposes fast-weight programmers, a precursor of attention/Transformers
      1995: Support-Vector Machines (SVMs)
      1997: LSTM (Long Short-Term Memory)
    2000s: unsupervised learning methods
      2002: Torch machine learning library
      2006: generative stochastic feedforward NN (Geoffrey Hinton)
    2010s: Deep Learning becomes feasible (on GPUs) for image and text processing
      2011: IBM's Watson beats humans in Jeopardy
      2012: recognize cats on YouTube from unlabeled images
         AlexNet algorithm for image recognition, 'AI Spring' begins
      2013: word embeddings (word2vec) revolutionize text processing
      2014: ADAM (ADAptive Moment estimation) by Kingma-Ba
         GAN (Generative Adversarial Network)
         leap in face recognition (DeepFace, by Facebook)
      2016: Google DeepMind's AlphaGo beats the human world champion at Go. OpenAI founded (late 2015) by Elon Musk, Sam Altman, ...
      2017: Transformer architecture by a Google Brain team, game changer for LLMs
      2018: AlphaFold best in Protein Structure Prediction
      2018: OpenAI releases GPT-1, Google releases BERT
    2020s: LLMs succeed! A game changer!
      2020: GPT-3 (175 billion parameters!); DALL·E (image generator based on GPT-3) follows in early 2021
      2022: Midjourney released in July, Stable Diffusion in Aug, DALL·E 2 in Sep (text to image generators)
           OpenAI releases ChatGPT in Nov (based on GPT-3.5); it captivates the world
      2023: release of GPT-4, Gemini, Claude, Codex, ..., DALL·E 3, ...
      2024: multimodal models: text/code/audio/video modalities, agents, ...
          hybrid models: combining physics-based and data-based models
          Explosion in valuation of AI companies, hardware and software!

    * EXPLOSIVE growth in research/development, toward Artificial General Intelligence (AGI):
              much too fast to comprehend / control ...
    * Will have unimaginable impact in everything/everywhere/forever, good and bad (... the 'Alignment Problem' ...)
    What OpenAI really wants (very long, very revealing...)
    LLMs : Large Language Models revolution!   ChatGPT, GPT-4, Bard, Claude, ..., DALL·E, ...
      * Pretrained on huge data sets (base model): guess the next token (word, ...)
      * Finetuned on labeled data --> Assistant model
      * Self-improvement (via reinforcement learning / self-play, as AlphaGo did to beat humans at Go)
    LLM Scaling Laws: performance is a smooth, well-behaved function of N = # of parameters (weights) and D = amount of training data
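      The notes give no formula; one commonly quoted parametric form of such a law (in the style of published scaling-law fits, e.g. Hoffmann et al. 2022, shown here only to illustrate "smooth in N and D"), in LaTeX:

          L(N, D) \;\approx\; E \,+\, \frac{A}{N^{\alpha}} \,+\, \frac{B}{D^{\beta}}

      where L is the test loss, N the number of parameters, D the number of training tokens, and E, A, B, α, β are fitted constants.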
    * A front-end 'tokenizer' divides the user input ('prompt') into a sequence t_1, ..., t_n of tokens;
    * the LLM computes a most probable next token t_{n+1}, then a t_{n+2} that is most likely to follow t_1, ..., t_{n+1}, and so on,
      according to the conditional probability P(t_{n+1} | t_1, ..., t_n), encoded by some function F(t, α) controlled by a parameter vector α.
    * The dimension of α might be 10^9 - 10^12 !!! The training set is of order 10^15 tokens!!!
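    A minimal sketch of that next-token loop (the model F and the token ids here are stand-ins, not a real LLM or tokenizer API): given tokens t_1..t_n, repeatedly ask F for the conditional distribution, pick the most probable token, append it, and continue.

        import numpy as np

        def generate(tokens, F, alpha, n_new):
            # tokens: list of token ids t_1..t_n; F(tokens, alpha) stands in for the trained
            # model and returns a probability vector P(t_{n+1} | t_1, ..., t_n).
            for _ in range(n_new):
                p = F(tokens, alpha)          # conditional distribution over the vocabulary
                t_next = int(np.argmax(p))    # greedy choice of the most probable next token
                # (real systems usually *sample* from p rather than take the argmax)
                tokens.append(t_next)
            return tokens

        # toy stand-in "model": always prefers the token id right after the last one
        def F_toy(tokens, alpha, vocab_size=10):
            p = np.full(vocab_size, 1e-3)
            p[(tokens[-1] + 1) % vocab_size] = 1.0
            return p / p.sum()

        print(generate([3, 4], F_toy, alpha=None, n_new=5))   # -> [3, 4, 5, 6, 7, 8, 9]
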
    Big problems:
      * 'hallucinations'
      * unimaginable deception could be unleashed... (deep fakes)
      * security: they are vulnerable to attacks such as jailbreaks, prompt injection, data poisoning, privilege escalation, ...
      * huge energy consumption for training, a crucial issue... estimated training cost: OpenAI GPT-4: $78M, Google Gemini: $191M !!!
    Good overview papers
  • Weinan E, "AI for Science", SIAM News, Dec 2023: exceptional; describes core ideas and the impact on Science
  • Weinan E, "Machine Learning and Computational Mathematics", Commun. Comput. Phys., 28, Nov 2020
  • Maja Rudolph, S. Kurz, B. Rakitsch, "Hybrid modeling design patterns", J. Math. in Industry 14, 2024: excellent
    and the best we can hope for towards AGI:
  • video: Demis Hassabis on TED, "How AI Is Unlocking the Secrets of Nature and the Universe", May 2024