About AI - ML - (A)NN - DNN - DL - LLM
Core terminology
AI = Artificial Intelligence:
trying to do (at least as much/well as) what humans do.
TED talk by Ilya Sutskever, the (overly?) optimistic view
ML = Machine Learning:
computational model that 'learns' from data
to 'predict' outcomes (values) of an unknown function f(x).
ANN = Artificial Neural Network = NN
(inspired by neural circuitry in our brain)
consists of Input layer, Hidden layer(s), Output layer
Each layer consists of (many) neurons,
connected to all neurons in next layer.
Strength of connection is a weight w_ij,
to be "learned" during "training"
Depth = # of hidden layers,
Width = # of neurons in a layer
neuron is a simple function of input vector x:
y = σ( w·x + b ) ,
σ() = 'activation function' (sigmoid,ReLU,...),
b = bias
So a NN calculates compositions of the activation function on linear combinations of inputs repeatedly.
'Simple', yet profoundly capable: it is a universal approximator!!!
DNN = Deep Neural Network: more than one 'hidden' layer,
may have millions/billions of parameters(weights)!
DL = Deep Learning: using DNNs
LLM = Large Language Model: text/image/video processing: predict next
'word' (token)
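The neuron formula y = σ( w·x + b ) and its layer-by-layer composition, as described above, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names (neuron, layer, forward) and the toy weights are assumptions made up for this example, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

def neuron(w, x, b, activation=sigmoid):
    """One neuron: y = sigma(w . x + b)."""
    return activation(sum(wi * xi for wi, xi in zip(w, x)) + b)

def layer(W, x, b, activation=sigmoid):
    """Fully connected layer: row i of W holds the weights w_ij into neuron i."""
    return [neuron(w_i, x, b_i, activation) for w_i, b_i in zip(W, b)]

def forward(params, x, activation=sigmoid):
    """Compose layers: the output of each layer is the input of the next."""
    for W, b in params:
        x = layer(W, x, b, activation)
    return x

# Hypothetical toy network: 2 inputs -> 3 hidden neurons -> 1 output.
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.2]),                                   # output layer
]
y = forward(params, [1.0, 2.0])
print(y)  # a single value in (0, 1), since the output neuron uses a sigmoid
```

Depth here is the number of (W, b) pairs before the output layer, and width is the number of rows in each W.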
Important types of neural networks
CNN = Convolutional NN, good for pattern recognition, image classification
RNN = Recurrent NN: feedback loops can act as "memory", good for sequence prediction
* LSTM = Long Short-Term Memory NN
* Transformers - self-attention (replaces recurrence; not an RNN)
GAN = Generative Adversarial Network: has a generator NN
and a discriminator NN
that compete against each other to improve performance
GPT = Generative Pre-trained Transformer: LLM (by OpenAI)
Main types of ML
supervised learning:
learn (parameters) from labeled data (xi,yi),
* done by "training": find parameters(weights) to
minimize a Loss Function
e.g. minimize the Least Squares error ( + other terms )
* done by optimization:
Gradient Descent / Stochastic GD / ADAM (ADAptive Moment estimation)
which requires gradients
* implemented via backpropagation (chain rule)
unsupervised learning:
unlabeled data, discover patterns/features
reinforcement learning: an agent in a Markov Decision Process learns to
maximize a cumulative "reward function". Transformers, self-attention
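The supervised-learning steps listed above (labeled data, a Least Squares loss, Gradient Descent, gradients via the chain rule) can be sketched for the simplest possible model, a single linear neuron y = w*x + b. With no hidden layer, backpropagation reduces to one application of the chain rule. The data, learning rate and function names below are assumptions chosen for illustration only.

```python
def mse(w, b, xs, ys):
    """Least Squares loss: mean of (prediction - label)^2 over the data set."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, lr=0.05, steps=2000):
    """Full-batch Gradient Descent on the Least Squares loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y          # prediction minus label
            # Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = (2*err/n) * x
            grad_w += 2.0 * err * x / n
            # Chain rule: dL/db = dL/dy_hat * dy_hat/db = (2*err/n) * 1
            grad_b += 2.0 * err / n
        w -= lr * grad_w                   # step against the gradient
        b -= lr * grad_b
    return w, b

# Hypothetical labeled data (x_i, y_i) generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = train(xs, ys)
print(w, b, mse(w, b, xs, ys))  # w and b should approach 2 and 1
```

Stochastic GD would use one sample (or a small batch) per step instead of the full sum, and ADAM additionally rescales each step using running averages of the gradient and its square.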
Universal Approximation property
* Any continuous function of n real variables on a compact set can be approximated
uniformly by a 1-hidden-layer NN with a sigmoidal activation function
(Cybenko, 1989)
* Multilayer feedforward networks are universal approximators
(Hornik-Stinchcombe-White, 1989)
they can approximate any Borel measurable function from one finite-dimensional space to another to any desired accuracy, given enough hidden units
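For reference, the single-hidden-layer statement can be written out as a formula. The following is a sketch of Cybenko's result in the notation used earlier (weights w_j, biases b_j, activation σ); the symbols N, α_j and I_n are introduced here for the statement and are not from the notes above.

```latex
% Let \sigma be a continuous sigmoidal function and I_n = [0,1]^n.
% For every f \in C(I_n) and every \varepsilon > 0 there exist
% N \in \mathbb{N}, \alpha_j, b_j \in \mathbb{R}, and w_j \in \mathbb{R}^n such that
\[
  G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma\!\left( w_j \cdot x + b_j \right)
  \quad\text{satisfies}\quad
  \lvert G(x) - f(x) \rvert < \varepsilon
  \quad\text{for all } x \in I_n .
\]
```

G is exactly a NN with one hidden layer of width N and a linear output neuron. The theorem guarantees existence only: it does not say how large N must be or how to find the weights.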
Brief history of ML, ANNs, DNNs
Timeline of machine learning
History of ANNs
1950s:
Pioneering machine learning research using simple algorithms
1951: ANN (Minsky-Edmonds), the first neural network that can 'learn'
1958: Perceptron(Rosenblatt)
1960s:
Bayesian methods for probabilistic inference in ML
1967: nearest neighbour algorithm for pattern recognition
1969: ReLU activation function introduced by Fukushima
1969: Minsky-Papert book "Perceptrons" on limitations of NNs
brings the...
1970s:
'AI winter' caused by pessimism about ML effectiveness
1970: automatic differentiation (AD), reverse mode (Linnainmaa)
1979: CNNs
1980s:
Rediscovery of backpropagation, resurgence in ML research
1982: RNNs popularised by Hopfield
1986: reverse mode of AD for 'learning' => backpropagation
1989: Reinforcement Learning
1990s:
ML shifts from knowledge-driven approach to data-driven:
'learn' from big data sets
1991: adversarial NNs
1992: machines playing backgammon
1992: Juergen Schmidhuber proposes 'Transformer'
1995: Support-Vector Machines (SVMs)
1997: LSTM (Long Short-Term Memory)
2000s: unsupervised learning methods
2002: Torch machine learning library
2006: generative stochastic feedforward NN (Geoffrey Hinton)
2010s: Deep Learning becomes feasible (on GPUs) for image and text processing
2011: IBM's Watson beats humans at Jeopardy!
2012: recognize cats on YouTube from unlabeled images
AlexNet algorithm for image recognition, 'AI Spring' begins
2013: word embeddings (word2vec) revolutionized text processing
2014: ADAM (ADAptive Moment estimation) by Kingma-Ba
GAN (Generative Adversarial Network)
leap in face recognition (by DeepFace of Facebook)
2016: Google's AlphaGo beats humans in Go.
OpenAI founded (Dec 2015) by Elon Musk, Sam Altman, ...
2017: Transformer architecture by a Google Brain team,
game changer for LLMs
2018: AlphaFold best in Protein Structure Prediction
2018-2019: OpenAI develops GPT-1 (2018) and GPT-2 (2019), Google releases BERT (2018)
2020s:
LLMs succeed! game changer!
2020: GPT-3 (175 billion parameters!)
2021: DALL·E (image generator) based on GPT-3
2022: Midjourney released in July, Stable Diffusion in Aug, DALL·E 2 in Sep (text to image generators)
OpenAI releases ChatGPT in Nov (based on GPT-3.5), which captivates the world
2023: release of GPT-4, Gemini, Claude, Codex, ...,
DALL·E 3, ...
2024: multimodal models: text/code/audio/video modalities,
agents, ...
hybrid models: combining physics-based and data-based models
Explosion in valuation of AI companies, hardware and software!
* EXPLOSIVE growth in research/development,
toward Artificial General Intelligence (AGI):
much too fast to comprehend / control ...
* Will have unimaginable impact in everything/everywhere/forever, good and bad (... the 'Alignment Problem' ...)
What OpenAI really wants (very long, very revealing...)
LLMs : Large Language Models revolution!
ChatGPT, GPT-4, Bard, Claude,..., DALL·E,...
* Pretrained on huge data sets (base model): guess next token(word,...)
* Finetuned on labeled data --> Assistant model
* Self-improvement via reinforcement learning (like AlphaGo did to beat humans in Go)
LLM Scaling Laws: performance is a smooth, well-behaved function of
N=#of parameters(weights), D=amount of training data
* A front-end 'tokenizer' divides user input ('prompt') into a sequence
t_1, ..., t_n of tokens;
* the LLM computes a most probable next token t_{n+1},
then t_{n+2} that's most likely to follow t_1, ..., t_{n+1}, and so on,
according to the conditional probability
P(t_{n+1} | t_1, ..., t_n),
encoded by some function F(t, α)
controlled by a parameter vector α.
* Dimension of α might be 10^9 - 10^12 !!!
Size of training set of order 10^15 tokens!!!
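The tokenize-then-predict loop described above can be illustrated with a deliberately tiny stand-in for an LLM. This is a sketch under strong assumptions: the "model" is a bigram table that conditions only on the last token t_n (a real LLM conditions on the whole context t_1, ..., t_n), the tokenizer just splits on whitespace, and all names and the toy corpus are made up for this example.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Toy tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def fit_bigram(tokens):
    """Estimate P(next | current) by counting adjacent token pairs."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for cur, ctr in counts.items()
    }

def generate(model, prompt, max_new_tokens=5):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = tokenize(prompt)
    if not tokens:
        return tokens
    for _ in range(max_new_tokens):
        dist = model.get(tokens[-1])       # P(. | last token); a bigram simplification
        if not dist:
            break                          # token never seen with a successor
        tokens.append(max(dist, key=dist.get))
    return tokens

# Hypothetical training text; the table of probabilities plays the role of α.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = fit_bigram(tokenize(corpus))
print(generate(model, "sat on", max_new_tokens=2))
```

In practice, assistants usually sample from the distribution rather than always taking the most probable token, which is one reason the same prompt can produce different answers.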
Big problems:
* 'hallucinations'
* unimaginable deception could be unleashed... (deep fakes)
* security: they are vulnerable to attacks that overrun their safeguards:
Jailbreak, prompt injection, data poisoning, escalation, ...
* huge energy consumption and cost for training, a crucial issue...
estimated training cost: OpenAI GPT-4: $78M, Google Gemini: $191M !!!
Good overview papers
Weinan E,
AI for Science, SIAM News, Dec 2023:
exceptional, describes core ideas, impact on Science
Weinan E,
Machine Learning and Computational Mathematics,
Commun. Comput. Phys., 28, Nov 2020
Maja Rudolph, S Kurz, B Rakitsch
Hybrid modeling design patterns,
J Math in Industry 14, 2024: excellent
and the best we can hope for towards AGI:
video: Demis Hassabis on TED,
How AI Is Unlocking the Secrets of Nature and the Universe, May 2024