About AI - ML - (A)NN - DNN - DL - LLM
Core terminology
AI = Artificial Intelligence:
trying to do computationally (at least as much/well as) what humans do.
ML = Machine Learning:
computational model that 'learns' from data
to 'predict' outcomes (values) of an unknown function f(x).
ANN = Artificial Neural Network = NN
(inspired by neural circuitry in our brain)
consists of Input layer, Hidden layer(s), Output layer
Each layer consists of (many) neurons,
connected to all neurons in next layer.
Strength of connection is a weight w_ij,
to be "learned" during "training"
Depth = # of hidden layers,
Width = # of neurons in a layer
neuron is a simple function of input vector x:
y = σ( w·x + b ) ,
σ() = 'activation function' (sigmoid,ReLU,...),
b = bias
So a NN calculates compositions of the activation function on linear combinations of inputs, repeatedly.
'Simple', yet profoundly capable: it is a universal approximator!!!
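The neuron formula and the layer-by-layer composition described above can be sketched in a few lines of plain Python. This is only an illustration: the 2-3-1 layer sizes and all weight values are made up (untrained), and the helper names `neuron`, `layer` and `forward` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x, b, act=sigmoid):
    """One neuron: y = act(w . x + b)."""
    return act(sum(wi * xi for wi, xi in zip(w, x)) + b)

def layer(W, x, b, act=sigmoid):
    """Fully connected layer: one neuron per row of the weight matrix W."""
    return [neuron(w_row, x, b_i, act) for w_row, b_i in zip(W, b)]

def forward(x, layers):
    """Compose layers: the output of each layer is the input of the next."""
    for W, b in layers:
        x = layer(W, x, b)
    return x

# Hypothetical 2-3-1 network: 2 inputs, 3 hidden neurons, 1 output.
hidden = ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1])
output = ([[1.0, -1.0, 0.5]], [0.2])
y = forward([1.0, 2.0], [hidden, output])
```

Here `y` is a list with one number in (0, 1), since the output neuron also uses the sigmoid. Training would adjust the entries of `hidden` and `output`.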
DNN = Deep Neural Network: more than one 'hidden' layer,
may have millions/billions of parameters(weights)!
DL = Deep Learning: using DNNs
LLM = Large Language Model: text/image/video processing: predict next
'word' (token)
Important types of neural networks
CNN = Convolutional NN, good for pattern recognition, image classification
RNN = Recurrent NN: feedback loops can act as "memory", good for sequence prediction
* LSTM = Long Short-Term Memory NN
* Transformers - self-attention (not recurrent: attention replaces the feedback loops)
GAN = Generative Adversarial Network: has a generator NN
and a discriminator NN
that compete against each other to improve performance
GPT = Generative Pre-trained Transformer: LLM (by OpenAI)
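Self-attention, named in the Transformer entry above, can be sketched as follows. This is a deliberately stripped-down version: real Transformers apply learned projection matrices to obtain queries, keys and values, whereas here Q = K = V = X is assumed, and the input vectors are made up.

```python
import math

def softmax(v):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(v)                              # subtract max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights).

    X is a list of token vectors. Each output vector is a weighted average
    of all token vectors, weighted by similarity to the current token.
    """
    d = len(X[0])
    out, weights = [], []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        a = softmax(scores)
        weights.append(a)
        out.append([sum(a[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out, weights

# Three hypothetical 2-dimensional token vectors.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(X)
```

Each row of `weights` says how much one token "attends" to every token in the sequence, including itself.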
Main types of ML
supervised learning:
learn (parameters) from labeled data (xi,yi),
* done by "training": find parameters(weights) to
minimize a Loss Function
e.g. minimize the Least Squares error ( + other terms )
* done by optimization:
Gradient Descent/StochasticGD / ADAM(ADAptive Moment estimation)
which require gradients
* implemented via backpropagation (chain rule)
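A minimal sketch of this training loop, for a single sigmoid neuron with a least-squares loss. The five data points, the learning rate and the step count are arbitrary choices for illustration. The gradients are written out by hand with the chain rule, which is what backpropagation automates for deeper networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grads(w, b, xs, ys):
    """Mean squared error of p = sigmoid(w*x + b), with its gradients.

    Chain rule for this one-neuron model, with z = w*x + b:
      dL/dp = 2*(p - y),  dp/dz = p*(1 - p),  dz/dw = x,  dz/db = 1
    """
    n = len(xs)
    loss = gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        dz = 2.0 * (p - y) * p * (1.0 - p)   # dL/dz for this sample
        loss += (p - y) ** 2 / n
        gw += dz * x / n
        gb += dz / n
    return loss, gw, gb

def train(xs, ys, lr=0.5, steps=2000):
    """Plain (full-batch) gradient descent starting from w = b = 0."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(steps):
        loss, gw, gb = loss_and_grads(w, b, xs, ys)
        history.append(loss)
        w -= lr * gw                          # step against the gradient
        b -= lr * gb
    return w, b, history

# Hypothetical labeled data (x_i, y_i): a soft step from 0 to 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0.0, 0.0, 0.5, 1.0, 1.0]
w, b, history = train(xs, ys)
```

`history` records the loss at every step, so comparing `history[0]` with `history[-1]` shows how far training reduced the error. Stochastic GD would use a random subset of the data in each step, and ADAM would additionally rescale the step using running averages of the gradients.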
unsupervised learning:
unlabeled data, discover patterns/features
reinforcement learning: learn actions in a Markov Decision Process to
maximize a (cumulative) "reward function".
Transformers, self-attention
Universal Approximation property
* Any continuous function of n real variables on a compact set (e.g. the unit cube) can be approximated
uniformly by a 1-hidden layer NN with sigmoidal activation
(Cybenko, 1989)
* Multilayer feedforward networks are universal approximators
(Hornik-Stinchcombe-White, 1989)
can approximate any Borel measurable function from one finite dimensional space to another
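A small constructive illustration of the approximation property, under stated assumptions: it uses ReLU instead of the sigmoidal activations of Cybenko's theorem, works in one variable only, and sets the weights by hand (piecewise-linear interpolation) instead of training them. The function names are hypothetical.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, a, b, width):
    """Weights of a 1-hidden-layer ReLU net that interpolates f at
    width+1 equally spaced points on [a, b].

    Hidden neuron k computes relu(x - knot_k); the output neuron is linear.
    """
    h = (b - a) / width
    knots = [a + k * h for k in range(width)]
    slopes = [(f(a + (k + 1) * h) - f(a + k * h)) / h for k in range(width)]
    # Output weight of neuron k = change of slope at knot k.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, width)]
    return knots, coeffs, f(a)

def net(x, params):
    """Evaluate the network: bias + sum_k c_k * relu(x - knot_k)."""
    knots, coeffs, bias = params
    return bias + sum(c * relu(x - t) for c, t in zip(coeffs, knots))

def max_error(f, a, b, width, samples=1001):
    """Largest |f(x) - net(x)| over a fine grid on [a, b]."""
    params = build_relu_net(f, a, b, width)
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - net(x, params)) for x in xs)

err_4 = max_error(math.sin, 0.0, math.pi, 4)     # 4 hidden neurons
err_32 = max_error(math.sin, 0.0, math.pi, 32)   # 32 hidden neurons
```

Widening the hidden layer from 4 to 32 neurons shrinks the uniform error on [0, π], which is the behaviour the theorems guarantee in general: more hidden neurons, better approximation.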
Brief history of ML, ANNs, DNNs
Timeline of machine learning
History of ANNs
1950s:
Pioneering machine learning research using simple algorithms
1951: ANN (Minsky-Edmonds), first neural network that can 'learn'
1958: Perceptron(Rosenblatt)
1960s:
Bayesian methods for probabilistic inference in ML and
1967: nearest neighbour algorithm for pattern recognition
1969: ReLU activation function introduced by Fukushima
1969: Minsky-Papert book "Perceptrons" on limitations of NNs
brings the...
1970s:
'AI winter' caused by pessimism about ML effectiveness
1970: automatic differentiation (AD) in NNs
1979: CNNs
1980s:
Rediscovery of backpropagation, resurgence in ML research
1982: RNNs popularised by Hopfield
1986: reverse mode of AD for 'learning' => backpropagation
1989: Reinforcement Learning
1990s:
ML shifts from knowledge-driven approach to data-driven:
'learn' from big data sets
1991: adversarial NNs
1992: machines playing backgammon
1992: Juergen Schmidhuber proposes 'fast weight programmers', later related to linear Transformers
1995: Support-Vector Machines (SVMs)
1997: LSTM (Long Short-Term Memory)
2000s: unsupervised learning methods
2002: Torch machine learning library
2006: generative stochastic feedforward NN (Geoffrey Hinton)
2010s: Deep Learning becomes feasible (on GPUs) for image and text processing
2011: IBM's Watson beats humans in Jeopardy
2012: recognize cats on YouTube from unlabeled images
AlexNet algorithm for image recognition, 'AI Spring' begins
2013: word2vec: vector embeddings of words revolutionized text processing
2014: ADAM (ADAptive Moment estimation) by Kingma-Ba
GAN(Generative Adversarial Network)
leap in face recognition (by Facebook's DeepFace)
2015: OpenAI created by Elon Musk, Sam Altman, ...
2016: DeepMind's AlphaGo beats humans in Go:
altered perception of AI from tool to innovation
2017: Transformer architecture by a Google Brain team,
game changer for LLMs
2018: DeepMind's AlphaFold best in Protein Structure Prediction
OpenAI developed GPT-1, Google developed BERT
2019: OpenAI developed GPT-2
2020s:
LLMs succeed! game changer!
2020: GPT-3 (175 billion parameters!)
2021: DALL·E (image generator) based on a version of GPT-3
2022: Midjourney released in July, Stable Diffusion in Aug, DALL·E 2 in Sep (text to image generators)
OpenAI releases ChatGPT in Nov (based on GPT-3.5), which captivates the world
2023: release of GPT-4, Gemini, Claude, Codex, ...,
DALL·E 3, ...
2024: multimodal models: text/code/audio/video modalities,
agents, ...
hybrid models: combining physics-based and data-based models
Explosion in valuation of AI companies, hardware and software!
2025: Agentic AI:
"agents", capable of advanced reasoning and task execution,
can autonomously perceive, decide, and act
The evolution of AI, MIT Technology Review, Feb 2025
* EXPLOSIVE growth in research/development,
toward Artificial General Intelligence (AGI):
much too fast to comprehend / control ...
* Will have unimaginable impact in everything/everywhere/forever, good and bad (... the 'Alignment Problem' ...)
What OpenAI really wants (very long article, and very revealing...)
LLMs : Large Language Models revolution!
ChatGPT, GPT-4, Claude, Gemini, Llama, ...
* Pretrained on huge data sets (base model): guess next token(word,...)
* Finetuned on labeled data --> Assistant model
* Self-improvement via reinforcement learning (like AlphaGo did to beat humans in Go)
LLM Scaling Laws: performance is a smooth, well-behaved function of
N=#of parameters(weights), D=amount of training data
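One commonly quoted parametric form of such a scaling law, given here only as an illustration and not taken from these notes, writes the test loss L as a sum of power laws in N and D. The constants E, A, B and the exponents a, b are fitted empirically and are left unspecified here.

```latex
L(N, D) \approx E + \frac{A}{N^{a}} + \frac{B}{D^{b}}, \qquad a, b > 0
```

In this form, increasing N alone or D alone gives diminishing returns, and E is the part of the loss that neither more parameters nor more data removes.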
* A front-end 'tokenizer' divides user input ('prompt') into a sequence
t_1, ..., t_n of tokens;
* the LLM computes a most probable next token t_{n+1},
then t_{n+2} that's most likely to follow t_1, ..., t_{n+1}, and so on,
according to the conditional probability
P(t_{n+1} | t_1, ..., t_n),
encoded by some function F(t, α)
controlled by a parameter vector α.
* Dimension of α might be 10^9 - 10^12 !!!
Size of training set of order 10^15 tokens!!!
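The tokenize-then-predict loop described above can be imitated with a toy model. This sketch makes strong simplifying assumptions: the tokenizer is a plain whitespace split (real LLMs use subword tokens), the conditional probability depends only on the last token (a bigram model, not a neural network), and the "parameters" are just counts from a made-up corpus.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Hypothetical tokenizer: lowercase, split on whitespace."""
    return text.lower().split()

def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Estimate P(next token | previous token) from the counts."""
    total = sum(counts[prev].values())
    return {t: c / total for t, c in counts[prev].items()}

def generate(counts, prompt, n_new):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = tokenize(prompt)
    for _ in range(n_new):
        probs = next_token_probs(counts, tokens[-1])
        if not probs:            # token never seen with a successor: stop
            break
        tokens.append(max(probs, key=probs.get))
    return tokens

corpus = "the cat sat on the mat . the cat ate the fish ."
counts = fit_bigram(tokenize(corpus))
probs = next_token_probs(counts, "the")
out = generate(counts, "the", 3)
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so `probs` gives "cat" probability 0.5 and greedy decoding picks it first. An LLM replaces the count table by the function F(t, α) and conditions on the whole preceding sequence, and in practice the next token is often sampled from the distribution instead of always taking the maximum.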
Big problems:
* 'hallucinations'
* unimaginable deception could be unleashed... (deep fakes)
* security: they are vulnerable to attacks that overrun their safeguards:
Jailbreak, prompt injection, data poisoning, escalation, ...
* huge energy consumption and cost for training, a crucial issue...
estimated cost of training the weights: OpenAI GPT-4: $78M, Google Gemini: $191M !!!
Science, Promise and Peril in the Age of AI
Quanta Mag. 2025 (excellent, articles on all aspects of AI)
Summary of Artificial Intelligence and Future of Work, NAS report, Apr. 2025
video: in depth interview with NVIDIA CEO Jensen Huang:
Vision for the Future, Jan.2025, very informative!
Good overview papers
Weinan E,
AI for Science SIAM News, Dec 2023:
exceptional, describes core ideas, impact on Science
Fabian Merle,
ML Estimators: Implementation and Comparison in Python
Comput. Methods Appl. Math. 2025
and the best we can hope for towards AGI:
video: Demis Hassabis on TED,
How AI Is Unlocking the Secrets of Nature and the Universe , May 2024
(head of Google DeepMind, 2024 Nobel Prize recipient!)