AncientML is a series of paper reading notes. The purpose is to review outstanding contributions to machine learning that were valuable to its formation as an academic field.

Some rules about the papers:

- have at least 500 citations
- be old enough that industry ML researchers and engineers can discuss them without a conflict of interest
- have had enough impact on academia to be considered valuable to teach

These notes are not meant as summaries, but rather to inspire reading the papers themselves and discussing them in person.

## A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955 (McCarthy et al., 2006), PDF

- The paper/event that gets credited with founding the field of Artificial Intelligence research.
- The paper is three pages long and the authors include Claude Shannon.
- scale of the proposed project: 2 months, 10 men
- focused on language, abstraction and concepts
- identifies seven areas to improve: Automatic Computers, How Can a Computer be Programmed to Use a Language, Neuron Nets, Theory of the Size of a Calculation, Self-Improvement, Abstractions, Randomness and Creativity
- “the major obstacle is not lack of machine capacity, but our inability to write programs”
- There is a Wikipedia article on the Dartmouth workshop.
- 102 pages of Ray Solomonoff’s handwritten notes, including some doodles on page 3.

## The Mathematical Theory of Communication (Shannon et al., 1951), PDF

- Central paper for many fields. 90 pages (skip the part by Weaver).
- *The Idea Factory* (Gertner, 2012) is a book about Bell Labs around that time.
- Khinchin (1957) is a book that discusses this paper.
- p.49: *information* is not attached to a particular message but to the amount of freedom of choice
- p.49: “decomposition of choice” is a beautiful requirement for \(H\), and together with the other two requirements leads to a unique form for \(H\)
- p.50: a simple example visualizing the connection between the probability of a message and information (figure not reproduced here)
- p.53: origin for terms of the form \(p_i\log{}p_i\)
- p.56: relative entropy, maximum possible compression, redundancy
- p.70: capacity of a noisy channel; includes a `max()` over all possible information sources
- …
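The entropy notes above can be sketched in a few lines of Python. This is a minimal illustration of my own, not code from the paper; the decomposition example uses Shannon's probabilities {1/2, 1/3, 1/6} from the discussion around p.49.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p_i log2 p_i, in bits (with 0 log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Decomposition of choice: choosing among {1/2, 1/3, 1/6} directly
# carries the same information as first choosing between two halves,
# then (half the time) choosing between 2/3 and 1/3.
lhs = entropy([1/2, 1/3, 1/6])
rhs = entropy([1/2, 1/2]) + 1/2 * entropy([2/3, 1/3])
assert abs(lhs - rhs) < 1e-12

# Relative entropy and redundancy (p.56): H relative to its maximum
# log2(n) for an n-symbol alphabet; redundancy is one minus that ratio.
def redundancy(probs):
    return 1 - entropy(probs) / math.log2(len(probs))
```

For a uniform source the redundancy is zero; any skew in the probabilities makes it positive, which is what bounds the maximum possible compression.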

## Backlog

- Multidimensional binary search trees used for associative searching (Bentley, 1975), PDF
- RBM predecessor Harmonium: Information processing in dynamical systems: Foundations of harmony theory (Smolensky, 1986), PDF
- Reducing the Dimensionality of Data (Hinton and Salakhutdinov, 2006), PDF
- Online Convex Programming and Generalized Infinitesimal Gradient Ascent (Zinkevich, 2003), PDF
- Supervised Sequence Labelling with Recurrent Neural Networks (Graves, 2012), PDF
- High-speed tracking with kernelized correlation filters (Henriques et al., 2015), PDF

Similar resources: @shakir_za tweets a series called “Sunday Classic Paper”.

## Bibliography

Jon Louis Bentley.
Multidimensional binary search trees used for associative searching.
*Communications of the ACM*, 18(9):509–517, 1975. ↩

Jon Gertner.
*The Idea Factory: Bell Labs and the great age of American innovation*.
Penguin Press, New York, 2012.
ISBN 978-0143122791. ↩

Alex Graves.
Supervised sequence labelling.
In *Supervised sequence labelling with recurrent neural networks*, pages 5–13.
Springer, 2012. ↩

João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista.
High-speed tracking with kernelized correlation filters.
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, 37(3):583–596, 2015. ↩

Geoffrey E Hinton and Ruslan R Salakhutdinov.
Reducing the dimensionality of data with neural networks.
*Science*, 313(5786):504–507, 2006. ↩

A Khinchin.
*Mathematical foundations of information theory*.
Dover Publications, New York, 1957.
ISBN 978-0486604343. ↩

John McCarthy, Marvin L Minsky, Nathaniel Rochester, and Claude E Shannon.
A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955.
*AI magazine*, 27(4):12, 2006. ↩

Claude E Shannon, Warren Weaver, and Arthur W Burks.
The Mathematical Theory of Communication.
*The University of Illinois Press*, 1951. ↩

Paul Smolensky.
Information processing in dynamical systems: foundations of harmony theory.
Technical Report, Colorado University at Boulder Dept of Computer Science, 1986. ↩

Martin Zinkevich.
Online convex programming and generalized infinitesimal gradient ascent.
In *Proceedings of the 20th International Conference on Machine Learning (ICML-03)*, pages 928–936, 2003. ↩