Sven Kreisshttps://www.svenkreiss.com/2020-04-03T00:00:00+02:00Achievements in Two Years at EPFL2020-04-03T00:00:00+02:002020-04-03T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2020-04-03:/blog/epfl-two-years/<p>Summary of achievements in my first two years back in academia at EPFL.</p><p><img class="image-process-crisp" src="/images/twoyearsepfl_crowd_pifpaf.png" alt="PifPaf crowd" /></p>
<p>Two years ago, I returned from New York City to Geneva. I had left my
industry job and returned to academia as a postdoc in Computer Vision at
EPFL. It has been an exciting two years for me in Alexandre Alahi's lab
for <a href="https://www.epfl.ch/labs/vita/">Visual Intelligence for Transportation</a>.
When entering academia from industry, I had no research
or publication momentum. As a starting point,
I tried to make connections to my previous work in physics,
particularly in statistical and physical modeling.
This led me to focus on <em>Composite Fields</em> for human pose estimation.</p>
<h2>Past</h2>
<p>In an extremely short time, we created three
exciting projects with novel scientific contributions and state-of-the-art
results on challenging tasks like human pose estimation. These papers were
presented at ICRA, CVPR and ICCV, the top-tier conferences
in robotics and computer vision.</p>
<ul>
<li><strong>Crowd-Robot Interaction</strong>: Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning.<br />
Presented at ICRA2019 in Montreal. Led by Changan Chen and Yuejiang Liu.<br />
<span style="white-space: nowrap"><a href="https://doi.org/10.1109/ICRA.2019.8794134"><i class="fa fa-file"></i> paper</a></span>
<span style="white-space: nowrap"><a href="https://github.com/vita-epfl/crowdnav"><i class="fa fa-github"></i> code</a></span></li>
<li><strong>PifPaf</strong>: Composite Fields for Human Pose Estimation.<br />
Presented at CVPR2019 in Long Beach.<br />
<span style="white-space: nowrap"><a href="http://openaccess.thecvf.com/content_CVPR_2019/html/Kreiss_PifPaf_Composite_Fields_for_Human_Pose_Estimation_CVPR_2019_paper.html"><i class="fa fa-file"></i> paper</a></span>
<span style="white-space: nowrap"><a href="https://github.com/vita-epfl/openpifpaf"><i class="fa fa-github"></i> code</a></span></li>
<li><strong>MonoLoco</strong>: Monocular 3D Pedestrian Localization and Uncertainty Estimation.<br />
Presented at ICCV2019 in Seoul. Led by Lorenzo Bertoni.<br />
<span style="white-space: nowrap"><a href="http://openaccess.thecvf.com/content_ICCV_2019/html/Bertoni_MonoLoco_Monocular_3D_Pedestrian_Localization_and_Uncertainty_Estimation_ICCV_2019_paper.html"><i class="fa fa-file"></i> paper</a></span>
<span style="white-space: nowrap"><a href="https://github.com/vita-epfl/monoloco"><i class="fa fa-github"></i> code</a></span></li>
</ul>
<p>I also presented an overview of our
work at the Autonomous Driving workshop at ICML2019 in Long Beach, at KAIST
in South Korea, and at the European and Swiss transportation
conferences hEART (Budapest) and STRC (Ascona).</p>
<p><a href="https://github.com/vita-epfl/openpifpaf">PifPaf</a> has also attracted interest from
industry. Setting up a commercialization
pipeline with EPFL's Technology Transfer Office was another exciting first.</p>
<p>At EPFL, it is also possible to gain professor-style teaching experience
as a postdoc through co-teaching. Alex and I are currently co-teaching
<em>Deep Learning for Autonomous Vehicles</em> for the second time to master's students.
The course project ends with a race of human-robot pairs: the student runs in front,
and the robot has to track and follow as fast as possible.
It was <a href="https://www.epfl.ch/labs/vita/teaching/">quite an event last year</a>
(<a href="https://www.youtube.com/watch?v=3AnXPqoIfvU">video</a>).
Last semester, we
gave an introductory course in machine learning tailored for engineering students
in the bachelor program.</p>
<h2>Current</h2>
<p>I have three papers under review that extend my
previous work on human pose estimation, as well as joint work on pedestrian localization and metric learning. All of these are very exciting, and I am looking forward to sharing more soon.</p>
<p>Another new process for me was applying for grants. I successfully obtained
a <em>Spark Grant</em> from the Swiss National Science Foundation for the
development of <em>Deep Social Force</em>, which funds the project for one year -- my
first time being a PI on a grant.
I also drafted our lab's contribution to the
<a href="https://www.epfl.ch/research/services/fund-research/funding-opportunities/research-funding/interdisciplinary-seed-fund/isf-granted-projects/">EPFL Interdisciplinary Seed Fund</a>
project to use
<em>Artificial Intelligence for understanding and overcoming motor deficits</em>
in collaboration with two labs at the hospital in Lausanne
that work on Parkinson's disease and spinal cord injury.</p>
<p>I am also looking forward to ICML this year. It is my first time
co-organizing a workshop, the workshop on <em>AI for Autonomous Driving</em>, which
had 400 attendees last year. Luckily, I am surrounded by an experienced team
and I am looking forward to learning. Holding this workshop entirely online will
be a first for everyone, though.</p>
<p>After two years back in academia, I am still not bored. Not at all.
<strong>Thanks VITA team!</strong> Onwards.</p>Artisanal S2 Cells2018-11-28T00:00:00+01:002018-11-28T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-11-28:/blog/artisanal-s2/<p>S2 cells are a powerful tool to index and compute over geospatial data. This short post shows how to create S2 cell ids by hand.</p><p>S2 cells are a great representation for locations on earth and are the storage
format for many popular web services, including Google Maps, Foursquare and Pokémon Go.
I worked on the <a href="https://s2sphere.readthedocs.io">s2sphere</a> Python implementation when I was at Sidewalk Labs. I also wrote some
background notes on the <a href="https://www.sidewalklabs.com/blog/s2-cells-and-space-filling-curves-keys-to-building-better-digital-map-tools-for-cities/">Sidewalk blog</a>.
Recently, I was asked to explain
how to verify manually that an S2 cell id is correct. By hand. From scratch.
Here is my answer for the simplest ids.</p>
<p>Get a feel for the cells and faces of the cube by using the web tools
at <a href="https://s2.sidewalklabs.com">s2.sidewalklabs.com</a>:</p>
<p><img style="width:58.5%" src="/images/s2cell_regioncoverer.png" alt="S2Cell region coverer with example cell tokens" />
<img style="width:39.7%" src="/images/s2cell_globe.png" alt="S2Cell globe" /></p>
<p>The above images show the location of face 0 on earth. Below, the unfolded
cube shows how the space-filling curve spans and connects the faces:</p>
<p><img style="max-height: 15em; display:block; margin:1em auto 2em;" src="/images/s2cell_faces.png" alt="S2Cell cube faces" /></p>
<p>Tokens are the hex form of the cell id with trailing zeros stripped. To recover
the integer, use
<a href="https://s2sphere.readthedocs.io/en/latest/api.html#s2sphere.CellId.from_token"><code>from_token()</code></a>
and print it as binary. The example tokens converted to binary cell ids are:</p>
<div class="highlight"><pre><span></span><code>token "04": 0000010000000000000000000000000000000000000000000000000000000000
token "0c": 0000110000000000000000000000000000000000000000000000000000000000
token "14": 0001010000000000000000000000000000000000000000000000000000000000
token "1c": 0001110000000000000000000000000000000000000000000000000000000000
</code></pre></div>
<p>The binary format from left to right: three bits for the face (here face 0), two bits encoding the level-1 cell on that face, and a terminating 1 bit. This agrees with the docstring of the
<a href="https://s2sphere.readthedocs.io/en/latest/api.html#s2sphere.CellId">CellId</a> class :)</p>
<p>For comparison, the face 1, level 1 cell ids are:</p>
<div class="highlight"><pre><span></span><code><span class="mf">0010010000000000000000000000000000000000000000000000000000000000</span>
<span class="mf">0010110000000000000000000000000000000000000000000000000000000000</span>
<span class="mf">0011010000000000000000000000000000000000000000000000000000000000</span>
<span class="mf">0011110000000000000000000000000000000000000000000000000000000000</span>
</code></pre></div>
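<p>This hand conversion can be sketched in a few lines of plain Python (no <code>s2sphere</code> required); the helper names are made up for illustration:</p>

```python
def token_to_cell_id(token):
    """A token is the hex cell id with trailing zeros stripped; pad it back."""
    return int(token.ljust(16, "0"), 16)


def face_and_position(token):
    """Read off the face (first 3 bits) and level-1 child (next 2 bits)."""
    bits = format(token_to_cell_id(token), "064b")
    return int(bits[:3], 2), int(bits[3:5], 2)


print(format(token_to_cell_id("04"), "064b"))
# 0000010000000000000000000000000000000000000000000000000000000000
print(face_and_position("1c"))  # (0, 3): face 0, last level-1 cell
print(face_and_position("24"))  # (1, 0): face 1, first level-1 cell
```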
<p>Et voilà. Those are eight hand-crafted S2 cell ids.</p>Python at a University Lab2018-11-23T00:00:00+01:002018-11-23T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-11-23:/blog/lab-python/<p>University research labs have a different structure than corporate research labs and their compute setups. In this post, I list useful Python resources for starting in a university machine learning lab.</p><p><img class="img-thumbnail float-right" width=200 src="/images/python-powered-university.png" alt="Python powered University" /></p>
<p><em>I am a post-doc at <a href="https://vita.epfl.ch/">VITA lab at EPFL university</a>.</em></p>
<p>Good software engineering practices are important to me. They are foundational
for open and reproducible science. You -- a current or future
member of a machine learning lab -- are not necessarily expected to be a computer scientist
or software engineer, but you must be familiar with the tools.
Expert-level familiar. You will use these tools most of the day,
and you should not be clumsy with or afraid of them.
Below is an ambitious list, so don't worry if you are not familiar with most of it yet.</p>
<p><strong>Python Core:</strong>
The best way to get started and a great resource for advanced users is
the <a href="https://docs.python.org/3/tutorial/index.html">tutorial on the official Python webpage</a>.
Here are my notes more specific to ML.</p>
<ul>
<li>always use Python 3 now, not Python 2</li>
<li>read Python code of great projects outside of ML (almost no ML projects are written well, see closing remarks):
start with <a href="https://github.com/requests/requests"><code>requests</code></a></li>
<li>Python core and standard libraries (always available, so use them):<ul>
<li><a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions">list comprehensions</a></li>
<li><a href="https://docs.python.org/3/tutorial/classes.html">classes</a>:
when your function becomes too long it might want to be a class</li>
<li><a href="https://docs.python.org/3/tutorial/datastructures.html#sets">sets</a>:
avoid testing whether an element is in a list, test whether it is in a set</li>
<li><a href="https://docs.python.org/3/library/argparse.html">argparse</a>, use a standard library to parse command line arguments</li>
<li><a href="https://docs.python.org/3/library/logging.html#module-logging">logging</a></li>
<li><a href="https://docs.python.org/3/library/collections.html#collections.defaultdict"><code>defaultdict</code></a>
for sparse accumulators</li>
<li>generators and <code>yield</code>, see <a href="https://docs.python.org/3/howto/functional.html">functional programming howto</a></li>
<li>all <a href="https://docs.python.org/3/library/itertools.html"><code>itertools</code></a>
(e.g. <a href="https://docs.python.org/3/library/itertools.html#itertools.chain.from_iterable"><code>itertools.chain.from_iterable</code></a>)</li>
<li><a href="https://docs.python.org/3/library/functools.html#functools.lru_cache"><code>functools.lru_cache</code></a> is a powerful tool</li>
<li><a href="https://docs.python.org/3/library/functions.html#property"><code>property</code> decorator</a> is great for adding caches</li>
<li>more advanced but useful: context managers, decorators</li>
</ul>
</li>
<li>understand stack traces (the basics: the line above is where the line below originated), <a href="https://en.wikipedia.org/wiki/Stack_trace">Wikipedia</a></li>
<li>scientific foundational packages:
<a href="http://www.numpy.org/"><code>numpy</code></a>,
<a href="https://scipy.org/getting-started.html"><code>scipy</code></a>,
<a href="https://scikit-learn.org/stable/"><code>scikit-learn</code></a>
and depending on your focus <a href="https://cython.org/"><code>cython</code></a></li>
</ul>
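<p>A short sketch to make a few of these concrete (set membership, a <code>defaultdict</code> accumulator, <code>itertools.chain.from_iterable</code> and <code>functools.lru_cache</code>); the data is made up:</p>

```python
import itertools
from collections import defaultdict
from functools import lru_cache

words = ["cat", "dog", "cat", "bird", "dog", "cat"]

# sets: constant-time membership tests instead of scanning a list
vocabulary = set(words)
print("cat" in vocabulary)  # True

# defaultdict as a sparse accumulator
counts = defaultdict(int)
for word in words:
    counts[word] += 1
print(counts["cat"])  # 3

# flatten one level of nesting
nested = [[1, 2], [3], [4, 5]]
print(list(itertools.chain.from_iterable(nested)))  # [1, 2, 3, 4, 5]

# lru_cache memoizes pure functions
@lru_cache(maxsize=None)
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)
print(fibonacci(30))  # 832040
```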
<p><strong>Packaging:</strong>
As soon as you need to re-use code in a second file, you will want to import
one of your own files. You should use relative imports, and to use those
properly you need to package your code. Packaging in Python used to be a mess,
but much progress has been made in recent years.
When you search for help related
to packaging, filter for results from the last year.</p>
<ul>
<li><code>pip</code>, I have never used <code>conda</code></li>
<li>relative imports: see <a href="https://docs.python.org/3/tutorial/modules.html">tutorial section on Modules</a></li>
<li><code>setup.py</code>: see <a href="https://setuptools.readthedocs.io/en/latest/setuptools.html#basic-use"><code>setuptools</code> basic use</a></li>
<li><a href="https://docs.python.org/3/tutorial/venv.html"><code>venv</code></a>: I never do anything outside a virtual environment</li>
<li><a href="https://github.com/jupyter/notebook">Jupyter notebooks</a>:
great for demos and log books. <em>Never</em> for writing code.</li>
<li><a href="https://mybinder.org/">mybinder.org</a> to make your Jupyter demos interactive</li>
</ul>
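<p>Relative imports only work from inside a package. The following self-contained sketch (the package and module names are invented) writes a two-module package to a temporary directory and imports across it:</p>

```python
import os
import sys
import tempfile

# layout:  mylab/__init__.py, mylab/metrics.py, mylab/train.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mylab")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "metrics.py"), "w") as f:
    f.write("def accuracy(correct, total):\n    return correct / total\n")
with open(os.path.join(pkg, "train.py"), "w") as f:
    # the leading dot refers to the containing package "mylab"
    f.write("from .metrics import accuracy\n"
            "def report():\n"
            "    return accuracy(9, 10)\n")

sys.path.insert(0, root)
from mylab.train import report  # the relative import resolves inside the package
print(report())  # 0.9
```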
<p><strong>Style / Unit Tests / Continuous Testing:</strong>
Standard practices for software engineers are useful for ML projects, too.
Beyond testing code, unit tests in combination with continuous integration give
a robust and reproducible starting point for anyone picking up your project
(including yourself in one year).</p>
<ul>
<li><a href="https://www.python.org/dev/peps/pep-0008/">PEP8</a> is the <em>Style Guide for Python</em> and contains explanations. Don't skip <a href="https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds"><em>A Foolish Consistency is the Hobgoblin of Little Minds</em></a>. Still, I follow almost all of PEP8 to the letter. Additional:<ul>
<li>no abbreviations for variables, classes, functions, etc.</li>
<li>do not iterate over indices in Python: don't do <code>for i in range(len(mylist))</code></li>
</ul>
</li>
<li><a href="https://www.pylint.org/"><code>pylint</code></a> generally provides good advice beyond checking for PEP8</li>
<li><a href="https://docs.pytest.org/en/latest/"><code>pytest</code></a> to run all your tests</li>
<li><a href="https://circleci.com/">CircleCI</a> and <a href="https://travis-ci.org/">TravisCI</a> can automatically run your tests on every commit to git</li>
</ul>
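<p>For the last point, an illustration of the alternatives to iterating over indices (the variable names are just examples):</p>

```python
names = ["anna", "ben", "cara"]
scores = [0.9, 0.7, 0.8]

# avoid: for i in range(len(names)): use(names[i], scores[i])

# iterate over elements directly, pairing lists with zip
pairs = [(name, score) for name, score in zip(names, scores)]
print(pairs)  # [('anna', 0.9), ('ben', 0.7), ('cara', 0.8)]

# when the position is genuinely needed, use enumerate
indexed = [(i, name) for i, name in enumerate(names)]
print(indexed)  # [(0, 'anna'), (1, 'ben'), (2, 'cara')]
```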
<p><strong>Red Flags:</strong>
When browsing open source code, this is when I get worried.</p>
<ul>
<li>no eval metrics/scripts for that implementation</li>
<li>have to change the <code>PATH</code> or <code>PYTHONPATH</code> variables</li>
<li>have to copy or symlink folders with code</li>
</ul>
<p><strong>Closing:</strong>
Infrastructure at large companies is different from a university lab and
your laptop: containers, distributed storage, build systems, custom DNS,
mono-repositories. A small piece from the middle of that stack is open-sourced
and duct-taped to work without the rest of the infrastructure.
Don't blindly follow what seems like a standard practice of the best software
engineers in the world. It might be an artifact of porting to the
open source world.</p>
<p>Feedback is welcome on Twitter @svenkreiss.</p>AncientML 12018-03-05T00:00:00+01:002018-03-05T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-03-05:/blog/ancientml-2018-03/<p>AncientML is a series of paper reading notes. This first edition covers the first mention of AI and the Mathematical Theory of Communication.</p><p><img class="img-thumbnail float-right" src="/images/ancientml-logo.png" width="300" alt="AncientML Logo" /></p>
<p>AncientML is a series of paper reading notes. The purpose is to review
outstanding contributions that were formative for machine learning
as an academic field.</p>
<div style="clear:both"></div>
<h2>A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955 <a href='#mccarthy2006proposal' id='ref-mccarthy2006proposal-1'>(McCarthy et al., 2006)</a>, <a href="https://www.aaai.org/ojs/index.php/aimagazine/article/download/1904/1802">PDF</a></h2>
<ul>
<li>The paper/event that gets credited with the foundation of the field of Artificial Intelligence research.</li>
<li>The paper is three pages long and the authors include Claude Shannon.</li>
<li>scale of the proposed project: 2 months, 10 men</li>
<li>focused on language, abstraction and concepts</li>
<li>identifies seven areas to improve: Automatic Computers, How Can a Computer be
Programmed to Use a Language, Neuron Nets, Theory of the Size of a Calculation,
Self-Improvement, Abstractions, Randomness and Creativity</li>
<li>"the major obstacle is not lack of machine capacity, but our inability to write programs"</li>
<li>There is a Wikipedia article on the <a href="https://en.wikipedia.org/wiki/Dartmouth_workshop">Dartmouth workshop</a>.</li>
<li>102 pages of Ray Solomonoff's
<a href="http://raysolomonoff.com/dartmouth/notebook/notebook.html">handwritten notes</a>,
including some doodles on page 3.</li>
</ul>
<h2>The Mathematical Theory of Communication <a href='#shannon1951mathematical' id='ref-shannon1951mathematical-1'>(Shannon et al., 1951)</a>, <a href="http://pubman.mpdl.mpg.de/pubman/item/escidoc:2383164/component/escidoc:2383163/Shannon_Weaver_1949_Mathematical.pdf">PDF</a></h2>
<ul>
<li>Central paper for many fields. 90 pages (skip the part by Weaver).</li>
<li><em>The Idea Factory</em> <a href='#ideafactory2012gertner' id='ref-ideafactory2012gertner-1'>(Gertner, 2012)</a> is a book about Bell Labs around that time.</li>
<li><a href='#khinchin1957mathematical' id='ref-khinchin1957mathematical-1'>Khinchin (1957)</a> is a book that discusses this paper.</li>
<li>p.49: <em>information</em> is not attached to a particular message but to the amount of
freedom of choice</li>
<li>p.49: "decomposition of choice" is a beautiful requirement for <span class="math">\(H\)</span>, and leads with
the other two requirements to a unique form for <span class="math">\(H\)</span></li>
<li>p.50: simple example to visualize the connection between probability of a message and information is shown in the figure below</li>
<li>p.53: origin for terms of the form <span class="math">\(p_i\log{}p_i\)</span></li>
<li>p.56: relative entropy, maximum possible compression, redundancy</li>
<li>p.70: capacity of a noisy channel; includes a <code>max()</code> over all possible information sources</li>
</ul><hr>
<h2>Bibliography</h2>
<p id='ideafactory2012gertner'>Jon Gertner.
<em><span class="bibtex-protected">The Idea Factory: Bell Labs and the great age of American innovation</span></em>.
Penguin Press, New York, 2012.
ISBN 978-0143122791. <a class="cite-backref" href="#ref-ideafactory2012gertner-1" title="Jump back to reference 1">↩</a></p>
<p id='khinchin1957mathematical'>A Khinchin.
<em>Mathematical foundations of information theory</em>.
Dover Publications, New York, 1957.
ISBN 978-0486604343. <a class="cite-backref" href="#ref-khinchin1957mathematical-1" title="Jump back to reference 1">↩</a></p>
<p id='mccarthy2006proposal'>John McCarthy, Marvin L Minsky, Nathaniel Rochester, and Claude E Shannon.
<span class="bibtex-protected">A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955</span>.
<em>AI magazine</em>, 27(4):12, 2006. <a class="cite-backref" href="#ref-mccarthy2006proposal-1" title="Jump back to reference 1">↩</a></p>
<p id='shannon1951mathematical'>Claude E Shannon, Warren Weaver, and Arthur W Burks.
<span class="bibtex-protected">The Mathematical Theory of Communication</span>.
<em>The University of Illinois Press</em>, 1951. <a class="cite-backref" href="#ref-shannon1951mathematical-1" title="Jump back to reference 1">↩</a></p>
pelican-jsmath Plugin2018-02-18T00:00:00+01:002018-02-18T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-02-18:/blog/pelican-jsmath/<p>An <span class="math">\(\alpha\omega\epsilon s \sigma m \epsilon\)</span> Pelican plugin to render math in JavaScript libraries like KaTeX. This plugin makes sure that equations are preserved in the Markdown and Restructured Text parsers and get reproduced properly in HTML for a JavaScript renderer to process.</p><p>The new plugin is <span class="math">\(\alpha\omega\epsilon s \sigma m \epsilon\)</span>, particularly
in combination with KaTeX which is used here. It has good support for big
equations: </p>
<div class="math">$$E=mc^2$$</div>
<p>
A vector <span class="math">\(\vec{a}\)</span> looks beautiful. Writing
order of magnitudes with <span class="math">\(\mathcal{O}(n)\)</span> is pretty. There was a related
<a href="https://github.com/getpelican/pelican-plugins/issues/625">Pelican issue</a>
for support for KaTeX.</p>
<p>The plugin is packaged and can be installed with <code>pip install pelican-jsmath</code>
(with a dash) and then added to Pelican in <code>pelicanconf.py</code> by adding
<code>'pelican_jsmath'</code> (with an underscore) to your <code>PLUGINS</code> list. See the
<a href="https://github.com/svenkreiss/pelican-jsmath">Readme</a> for more details.</p>
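<p>In <code>pelicanconf.py</code>, this looks like the following (a minimal sketch; your <code>PLUGINS</code> list may contain other entries):</p>

```python
# pelicanconf.py (sketch): note the underscore in the module name,
# even though the pip package name uses a dash
PLUGINS = [
    'pelican_jsmath',
]
```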
<h2>Packaging Pelican Plugins</h2>
<p>It is great that Pelican supports plugins installed via <code>pip</code> from outside the
plugins directory. It gives the plugin author and user more control over the
plugin version. That is why I wanted to document the steps I took to make
<code>pelican-jsmath</code> a Python package.</p>
<p>The simplest <code>setup.py</code> file:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">setuptools</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="n">setup</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'pelican-jsmath'</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s1">'0.1.0'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'Pelican Plugin that passes math to JavaScript.'</span><span class="p">,</span>
<span class="n">url</span><span class="o">=</span><span class="s1">'http://github.com/svenkreiss/pelican-jsmath'</span><span class="p">,</span>
<span class="n">author</span><span class="o">=</span><span class="s1">'Sven Kreiss'</span><span class="p">,</span>
<span class="n">author_email</span><span class="o">=</span><span class="s1">'me@svenkreiss.com'</span><span class="p">,</span>
<span class="n">license</span><span class="o">=</span><span class="s1">'AGPL-3.0'</span><span class="p">,</span>
<span class="n">packages</span><span class="o">=</span><span class="p">[</span><span class="s1">'pelican_jsmath'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>
<p>If you are converting a plugin from the pelican-plugins repository, move your files
into a folder, here <code>pelican_jsmath</code>, and add a <code>setup.py</code> file. That's it.
You can submit it to pypi if you want, but you can also tell people to install
directly from Github using
<code>pip install https://github.com/svenkreiss/pelican-jsmath/zipball/master</code>.</p>
<p>With packaged plugins, you can manage your dependencies in your
<code>requirements.txt</code> as usual.</p>
<h2>Testing Packaged Plugins</h2>
<p>This Pelican plugin includes a plugin for the Python Markdown parser that
modifies the HTML output. It is good to test that this plugin produces valid HTML.
The repository includes an example Pelican site which is regenerated on every
commit and validated with
<a href="https://github.com/svenkreiss/html5validator">html5validator</a>.</p>
<h2>Sample</h2>
<p><img class="image-process-crisp top" alt="pelican-jsmath sample" src="/images/pelican_jsmath_sample.png" /></p>Pelican 20182018-02-10T00:00:00+01:002018-02-10T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-02-10:/blog/pelican-2018/<p>Updates to Pelican and this blog. This is a summary of theme changes, a list of my favorite plugins, and a summary of plugins that I updated to improve this website. It also contains a short discussion of the pelican-plugins repository and its potential consequences for the lacking popularity of individual plugins.</p><p><img class="image-process-crisp top" alt="screenshot of this blog" src="/images/pelican_screenshot_Feb2018.png" /></p>
<p>Inspired by Fred Wilson's post about <a href="http://avc.com/2018/01/owning-yourself/">Owning Yourself</a>,
I revived my Pelican blog.
All my website's
<a href="https://github.com/svenkreiss/svenkreiss.github.io/tree/pelican">configuration and source files</a>
are public as well as
<a href="https://github.com/svenkreiss/pure">my modifications to the pure theme</a>.
Reasons for Pelican compared to hosted solutions:</p>
<ul>
<li>content under version control in git</li>
<li>my usual text editor for content creation</li>
<li>fully owning content and its exact presentation</li>
<li>complete customizability, therefore I want Python, therefore Pelican</li>
</ul>
<p>A cost is that I have to contribute some changes myself:</p>
<ul>
<li><a href="https://github.com/svenkreiss/pure">customized Pure theme</a>: print mode, less author mentions, responsive resizing for mobile, using the pygments <code>friendly</code> style for code highlighting</li>
<li><code>pelican_jsmath</code>: using KaTeX: nothing existed to combine it with Pelican and so I created
<a href="https://github.com/svenkreiss/pelican-jsmath">pelican-jsmath</a>.
The new plugin is <span class="math">\(\alpha\omega\epsilon s \sigma m \epsilon\)</span> and described in a separate
<a href="https://www.svenkreiss.com/blog/pelican-jsmath/">blog post</a>.</li>
<li><code>pelican-cite</code>: create a nicely formatted Bibliography from a bibtex file. I created <a href="https://github.com/cmacmackin/pelican-cite/pull/5">PR#5</a> so that it also works on draft pages.</li>
<li><code>image_process</code>: Responsive images (smaller images on smaller devices) are
especially important for the <a href="/projects.html">projects</a> page and on article index page.
The plugin works on all generated files just before they are written, so you can use it everywhere in your theme as well.</li>
<li><code>pelican-advance-embed-tweet</code>: submitted <a href="https://github.com/fundor333/pelican-advance-embed-tweet/pull/2">PR#2</a> to remove the align attribute from <code>&lt;blockquote&gt;</code>, which is not HTML5; you can instead set <code>TWITTER_ALIGN = 'center'</code> in your <code>pelicanconf</code> to center the embedded tweet.</li>
<li><code>gravatar</code>: request higher resolution Gravatar images by adding <code>?s=140</code> to the image url in the theme</li>
<li><code>related_posts</code>: newly added to this blog</li>
<li><code>representative_image</code>: automatically extract an image from an article and use it in article list with <code>image_process</code> to thumbnail</li>
<li><code>pelican_dynamic</code>: adds options to add per article custom <code>css</code> and <code>js</code></li>
</ul>
<p>Open and in-progress issues:</p>
<ul>
<li>hack of the day: make your default status <code>draft</code> (generally a good idea), but then set the default status in <code>publishconf.py</code> to <code>hidden</code>. When trying to publish, <code>hidden</code> will create an error and the file will be skipped, so the article won't exist on the web at all until its status is set to <code>draft</code>. Problem: the <code>OUTPUTDIR</code> in the Makefile has to be split so that two different directories are used for <code>make devserver</code> and <code>make publish</code> (filed <a href="https://github.com/getpelican/pelican/issues/2284">issue#2284</a> against Pelican; see <a href="https://github.com/svenkreiss/svenkreiss.github.io/blob/pelican/Makefile">Makefile</a>).</li>
<li>Have to wrap Tweet embeds in <code>&lt;div&gt;</code> to avoid Markdown's <code>&lt;p&gt;</code> tag, because the embedded tweet includes a <code>&lt;blockquote&gt;</code> which cannot appear inside a <code>&lt;p&gt;</code> tag.</li>
</ul>
<p>Always validate html with <a href="https://github.com/svenkreiss/html5validator">html5validator</a>
and check links with the <a href="https://validator.w3.org/checklink">W3C Link Checker</a>.</p>
<h2>Pelican Plugins</h2>
<p>Most Pelican Plugins are distributed through the central
<a href="https://github.com/getpelican/pelican-plugins">pelican-plugins</a> repository.
The contributed plugins receive fairly little recognition: <code>render_math</code> has 65 stars and
<code>image_process</code> has 10 stars. Issues against individual plugins are filed
in the central repository rather than against the individual plugins, and updates to
plugins have to wait for inclusion in the central repository before they reach
users.</p>
<!-- <div>@svenkreiss/status/960716731059785730</div> -->
<p>Python provides its standard mechanism with <code>pip</code>, <code>setuptools</code>, <code>requirements.txt</code>, etc.
to manage dependencies. Plugins can be written to support <code>pip</code> and Pelican
does support importable plugins. This also allows unit tests and continuous
integration to ensure the quality of the plugin.
This is the method I chose for <a href="https://github.com/svenkreiss/pelican-jsmath">pelican-jsmath</a>.</p>Stream Processing in pysparkling2017-03-11T00:00:00+01:002017-03-11T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2017-03-11:/blog/streamprocessing-in-pysparkling/<p>pysparkling now supports stream processing with discrete streams, called DStream. This post shows a simple example that uses this new API.</p><p><code>pysparkling</code> is a native Python implementation of PySpark. Stream processing
is considered to be one of the most important features of Spark. PySpark
provides a Python interface to Spark’s
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html">StreamingContext</a>
and supports consuming from updating HDFS folders and TCP sockets and provides
interfaces to Kafka, Kinesis, Flume and MQTT. Initial support for stream
processing from folders and TCP sockets is in <code>pysparkling 0.4.0</code> which you can
now install with:</p>
<div class="highlight"><pre><span></span><code>pip install --upgrade pysparkling
</code></pre></div>
<h2>Counting Example</h2>
<p>In the normal batch processing way, you can count elements with:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">pysparkling</span> <span class="kn">import</span> <span class="n">Context</span>
<span class="o">>>></span> <span class="n">Context</span><span class="p">()</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="mi">3</span>
</code></pre></div>
<p>This is similar for stream processing. Incoming data is batched every
0.1 seconds (the batch interval — the second parameter to <code>StreamingContext</code>) and
elements are counted in 0.2 second windows, i.e. two batch intervals, which
returns the count of the first batch, the count of the first and second batch
and the count of the second and third batch:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">pysparkling</span>
<span class="o">>>></span> <span class="n">sc</span> <span class="o">=</span> <span class="n">pysparkling</span><span class="o">.</span><span class="n">Context</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">ssc</span> <span class="o">=</span> <span class="n">pysparkling</span><span class="o">.</span><span class="n">streaming</span><span class="o">.</span><span class="n">StreamingContext</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="o">>>></span> <span class="p">(</span>
<span class="o">...</span> <span class="n">ssc</span>
<span class="o">...</span> <span class="o">.</span><span class="n">queueStream</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]])</span>
<span class="o">...</span> <span class="o">.</span><span class="n">countByWindow</span><span class="p">(</span><span class="mf">0.2</span><span class="p">)</span>
<span class="o">...</span> <span class="o">.</span><span class="n">foreachRDD</span><span class="p">(</span><span class="k">lambda</span> <span class="n">rdd</span><span class="p">:</span> <span class="nb">print</span><span class="p">(</span><span class="n">rdd</span><span class="o">.</span><span class="n">collect</span><span class="p">()))</span>
<span class="o">...</span> <span class="p">)</span>
<span class="o">>>></span> <span class="n">ssc</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">ssc</span><span class="o">.</span><span class="n">awaitTermination</span><span class="p">(</span><span class="mf">0.35</span><span class="p">)</span>
<span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="p">[</span><span class="mi">7</span><span class="p">]</span>
<span class="p">[</span><span class="mi">6</span><span class="p">]</span>
</code></pre></div>
<p>Other new features apart from the <code>pysparkling.streaming</code> module are an
improved <code>pysparkling.fileio</code> module, methods for reading binary files
(<a href="http://pysparkling.trivial.io/en/latest/api_context.html#pysparkling.Context.binaryFiles">binaryFiles()</a> and
<a href="http://pysparkling.trivial.io/en/latest/api_context.html#pysparkling.Context.binaryRecords">binaryRecords()</a>)
and more inline examples in the documentation.</p>
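<p><code>binaryRecords()</code> splits a file into fixed-length records. The underlying chunking can be sketched with the standard library alone (the <code>read_binary_records</code> helper below is hypothetical, not pysparkling's implementation):</p>

```python
import io
import struct

def read_binary_records(stream, record_length):
    # Yield fixed-length byte records, mirroring what binaryRecords()
    # does for a file; a trailing partial record is dropped.
    while True:
        record = stream.read(record_length)
        if len(record) < record_length:
            return
        yield record

# Three 8-byte records, each holding two big-endian 32-bit integers.
data = b''.join(struct.pack('>ii', a, b) for a, b in [(1, 2), (3, 4), (5, 6)])
records = [struct.unpack('>ii', r)
           for r in read_binary_records(io.BytesIO(data), 8)]
print(records)  # [(1, 2), (3, 4), (5, 6)]
```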
<p>Head over to the <a href="http://pysparkling.trivial.io/en/latest/api_rdd.html">RDD (batch datasets)</a> and
<a href="http://pysparkling.trivial.io/en/latest/api_streaming.html#dstream">DStream (discrete stream)</a>
documentation to learn more!</p>
<p><a href="http://pysparkling.trivial.io"><img alt="API documentation at pysparkling.trivial.io" src="/images/pysparkling_streaming_doc.png"></a></p>Decouple2016-12-31T00:00:00+01:002016-12-31T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2016-12-31:/blog/decouple/<p>Decoupling theoretical uncertainties from measurements of the Higgs boson.</p><p>The paper <em>Decoupling theoretical uncertainties from measurements of the Higgs boson</em> <a href='#decouple' id='ref-decouple-1'>(Cranmer et al., 2015)</a> by Kyle Cranmer, David Lopez-Val, Tilman Plehn and me is now published in <a href="http://journals.aps.org/prd/abstract/10.1103/PhysRevD.91.054032">Phys Rev D91</a> and also available at <a href="https://arxiv.org/abs/1401.0080">arXiv:1401.0080</a>.</p>
<blockquote>
<p>We develop a technique to present Higgs coupling measurements, which decouple the poorly defined theoretical uncertainties associated to inclusive and exclusive cross section predictions. The technique simplifies the combination of multiple measurements and can be used in a more general setting. We illustrate the approach with toy LHC Higgs coupling measurements and a collection of new physics models.</p>
</blockquote>
<p>We share all of the software involved. The model files and a Mathematica notebook are published on <a href="http://figshare.com/articles/Supplementary_Material_for_A_Novel_Approach_to_Higgs_Coupling_Measurements_/888607">figshare</a>, and all the code is on <a href="http://github.com/svenkreiss/decouple">github</a>, where it can be used for new models or to reproduce every plot in the paper simply by typing <code>make</code>. There is even a little <a href="http://github.com/svenkreiss/decoupledDemo">demo project</a> that pulls three pre-made decoupled models from a webpage, recouples them and produces two plots of combined benchmark coupling models.</p>
<p><img src="/images/decouple.png" alt="Decouple part of figure 4" /></p><hr>
<h2>Bibliography</h2>
<p id='decouple'>Kyle Cranmer, Sven Kreiss, David Lopez-Val, and Tilman Plehn.
<span class="bibtex-protected">Decoupling Theoretical Uncertainties from Measurements of the Higgs Boson</span>.
<em>Phys. Rev.</em>, D91(5):054032, 2015.
<a href="https://arxiv.org/abs/1401.0080">arXiv:1401.0080</a>, <a href="https://doi.org/10.1103/PhysRevD.91.054032">doi:10.1103/PhysRevD.91.054032</a>. <a class="cite-backref" href="#ref-decouple-1" title="Jump back to reference 1">↩</a></p>
Databench v0.42016-09-22T00:00:00+02:002016-09-22T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2016-09-22:/blog/databench-v04/<p>New release of Databench that switches the backend from Flask to Tornado, fully supports Python 2 and 3, transpiles ES6 to legacy JavaScript and runs unit tests and coverage on every commit.</p><p><img class="image-process-crisp" src="/images/databench_examples.png" alt="screenshot of index page for examples" /></p>
<p>Databench v0.4 is <a href="https://github.com/svenkreiss/databench/releases/tag/v0.4.0">released</a>.
It is a major change from the v0.3 branch. All <a href="http://databench.trivial.io/">documentation</a>,
<a href="https://github.com/svenkreiss/databench_examples">examples</a> and
<a href="http://databench-examples.trivial.io/">demos</a> are updated.
Install the new version with</p>
<div class="highlight"><pre><span></span><code>pip install --upgrade databench
</code></pre></div>
<p>Here are the highlights:</p>
<ul>
<li>Migrated from <strong>Flask to Tornado</strong>, and with that switched from Jinja2 templates to Tornado templates.</li>
<li>With that new backend, <strong>Python 2.7, 3.4 and 3.5</strong> are supported.</li>
<li>The previous version had many dependencies; a major goal of the refactor was to reduce them. This version only depends on <strong>tornado, pyyaml and pyzmq</strong>. Markdown and docutils, which support <em>md</em> and <em>rst</em> readme files, are optional.</li>
<li>A <strong>Datastore</strong> was added. This concept encourages a consistent pattern for state that works with multiple threads and languages (see the new part of the documentation on <a href="http://databench.trivial.io/en/latest/quickstart.html#data-flow">data flow</a>).</li>
<li>Front-end code in <strong>ES6</strong> that is transpiled to legacy JavaScript. For analysis code, there is also built-in support for <a href="http://databench.trivial.io/en/latest/frontend.html#node-modules">node_modules</a>.</li>
<li><strong>Unit tests run automatically</strong> on every commit, always for Python 2.7, 3.4 and 3.5, and the <strong>documentation is built and updated</strong> at the same time. Test coverage is tracked continuously and currently stands at 95%.</li>
</ul>
<p>If you want to jump right in, start with the
<a href="http://databench.trivial.io/">documentation</a> and have a look at some
<a href="https://github.com/svenkreiss/databench_examples">examples</a>.</p>word2vec on Databricks2015-12-22T00:00:00+01:002015-12-22T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2015-12-22:/blog/word2vec-on-databricks/<p>Running word2vec on Databricks. A full example of using gensim and distributed maps with Spark to run this Python analysis on Databricks.</p><p>Word2vec is an interesting approach to convert a word into a feature vector
(<a href="https://code.google.com/p/word2vec/">original C code</a> by Mikolov et al).
One of the observations in the original paper was that words with similar
meaning have a smaller cosine distance than dissimilar words. Here is a
histogram of the pairwise cosine distances of about 500 media topics
(derived from <a href="http://cv.iptc.org/newscodes/mediatopic/">IPTC news codes</a>):</p>
<p><img class="image-proces-crisp" src="/images/word2vec_angle.png" alt="distribution of Cosine distances of word vectors" /></p>
<p>Cosine distance is defined as <code>1 - cos(angle(vector1, vector2))</code>. Most of the vector
pairs have angles only slightly smaller than 90°, which makes sense as more
topics are unrelated to each other than related. The closest 5% of vector
pairs are still separated by angles of up to 73°. The smallest angular separation
is 18°, between <em>breaststroke</em> and <em>backstroke</em>, and the second smallest
is 27°, between <em>triple_jump</em> and <em>pole_vault</em>.</p>
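<p>For reference, the definition in code with toy two-dimensional vectors (standard library only; a sketch, not the analysis code behind the plot above):</p>

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity of two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def angle_degrees(u, v):
    # recover the angular separation from the cosine distance
    return math.degrees(math.acos(1.0 - cosine_distance(u, v)))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal vectors
print(angle_degrees([1.0, 0.0], [1.0, 1.0]))    # roughly 45 degrees
```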
<p>To visualize these topics below, the 300-dimensional word vectors are embedded
in two dimensions using t-SNE. Edges between the topics with the smallest 5% in
cosine distance in the original space are drawn in orange.</p>
<p><img class="image-proces-crisp" src="/images/word2vec_tsne.png" alt="tsne of word vectors" /></p>
<p>Similar topics are indeed close together. However, one could argue that <em>imports</em>
is the opposite of <em>exports</em> and that the two therefore should not be close together; but they
are (at the bottom). Similarly, <em>employment</em> is close to <em>unemployment</em>. This is
not how a person would think about “similarity” in this context, but it makes
sense given the skip-gram training of the word vectors: a neural network tries
to predict a word (here a topic) given a window of surrounding words. These
topics would indeed appear in news articles with similar words surrounding
them. It is important to keep this subtlety in mind when building tools on
top of word2vec.</p>
<h2>Using word2vec on Databricks</h2>
<p>Spark and MLlib come with a built-in implementation of word2vec. However, we
also want to apply word2vec in stand-alone Python and therefore chose the
<code>gensim</code> implementation.</p>
<p>We use Databricks to process a large number of documents (not to train
word2vec, but to apply it). We create a "Mapper Tool", distributed as a
Python <code>egg</code>, that converts text to word vectors. The tool reads
in previously created word vectors from a compressed binary file that is larger
than 1GB, which takes about a minute.</p>
<p>There are two ingredients that we need: a large binary input file available at
all worker nodes and a way to cache the word vectors in memory across map
operations.</p>
<div class="highlight"><pre><span></span><code><span class="n">dbutils</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">mount</span><span class="p">(</span><span class="s1">'s3n://your_bucket/some_folder'</span><span class="p">,</span> <span class="s1">'/mnt/some_folder'</span><span class="p">)</span>
</code></pre></div>
<p>The default scheme is <code>dbfs:/</code>, not <code>file:/</code>, which means that this S3 folder
is now available in <code>dbfs</code>. <code>dbutils</code> can copy data from <code>dbfs</code> to the local
file system, but only on the driver. On worker instances, <code>dbutils</code> is not
available. However, <code>dbfs</code> is mounted using FUSE at <code>file:/dbfs</code>
(<a href="https://forums.databricks.com/answers/2966/view.html">Databricks Forum post</a>)
and we can use the local file path <code>/dbfs/mnt/some_folder/word2vec_file.bin.gz</code>
on the driver and the workers.</p>
<h2>Mapper Tool</h2>
<p>The tool is a wrapper around the word2vec implementation in the Python package
<a href="https://radimrehurek.com/gensim/models/word2vec.html">gensim</a>,
<code>gensim.models.Word2Vec</code>. We want an in-memory cache that persists across
map operations. Python class variables are not serialized when an
instance of the class is serialized, so we can use one as a process-wide cache.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">gensim.models.word2vec</span> <span class="kn">import</span> <span class="n">Word2Vec</span>
<span class="k">class</span> <span class="nc">Tool</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fn</span> <span class="o">=</span> <span class="n">filename</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">word2vec</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">fn</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">Tool</span><span class="o">.</span><span class="n">cache</span><span class="p">:</span>
<span class="n">Tool</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">fn</span><span class="p">]</span> <span class="o">=</span> \
<span class="n">Word2Vec</span><span class="o">.</span><span class="n">load_word2vec_format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fn</span><span class="p">,</span> <span class="n">binary</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">return</span> <span class="n">Tool</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">fn</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">word</span><span class="p">):</span>
<span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">word2vec</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">None</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">word2vec</span><span class="p">[</span><span class="n">word</span><span class="p">]</span>
</code></pre></div>
<p>You can use this tool on the driver or in a map function that gets shipped to
the workers. The call to <code>load_word2vec_format()</code> is expensive, but in this
design it is executed only once in each process.</p>
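<p>The class-variable trick can be demonstrated without gensim: only the instance's own attributes are part of its serialized state, so cached data stays behind in the process. A minimal sketch with a hypothetical <code>Memo</code> class:</p>

```python
class Memo(object):
    cache = {}  # class variable: shared by every instance in the process

    def __init__(self, key):
        self.key = key

    def value(self):
        if self.key not in Memo.cache:
            Memo.cache[self.key] = 'expensive result for %s' % self.key
        return Memo.cache[self.key]

m = Memo('a')
m.value()          # the first call populates the process-wide cache
print(vars(m))     # {'key': 'a'}: only this instance state would be serialized
print(Memo('a').value())  # a fresh instance reuses the cached value
```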
<p>Example application:</p>
<div class="highlight"><pre><span></span><code><span class="n">filename</span> <span class="o">=</span> <span class="s1">'/dbfs/mnt/some_folder/word2vec_file.bin.gz'</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'Some'</span><span class="p">,</span> <span class="s1">'sentence'</span><span class="p">,</span> <span class="s1">'as'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">]</span>
<span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">Tool</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
</code></pre></div>
<p>There are of course other ways to accomplish this, but I wanted to share the
method that works well for us.</p>
<h2>Summary</h2>
<p>This post gave an introduction to word2vec and showed how to distribute a large
input file to worker nodes on Databricks. It also showed how to create a Mapper
Tool that can cache input data across map jobs in memory.</p>
<p>During this work, I submitted two pull requests to <code>gensim</code>
<a href="https://github.com/piskvorky/gensim/pull/545">#545</a> and
<a href="https://github.com/piskvorky/gensim/pull/555">#555</a> which are merged into the
master branch. With the next release, <code>load_word2vec_format()</code> will be faster.</p>Parallel Processing with pysparkling2015-12-04T00:00:00+01:002015-12-04T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2015-12-04:/blog/parallel-processing-with-pysparkling/<p>Benchmarks for the latest parallel features in pysparkling. It shows good scaling for processing with multiple CPU cores. The example contains only a simple computation which shows that hyperthreading is not very effective in this case.</p><p><code>pysparkling</code> is a pure Python implementation of Apache Spark's RDD interface.
That means you can do <code>pip install pysparkling</code> and start running Spark code in
Python. Its main use is in low-latency applications where Spark operations
are applied to small datasets. However, <code>pysparkling</code> also supports the
parallelization of <code>map</code> operations through <code>multiprocessing</code>, <code>ipcluster</code> and
<code>concurrent.futures</code>. This feature is still in development, but this post
explores what is already possible. Fixes for bottlenecks found while
writing this post are included in version 0.3.10.</p>
<h2>Benchmark</h2>
<p>I wanted a CPU-bound benchmark to measure the overhead of object
serialization relative to actual computation. The benchmark function
is a Monte Carlo simulation that estimates the number <span class="math">\(\pi\)</span>. It generates
two uniformly distributed random numbers x and y, each between 0 and 1, and
checks whether <span class="math">\(x^2 + y^2 < 1\)</span>. The fraction of tries that satisfy this
condition approximates <span class="math">\(\pi/4\)</span>.</p>
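<p>The described computation looks roughly like this (a sketch of the idea, not the exact benchmark code from the test suite):</p>

```python
import random

def approximate_pi(n):
    # The fraction of uniform random points in the unit square that
    # fall inside the quarter circle approximates pi/4.
    hits = sum(
        1 for _ in range(n)
        if random.random() ** 2 + random.random() ** 2 < 1.0
    )
    return 4.0 * hits / n

print(approximate_pi(100000))  # close to 3.14
```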
<p>To understand the process better, I instrumented the job execution with timers.
Cumulative times for steps like data deserialization on the worker
nodes are aggregated in the <code>Context._stats</code> variable.</p>
<p>A few problems became apparent:</p>
<ul>
<li>The function that is applied in a map operation is the same for all
partitions of the dataset. In the previous implementation, this function
was serialized separately for every chunk of the data.</li>
<li>Through a nested dependency, all partitions of the data were sent to all
the workers. Now only the partition that a given worker processes is sent to it.</li>
<li>Another slowdown was that core pysparkling functions were not pickle-able.
That is not a problem for <code>cloudpickle</code>, but serializing and deserializing
non-pickle-able functions takes longer. The <code>map()</code> and <code>wholeTextFiles()</code>
methods have pickle-able helpers now.</li>
</ul>
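<p>The last point can be illustrated in isolation: the standard <code>pickle</code> module serializes functions by reference, so a lambda cannot be shipped, while an equivalent callable built from importable pieces can. A small sketch (not pysparkling's actual helpers):</p>

```python
import pickle
from functools import partial
from operator import add

# A lambda cannot be pickled by the stdlib pickle module:
try:
    pickle.dumps(lambda x: x + 5)
except Exception as e:
    print('lambda failed:', type(e).__name__)

# A pickle-able equivalent built from importable functions:
add5 = pickle.loads(pickle.dumps(partial(add, 5)))
print(add5(3))  # 8
```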
<h2>Results</h2>
<p>The test was run on a 4-core Intel i5 processor and this is the result:</p>
<p><img class="image-process-crisp" src="/images/pysparkling_4cores.png" alt="Speedup with parallel processing on a 4-core Intel i5." /></p>
<p>Achieving a 3x speedup with four cores is a good result for a real-world
benchmark. The new <code>Context._stats</code> variable gives more insight into where time is
actually spent. The numbers below are normalized with respect to the time
spent executing the map function. The results for this CPU-bound
benchmark with four processes are:</p>
<ul>
<li>map exec: 100.0%</li>
<li>driver deserialize data: 0.0%</li>
<li>map cache init: 0.2%</li>
<li>map deserialize data: 0.0%</li>
<li>map deserialize function: 2.1%</li>
</ul>
<p>Most of the time is spent in the actual map where it should be. The time it
takes to deserialize the map function is 2.1% of the time it takes to execute
it. The benchmark itself is run as a unit test in
<a href="https://github.com/svenkreiss/pysparkling/blob/master/tests/test_multiprocessing.py#L136">tests/test_multiprocessing.py</a>
and the plots can be recreated with <code>ipython tests/multiprocessing_performance_plot.py</code>.</p>
<p>The test was also run on a 4-core Intel i7 processor with Hyperthreading. The
performance is slightly better than with the i5, but the doubled thread
count does not double the performance.</p>
<p><img class="image-process-crisp" src="/images/pysparkling_4cores_hyperthreading.png" alt="Speedup with parallel processing on a 4-core Intel i7." /></p>
<p>As a first pass at multiprocessing with pysparkling, this is a good result.
Please check out the project on <a href="https://github.com/svenkreiss/pysparkling">Github</a>,
install it with <code>pip install pysparkling</code> and send feedback.</p>pysparkling Talks2015-08-16T00:00:00+02:002015-08-16T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-08-16:/blog/pysparkling-talks/<p>A collection of talks and links on pysparkling at PyGothan and Hack-and-Tell.</p><p><img class="image-process-crisp" src="/images/pysparkling_slide.png" alt="A slide on the basics of pysparkling from the PyGotham talk." /></p>
<ul>
<li>PyGotham (25min):
<a href="http://www.svenkreiss.com/files/pysparkling_at_pygotham_2015.pdf">slides</a>,
<a href="https://www.youtube.com/watch?v=KWxu5xuRtwo">video</a></li>
<li>Hack-and-Tell (5min):
<a href="http://www.svenkreiss.com/files/pysparkling_hack_and_tell.pdf">slides</a></li>
</ul>
<p>Links</p>
<ul>
<li>Documentation:
<a href="http://pysparkling.trivial.io/">pysparkling.trivial.io</a></li>
<li>Github:
<a href="https://github.com/svenkreiss/pysparkling">svenkreiss/pysparkling</a></li>
</ul>pysparkling2015-05-29T00:00:00+02:002015-05-29T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-05-29:/blog/pysparkling-initial/<p>A pure Python implementation of Apache Spark's RDD interfaces. pysparkling does not depend on Java and has a small execution overhead. It can be a fast test runner for Spark applications.</p><p><code>pysparkling</code> is a native Python implementation of the interface provided by
Spark’s RDDs. In Spark, RDDs are Resilient Distributed Datasets. An RDD
instance provides convenient access to the partitions of data that are
distributed on a cluster. New RDDs are created by applying transformations
like <code>map()</code> and <code>reduce()</code> to existing RDDs.</p>
<p><code>pysparkling</code> provides the same functionality, but without the dependency on
the Java Virtual Machine, Spark and Hadoop. The original motivation came from
implementing a processing pipeline that is common in machine learning: processing
a large number of documents in parallel to train a classification algorithm
(using Apache Spark) and then using that trained classification algorithm in an
API endpoint where it is applied to a single document at a time. That single
document, however, has to be preprocessed with the same transformations that were
applied during training. This is the task for <code>pysparkling</code>.</p>
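<p>As a sketch of that pattern, one shared (hypothetical) <code>preprocess</code> function serves both stages: on the cluster it is passed to <code>map()</code>, and in the API endpoint it is called directly on one document:</p>

```python
def preprocess(text):
    # Shared transformation: must be identical at training and serving time.
    return [w.lower() for w in text.split() if w.isalpha()]

# Training time: applied to many documents, e.g. rdd.map(preprocess) on a cluster.
corpus = ['Good morning', 'Bad weather today']
training_tokens = [preprocess(doc) for doc in corpus]

# Serving time: the very same function applied to a single incoming document.
print(preprocess('Good weather'))  # ['good', 'weather']
```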
<p>Removing the dependency on the JVM, Spark and Hadoop comes at a cost:</p>
<ul>
<li>Hadoop file io is gone, but its core functionality is reimplemented in
<code>pysparkling.fileio</code>. This by itself is very handy: you can read the
contents of files on <code>s3://</code>, <code>http://</code> and <code>file://</code>, optionally with
gzip or bz2 compression, just by specifying a file name. The name can
include multiple comma-separated files and the wildcards <code>?</code> and <code>*</code>.</li>
<li>Managed resource allocation on clusters is gone (no YARN). Parallel
execution with <code>multiprocessing</code> is supported though.</li>
</ul>
<p>It also comes with some advanced features:</p>
<ul>
<li>Parallelization with any object that has a map() method. That includes
<code>multiprocessing.Pool</code> and <code>concurrent.futures.ProcessPoolExecutor</code>.</li>
<li>It provides lazy and distributed execution. For example, when creating
an RDD from 50,000 text files with
<code>myrdd = Context().textFile('s3://mybucket/alldata/*.gz')</code> and only accessing
one record with <code>myrdd.takeSample(1)</code>, <code>pysparkling</code> will only download a
single file from S3 and not all 50,000.</li>
<h2>Quickstart</h2>
<p>Install pysparkling with <code>pip install pysparkling</code>. As a first example,
count the number of occurrences of every word in <code>README.rst</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pysparkling</span> <span class="kn">import</span> <span class="n">Context</span>
<span class="n">counts</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">Context</span><span class="p">()</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s1">'README.rst'</span><span class="p">)</span>
<span class="o">.</span><span class="n">flatMap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>
<span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">word</span><span class="p">:</span> <span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">counts</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
</code></pre></div>
<p>More examples including how to explore the Common Crawl dataset and the dataset
of the Human Biome Project are in this
<a href="https://github.com/svenkreiss/pysparkling/blob/master/docs/demo.ipynb">IPython Notebook</a>.</p>
<h2>Further Reading</h2>
<p>Get an overview of the API and more details from
<a href="http://pysparkling.trivial.io">pysparkling's documentation</a>.
If you like this project, <a href="https://github.com/svenkreiss/pysparkling">star it on Github</a>,
tweet about it and follow me, @svenkreiss, on Twitter.</p>Wildcardians on Twitter2015-04-27T00:00:00+02:002015-04-27T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-04-27:/blog/wildcardians-on-twitter/<p>A short project to visualize the social Twitter graph of people at Wildcard. The backend is particularly efficient in the number of API calls. The visualization is interactive in d3.js.</p><p>Last Wednesday was Hack Day at Wildcard. This is a small extension of what we had done there:</p>
<p><img class="image-process-crisp" src="/images/wildcardians_on_twitter.png" alt="social graph of people at Wildcard" /></p>
<p>This graph was built from one API call to Twitter per person at Wildcard,
so only 23 API calls. People at Wildcard are represented by blue dots with their
Twitter handles next to them. The size of a dot is related to the number of followers.
Orange dots are tweets. Black dots are other Twitter handles that were mentioned.</p>
<p>The real visualization is interactive. You can hover over every tweet to read
its content and over every mentioned Twitter handle to reveal it.
That way, I discovered a few interesting Twitter accounts that are mentioned
in the tweets.</p>
<p>The backend is a Python script. The front-end is a Databench analysis with a
d3.js visualization.</p>Collaborative Statistical Modeling2015-04-11T00:00:00+02:002015-04-11T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-04-11:/blog/collaborative-statistical-modeling/<p>A poster about the collaborative statistical modeling work that Kyle Cranmer and I did and that was used to discover the Higgs boson. We presented this at the opening of the Center for Data Science at NYU.</p><p>This poster was created by <a href="http://theoryandpractice.org/">Kyle Cranmer</a> and me.
It is about the tools we built that were part of the discovery of the Higgs
boson. It is from a while ago, but it deserves more exposure: the work behind it
is applicable outside of physics yet largely unknown.</p>
<p><img class="image-process-crisp" src="/images/nyu_cds_open_poster.png" alt="Poster on collaborative statistical modeling at the opening of the NYU Data Science center" /></p>
<p><a href="/files/nyu_cds_open_poster.pdf">PDF</a> of the poster.</p>
<p>The Higgs group of the ATLAS experiment (one of the two large experiments at
CERN) has a few hundred members working in seven subgroups. The final
statistical test to claim the discovery is done with a combined statistical
model that takes input models from all subgroups, plus models from detector
performance groups and theoretical models from outside the Higgs group. It is
based on statistical methods and technical innovation that deserve more
attention. Outside of particle physics, this topic is gaining
interest, but people are largely unaware of the experience and technology
built up in particle physics.</p>
<p>The important part is the separation of model and method. Collaborative
statistical modeling at ATLAS concerns the way models are built,
investigated and debugged. The methods (inference,
generation, confidence intervals, credibility intervals, posterior
probabilities, hypothesis tests, ...) are implemented by tools that take a
model as input. Any method, no matter whether Frequentist or Bayesian, can
be applied to any model.</p>
<p>Links:</p>
<ul>
<li><a href="https://cds.nyu.edu/projects/collaborative-statistical-modeling/">Web page</a>
about our poster at the opening of the NYU Center for Data Science.</li>
<li><a href="/files/nyu_cds_open_poster.pdf">PDF</a> of the poster.</li>
</ul>PhD Thesis2014-08-17T00:00:00+02:002014-08-17T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2014-08-17:/blog/phd-thesis/<p>Finished my PhD thesis: Higgs Boson Discovery and First Property Measurements using the ATLAS Detector. It summarizes my work over a few years on Higgs Physics with ATLAS and on collaborative statistical modeling.</p><p><em>Higgs Boson Discovery and First Property Measurements using the ATLAS Detector</em></p>
<p>Last May, I finished my PhD in Particle Physics. I had a great time studying physics and doing research at some of the best places: the University of Edinburgh, Scotland, for my bachelor's and master's degrees and New York University for my PhD including a year at CERN in Geneva, Switzerland. I also had great advisors: <a href="http://www.thphys.uni-heidelberg.de/~plehn/">Tilman Plehn</a>, <a href="http://www.physics.carleton.ca/people/faculty-members/thomas-gregoire">Thomas Gregoire</a> and <a href="http://theoryandpractice.org/">Kyle Cranmer</a>.</p>
<p><img class="img-thumbnail float-right" src="/images/phd_higgs_overview.png" width="300" title="famous Higgs overview plot" alt="famous Higgs overview plot">
For my PhD, I was working on the discovery of the Higgs boson. CERN was an amazing place during that time, with the best particle physicists from all over the world working together. I made substantial contributions to the discovery in the <a href="http://atlas.ch/">ATLAS collaboration</a>. I was the first person to combine two search channels and to see the 5σ discovery threshold being breached (<a href="/blog/chasing-the-higgs-nyt/">blog post on the New York Times article</a>). I created the plot on the right that was published in the <a href="http://www.sciencedirect.com/science/article/pii/S037026931200857X">ATLAS discovery paper</a>, <a href="http://science.sciencemag.org/content/sci/338/6114/1576.full.pdf">Science</a> and many other places. I also worked on measuring the Higgs boson mass and the coupling strengths to other particles. A large part of my time was dedicated to statistical modeling and the development of analysis tools, some of which are now part of <a href="https://root.cern.ch">CERN's ROOT data analysis tool</a> and its statistics extension <a href="https://twiki.cern.ch/twiki/bin/view/RooStats/WebHome">RooStats</a>.</p>
<h3>Download Thesis: <a href="/files/phd_thesis.pdf"><i class="fa fa-book fa-lg"></i> PDF</a></h3>Databench2014-06-03T00:00:00+02:002014-06-03T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2014-06-03:/blog/databench-initial/<p>Databench is a data analysis tool using Flask, Socket.IO and d3.js with optional parallelization with Redis Queue and visualization with mpld3.</p><blockquote>
<p>Databench is a data analysis tool using <a href="http://flask.pocoo.org/">Flask</a>, <a href="https://socket.io/">Socket.IO</a> and <a href="https://d3js.org/">d3.js</a> with optional parallelization with <a href="http://python-rq.org/">Redis Queue</a> and visualization with <a href="http://mpld3.github.io/">mpld3</a>. Check out the <a href="http://databench-examples.trivial.io">live demos</a>.</p>
</blockquote>
<p><a href="http://databench-examples.trivial.io"><img class="image-process-crisp top" alt="matplotlib d3 demo" src="/images/mpld3pi_demo_noframe.png" /></a></p>
<p>Seriously, check out the <a href="http://databench-examples.trivial.io">live demos</a>.</p>
<p>All source code is available on GitHub:</p>
<ul>
<li><a href="https://github.com/svenkreiss/databench">github.com/svenkreiss/databench</a></li>
<li><a href="https://github.com/svenkreiss/databench_examples">github.com/svenkreiss/databench_examples</a></li>
<li><a href="https://github.com/svenkreiss/databench_examples_viewer">github.com/svenkreiss/databench_examples_viewer</a></li>
</ul>
<h2>Motivation</h2>
<p>I like Python for data analysis. However, its frontend options for visualization are limited. <code>d3.js</code> is a great JavaScript library, and the web browser is a powerful user interface. <code>Databench</code> makes Python communicate with the web frontend with minimal effort.</p>
<p>The frontend can be interactive (real-time communication goes both ways between <code>Python</code> and <code>JavaScript</code>/<code>d3.js</code>) and can contain explanatory text and documentation.</p>
<p>To run Databench, you need to install it with <code>pip</code>:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/svenkreiss/databench.git
</code></pre></div>
<p>(preferably inside a <code>virtualenv</code>). Then you create an <code>analyses</code> folder, run <code>databench</code> on the command line</p>
<div class="highlight"><pre><span></span><code><span class="o">(</span>venv<span class="o">)</span>analysisfolder$ databench
Registering analysis simplepi as blueprint <span class="k">in</span> flask.
Registering analysis slowpi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3pi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3PointLabel as blueprint <span class="k">in</span> flask.
Registering analysis mpld3Drag as blueprint <span class="k">in</span> flask.
Connecting socket.io to simplepi.
Connecting socket.io to slowpi.
Connecting socket.io to mpld3pi.
Connecting socket.io to mpld3PointLabel.
Connecting socket.io to mpld3Drag.
--- databench ---
* Running on http://0.0.0.0:5000/
* Restarting with reloader
Registering analysis simplepi as blueprint <span class="k">in</span> flask.
Registering analysis slowpi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3pi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3PointLabel as blueprint <span class="k">in</span> flask.
Registering analysis mpld3Drag as blueprint <span class="k">in</span> flask.
Connecting socket.io to simplepi.
Connecting socket.io to slowpi.
Connecting socket.io to mpld3pi.
Connecting socket.io to mpld3PointLabel.
Connecting socket.io to mpld3Drag.
--- databench ---
</code></pre></div>
<p>and point your web-browser to <code>http://localhost:5000/</code>.</p>
<h2>Example Analysis: <code>simplepi</code></h2>
<p>Create a project-folder with this structure:</p>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">analyses</span>
<span class="l l-Scalar l-Scalar-Plain">- templates</span>
<span class="l l-Scalar l-Scalar-Plain">- simplepi.html</span>
<span class="l l-Scalar l-Scalar-Plain">- __init__.py</span>
<span class="l l-Scalar l-Scalar-Plain">- simplepi.py</span>
</code></pre></div>
<p>On the command line, all that is necessary is to run <code>databench</code>; it prints the URL (usually <code>http://localhost:5000</code>) that you can open in a web browser.</p>
<p>This is the backend in <code>simplepi.py</code> <em>(updated June 10, 2014)</em>:</p>
<div class="highlight"><pre><span></span><code><span class="sd">"""Calculating \\(\\pi\\) the simple way."""</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
<span class="kn">from</span> <span class="nn">random</span> <span class="kn">import</span> <span class="n">random</span>
<span class="kn">import</span> <span class="nn">databench</span>
<span class="n">simplepi</span> <span class="o">=</span> <span class="n">databench</span><span class="o">.</span><span class="n">Analysis</span><span class="p">(</span><span class="s1">'simplepi'</span><span class="p">,</span> <span class="vm">__name__</span><span class="p">)</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="vm">__doc__</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">thumbnail</span> <span class="o">=</span> <span class="s1">'simplepi.png'</span>
<span class="nd">@simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s1">'connect'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">onconnect</span><span class="p">():</span>
<span class="sd">"""Run as soon as a browser connects to this."""</span>
<span class="n">inside</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
<span class="n">sleep</span><span class="p">(</span><span class="mf">0.001</span><span class="p">)</span>
<span class="n">r1</span> <span class="o">=</span> <span class="n">random</span><span class="p">()</span>
<span class="n">r2</span> <span class="o">=</span> <span class="n">random</span><span class="p">()</span>
<span class="k">if</span> <span class="n">r1</span><span class="o">*</span><span class="n">r1</span> <span class="o">+</span> <span class="n">r2</span><span class="o">*</span><span class="n">r2</span> <span class="o"><</span> <span class="mf">1.0</span><span class="p">:</span>
<span class="n">inside</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">draws</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'log'</span><span class="p">,</span> <span class="p">{</span><span class="s1">'draws'</span><span class="p">:</span><span class="n">draws</span><span class="p">,</span> <span class="s1">'inside'</span><span class="p">:</span><span class="n">inside</span><span class="p">})</span>
<span class="n">p</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">inside</span><span class="p">)</span><span class="o">/</span><span class="n">draws</span>
<span class="n">uncertainty</span> <span class="o">=</span> <span class="mf">4.0</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">draws</span><span class="o">*</span><span class="n">p</span><span class="o">*</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span> <span class="o">/</span> <span class="n">draws</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'status'</span><span class="p">,</span> <span class="p">{</span>
<span class="s1">'pi-estimate'</span><span class="p">:</span> <span class="mf">4.0</span><span class="o">*</span><span class="n">inside</span><span class="o">/</span><span class="n">draws</span><span class="p">,</span>
<span class="s1">'pi-uncertainty'</span><span class="p">:</span> <span class="n">uncertainty</span>
<span class="p">})</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'log'</span><span class="p">,</span> <span class="p">{</span><span class="s1">'action'</span><span class="p">:</span> <span class="s1">'done'</span><span class="p">})</span>
</code></pre></div>
<p>The analysis waits for the <code>connect</code> signal and then starts the computation. It provides the frontend with live updates through <code>signals.emit()</code>, where some of the <code>emit()</code> messages are for the <code>log</code> window and others are <code>status</code> updates.</p>
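<p>The <code>pi-uncertainty</code> sent with each <code>status</code> update follows from binomial counting statistics: out of \(n\) draws, the number of points inside the quarter circle has standard deviation \(\sqrt{n\,p\,(1-p)}\) with \(p = \mathrm{inside}/n\), so the estimate \(\hat\pi = 4\,\mathrm{inside}/n\) carries the uncertainty $$\sigma_{\hat\pi} = \frac{4\sqrt{n\,p\,(1-p)}}{n},$$ which is exactly the expression computed in the backend.</p>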
<p>The frontend now has to listen to the signals that are emitted by the backend and act on them. The frontend <code>simplepi.html</code> is a <code>jinja2</code> template with math rendered with <a href="https://www.mathjax.org/">MathJax</a> using <code>\( ... \)</code> for inline math and <code>$$ ... $$</code> for display math <em>(updated June 10, 2014)</em>:</p>
<div class="highlight"><pre><span></span><code>{% extends "base.html" %}
{% block title %}simplepi{% endblock %}
{% block content %}
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>
simplepi
<span class="p"><</span><span class="nt">small</span><span class="p">><</span><span class="nt">i</span><span class="p">></span>π = <span class="p"><</span><span class="nt">span</span> <span class="na">id</span><span class="o">=</span><span class="s">"pi"</span><span class="p">></span>0.0 ± 1.0<span class="p"></</span><span class="nt">span</span><span class="p">></</span><span class="nt">i</span><span class="p">></</span><span class="nt">small</span><span class="p">></span>
<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span>This little demo uses two random numbers \(r_1\) and \(r_2\) and
then does a comparison $$r_1^2 + r_2^2 <span class="ni">&le;</span> 1.0$$ to figure out whether
the generated point is inside the first quadrant of the unit circle.<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="p"><</span><span class="nt">pre</span> <span class="na">id</span><span class="o">=</span><span class="s">"log"</span><span class="p">></</span><span class="nt">pre</span><span class="p">></span>
{% endblock %}
{% block footerscripts %}
<span class="p"><</span><span class="nt">script</span><span class="p">></span>
<span class="kd">var</span> <span class="nx">databench</span> <span class="o">=</span> <span class="nx">Databench</span><span class="p">(</span><span class="s1">'simplepi'</span><span class="p">);</span>
<span class="nx">databench</span><span class="p">.</span><span class="nx">genericElements</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">$</span><span class="p">(</span><span class="s1">'#log'</span><span class="p">));</span>
<span class="nx">databench</span><span class="p">.</span><span class="nx">signals</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="s1">'status'</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">msg</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">$</span><span class="p">(</span><span class="s1">'#pi'</span><span class="p">).</span><span class="nx">text</span><span class="p">(</span>
<span class="nx">msg</span><span class="p">[</span><span class="s1">'pi-estimate'</span><span class="p">].</span><span class="nx">toFixed</span><span class="p">(</span><span class="mf">3</span><span class="p">)</span><span class="o">+</span><span class="s1">' ± '</span><span class="o">+</span>
<span class="nx">msg</span><span class="p">[</span><span class="s1">'pi-uncertainty'</span><span class="p">].</span><span class="nx">toFixed</span><span class="p">(</span><span class="mf">3</span><span class="p">)</span>
<span class="p">);</span>
<span class="p">});</span>
<span class="p"></</span><span class="nt">script</span><span class="p">></span>
{% endblock %}
</code></pre></div>
<p>You may want to extend the Databench <code>base</code> template, which provides the header, the footer and some standard libraries, but you can also write your own. The <code>block content</code> is the HTML part of the frontend with fields for the results and an explanation of the algorithm. The <code>block footerscripts</code> provides the frontend logic. It wires the <code>log</code> signals to the <code>#log</code> field with <code>databench.genericElements.log($('#log'))</code>. It also listens for <code>status</code> signals; when a <code>status</code> signal is received, the callback function is executed with <code>msg</code> containing a JSON representation of the dictionary that the backend sent when emitting <code>status</code>.</p>
<p>And last, to make Databench aware of this analysis, add it to the <code>__init__.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">simplepi</span>
</code></pre></div>
<p>This is all that is necessary to create an analysis in Databench. Now you can run <code>databench</code> in the project-folder and visit <code>http://localhost:5000</code> to run and see the output of the analysis.</p>
<h2>Plotting with <code>matplotlib</code></h2>
<p>If you like Python, but are not too familiar with <code>d3.js</code>, you can use <a href="http://mpld3.github.io/">mpld3</a> to embed your python plots on the web. The <code>mpld3</code> website has a nice gallery of examples that should all work in Databench. Two of them -- one with a standard plugin and one with a custom plugin -- are <code>mpld3PointLabel</code> and <code>mpld3Drag</code> which are both included in the <a href="http://databench-examples.trivial.io">live demos</a> and the <a href="https://github.com/svenkreiss/databench_examples">databench_examples</a> repository.</p>
<h2>Parallelization</h2>
<p>Examples with parallel processing cannot be included in the <a href="http://databench-examples.trivial.io">live demos</a> but are included in the <a href="https://github.com/svenkreiss/databench_examples">databench_examples</a> repository.</p>
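<p>The basic pattern is to split the Monte Carlo draws into independent jobs. Here is a minimal Python sketch of such a split; the function below is the kind of pure worker that Redis Queue can enqueue (the names are illustrative and not part of Databench's API), run in-process here for simplicity:</p>

```python
import random

def sample_pi(draws, seed):
    """Worker: count random points that fall inside the unit quarter circle.

    A pure function like this is what Redis Queue would enqueue on a worker.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(draws):
        r1, r2 = rng.random(), rng.random()
        if r1 * r1 + r2 * r2 < 1.0:
            inside += 1
    return inside

# Split the total draws into four independent jobs. With Redis Queue this
# would be q.enqueue(sample_pi, draws, seed) per job; here we run in-process.
jobs = [(100000, seed) for seed in range(4)]
results = [sample_pi(draws, seed) for draws, seed in jobs]

total_draws = sum(draws for draws, _ in jobs)
pi_estimate = 4.0 * sum(results) / total_draws
```

<p>Merging the partial counts is a single sum, which is why the parallelization can live entirely on the analysis side.</p>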
<p>The <code>slowpi</code> example contains a demo implementation of parallelization with <a href="http://python-rq.org/">Redis Queue</a>. The parallelization is implemented entirely on the analysis side, without Databench knowing about it. Other parallelization techniques like <a href="http://www.celeryproject.org/">Celery</a> and <a href="http://www.rabbitmq.com/">RabbitMQ</a> probably work as well but are not tested yet.</p>dvds-js version 0.1.02014-04-25T00:00:00+02:002014-04-25T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2014-04-25:/blog/dvds-js-v0.1.0/<p>Distributed Versioned Data Structures in JavaScript. Like git in js.</p><h2>This article and dvds-js are outdated :(</h2>
<script src="//cdnjs.cloudflare.com/ajax/libs/d3/3.4.11/d3.min.js" charset="utf-8"></script>
<script src="http://requirejs.org/docs/release/2.1.2/minified/require.js"></script>
<script>
require.config({
paths: {
'crypto-js.SHA3': 'http://crypto-js.googlecode.com/svn/tags/3.1.2/build/rollups/sha3',
'dvds': 'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min',
'dvds.visualize': 'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min',
},
shim: {
'crypto-js.SHA3': {
exports: 'CryptoJS'
}
}
});
</script>
<blockquote>
<p>Distributed Versioned Data Structures in JavaScript. Like git in js.
Check out the code on <a href="http://github.com/svenkreiss/dvds-js">github.com/svenkreiss/dvds-js</a>.</p>
</blockquote>
<p>The aim of <code>dvds-js</code> is to have a container (or repository) for data structures in JavaScript that you can <code>fork()</code>, serialize and send over the wire, <code>commit()</code> to and then stream back and <code>merge()</code> with full conflict resolution. Here, <em>data structures</em> means anything that can be serialized with JSON.</p>
<p>This post is about the first development release, version 0.1.0.</p>
<h2>Example</h2>
<p>A repository <code>a</code> is created holding an array with the two names <code>Paul</code> and <code>Adam</code>. Then this repository is forked and the fork is called <code>b</code>. Both <code>a</code> and <code>b</code> are then modified. To demonstrate streaming capabilities, repository <code>b</code> is stringified before and after the manipulation. At the end <code>b</code> is merged into <code>a</code> and the result is shown below.</p>
<div class="highlight"><pre><span></span><code><span class="nx">require</span><span class="p">([</span><span class="s1">'dvds'</span><span class="p">,</span> <span class="s1">'dvds.visualize'</span><span class="p">],</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">a</span> <span class="o">=</span> <span class="ow">new</span> <span class="nx">dvds</span><span class="p">.</span><span class="nx">Repository</span><span class="p">([</span><span class="s1">'Paul'</span><span class="p">,</span><span class="s1">'Adam'</span><span class="p">]);</span>
<span class="kd">var</span> <span class="nx">b</span> <span class="o">=</span> <span class="nx">a</span><span class="p">.</span><span class="nx">fork</span><span class="p">();</span>
<span class="kd">var</span> <span class="nx">bString</span> <span class="o">=</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">b</span><span class="p">);</span>
<span class="c1">// send bString to a different machine and make it a repository again</span>
<span class="kd">var</span> <span class="nx">bStreamed</span> <span class="o">=</span> <span class="nx">dvds</span><span class="p">.</span><span class="nx">Repository</span><span class="p">.</span><span class="nx">parseJSON</span><span class="p">(</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">bString</span><span class="p">)</span> <span class="p">);</span>
<span class="nx">bStreamed</span><span class="p">.</span><span class="nx">data</span><span class="p">[</span><span class="mf">0</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Karl'</span><span class="p">;</span>
<span class="nx">bStreamed</span><span class="p">.</span><span class="nx">data</span><span class="p">[</span><span class="mf">1</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Peter'</span><span class="p">;</span>
<span class="c1">// convert to a string again to send back</span>
<span class="kd">var</span> <span class="nx">bStreamedString</span> <span class="o">=</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">bStreamed</span><span class="p">);</span>
<span class="c1">// meanwhile on a</span>
<span class="nx">a</span><span class="p">.</span><span class="nx">data</span><span class="p">[</span><span class="mf">0</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Paula'</span><span class="p">;</span>
<span class="c1">// receive the modified b repository</span>
<span class="kd">var</span> <span class="nx">bReceived</span> <span class="o">=</span> <span class="nx">dvds</span><span class="p">.</span><span class="nx">Repository</span><span class="p">.</span><span class="nx">parseJSON</span><span class="p">(</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">bStreamedString</span><span class="p">)</span> <span class="p">);</span>
<span class="nx">a</span><span class="p">.</span><span class="nx">merge</span><span class="p">(</span><span class="nx">bReceived</span><span class="p">);</span>
<span class="c1">// update html output</span>
<span class="nx">$</span><span class="p">(</span><span class="s2">"#test1Out"</span><span class="p">).</span><span class="nx">text</span><span class="p">(</span><span class="nb">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">a</span><span class="p">.</span><span class="nx">data</span><span class="p">));</span>
<span class="c1">// visualize</span>
<span class="nx">dvds</span><span class="p">.</span><span class="nx">visualize</span><span class="p">.</span><span class="nx">CommitGraph</span><span class="p">(</span><span class="nx">d3</span><span class="p">.</span><span class="nx">select</span><span class="p">(</span><span class="s1">'#test1Graph'</span><span class="p">))(</span><span class="nx">a</span><span class="p">);</span>
<span class="nx">dvds</span><span class="p">.</span><span class="nx">visualize</span><span class="p">.</span><span class="nx">CommitGraph</span><span class="p">(</span><span class="nx">d3</span><span class="p">.</span><span class="nx">select</span><span class="p">(</span><span class="s1">'#test2Graph'</span><span class="p">))(</span><span class="nx">bReceived</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div>
<p><strong>Live output</strong>: <span id="test1Out">?</span></p>
<p>Edit on <a href="http://jsfiddle.net/3Ruat/11/">http://jsfiddle.net/3Ruat/11/</a>.</p>
<h3>Graph of Commits</h3>
<p>Repositories start with commit 0, shown on the left, and develop towards the right, with the latest commit on the far right. The second graph shows the merge of <code>a</code> and <code>b</code> as the last commit. This is a live visualization of the two repositories in the example.</p>
<p>Repository <code>b</code>:</p>
<p><svg height="150" width="600" id="test2Graph"></svg></p>
<p>Repository <code>a</code> merged with <code>b</code>:</p>
<p><svg height="150" width="600" id="test1Graph"></svg></p>
<h2>Features</h2>
<ul>
<li>special merge algorithms for nested arrays and objects (e.g. arrays inside of objects inside of arrays inside of an object)</li>
<li>the commit hash is built over the commit's data but also over the entire parent tree, which means that a commit id validates the entire parent tree</li>
<li>a repository exposes the <code>data</code> member that behaves like a normal js variable (e.g. can be used in <code>angular.js</code> directly)</li>
<li>visualization (currently only <code>CommitGraph</code>) is factored into its own submodule <code>visualize</code></li>
<li>unit tests run with <code>Jasmine</code> and <code>Karma</code>, <code>jscs</code> is used to check code style, <code>uglify</code> is used to build min version and automation is done with <code>grunt</code></li>
</ul>
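<p>The chained commit hash in the list above is the same Merkle-style construction that git uses. As a language-neutral illustration, here is a toy Python sketch (not dvds-js's actual implementation, which hashes with CryptoJS's SHA3):</p>

```python
import hashlib
import json

def commit_id(data, parent_ids):
    """A commit id hashes the commit's own data together with its parents'
    ids, so changing any ancestor changes every descendant id."""
    payload = json.dumps({'data': data, 'parents': sorted(parent_ids)},
                         sort_keys=True)
    return hashlib.sha3_256(payload.encode()).hexdigest()

root = commit_id(['Paul', 'Adam'], [])
child = commit_id(['Paula', 'Adam'], [root])

# Tampering with the root produces a different root id, which in turn
# changes the id of every commit built on top of it.
tampered = commit_id(['Paul', 'Eve'], [])
assert child != commit_id(['Paula', 'Adam'], [tampered])
```

<p>This is why a single commit id can validate the entire parent tree.</p>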
<h2>Setup</h2>
<p><code>dvds-js</code> is an <a href="http://requirejs.org/docs/whyamd.html#amd">AMD library</a>. You can load it using <code>require-js</code> in the browser as in the example above. The setup looks something like this:</p>
<div class="highlight"><pre><span></span><code><span class="o"><</span><span class="nx">script</span> <span class="nx">src</span><span class="o">=</span><span class="s2">"http://s3.amazonaws.com/flaskApp_static/static/d3/d3.v3.min.js"</span> <span class="nx">charset</span><span class="o">=</span><span class="s2">"utf-8"</span><span class="o">><</span><span class="err">/script></span>
<span class="o"><</span><span class="nx">script</span> <span class="nx">src</span><span class="o">=</span><span class="s2">"http://requirejs.org/docs/release/2.1.2/minified/require.js"</span><span class="o">><</span><span class="err">/script></span>
<span class="o"><</span><span class="nx">script</span><span class="o">></span>
<span class="nx">require</span><span class="p">.</span><span class="nx">config</span><span class="p">({</span>
<span class="nx">paths</span><span class="o">:</span> <span class="p">{</span>
<span class="s1">'crypto-js.SHA3'</span><span class="o">:</span> <span class="s1">'http://crypto-js.googlecode.com/svn/tags/3.1.2/build/rollups/sha3'</span><span class="p">,</span>
<span class="s1">'dvds'</span><span class="o">:</span> <span class="s1">'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min'</span><span class="p">,</span>
<span class="s1">'dvds.visualize'</span><span class="o">:</span> <span class="s1">'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min'</span><span class="p">,</span>
<span class="p">},</span>
<span class="nx">shim</span><span class="o">:</span> <span class="p">{</span>
<span class="s1">'crypto-js.SHA3'</span><span class="o">:</span> <span class="p">{</span>
<span class="nx">exports</span><span class="o">:</span> <span class="s1">'CryptoJS'</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="o"><</span><span class="err">/script></span>
</code></pre></div>
<p>This includes <code>d3.js</code> for visualizations; <code>CryptoJS</code> is needed to calculate unique identifiers for commits.
In <code>node.js</code>, this setup is not necessary and you would simply use <code>require()</code>.</p>
<h2>Appendix: Static image of commit graphs</h2>
<p><img src="/images/dvds-js-v010-commitgraphs.png" width="500" title="Commit graphs of dvds-js example." alt="Commit graphs of dvds-js example."></p>
<script>
require(['dvds', 'dvds.visualize'], function() {
var a = new dvds.Repository(['Paul', 'Adam']);
var b = a.fork();
var bString = JSON.stringify(b);
// send bString to a different machine and make it a repository again
var bStreamed = dvds.Repository.parseJSON(JSON.parse(bString));
bStreamed.data[0] = 'Karl';
bStreamed.data[1] = 'Peter';
// convert to a string again to send back
var bStreamedString = JSON.stringify(bStreamed);
// meanwhile on a
a.data[0] = 'Paula';
// receive the modified b repository
var bReceived = dvds.Repository.parseJSON(JSON.parse(bStreamedString));
a.merge(bReceived);
// update html output
$("#test1Out").text(JSON.stringify(a.data));
// visualize
dvds.visualize.CommitGraph(d3.select('#test1Graph'))(a);
dvds.visualize.CommitGraph(d3.select('#test2Graph'))(bReceived);
});
</script>Vimeo liquid tag for Pelican2014-03-07T04:41:00+01:002014-03-07T04:41:00+01:00Sven Kreisstag:www.svenkreiss.com,2014-03-07:/blog/pelican-vimeo/<p>Extend liquid tags plugin for Pelican to include a Vimeo tag.</p><p>Testing my implementation of the <code>vimeo</code> tag for <code>liquid_tags</code>. This is based on the <code>youtube</code> tag which in turn is based on the <a href="https://gist.github.com/jamieowen/2063748">jekyll / octopress youtube tag</a>.</p>
<p>The syntax is the same as for the <code>youtube</code> tag:</p>
<div class="highlight"><pre>
{% vimeo id [width height] %}
</pre></div>
<p><em>Update</em>: The code is now merged into the main pelican-plugins repository on github:
<a href="https://github.com/getpelican/pelican-plugins">https://github.com/getpelican/pelican-plugins</a></p>
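<p>For reference, a liquid tag of this kind boils down to a regular expression and an iframe template. A hedged Python sketch (illustrative only; the merged code in pelican-plugins differs in its details):</p>

```python
import re

# Matches the markup inside {% vimeo id [width height] %}.
VIMEO_RE = re.compile(r'^(?P<id>\d+)(?:\s+(?P<w>\d+)\s+(?P<h>\d+))?$')

def vimeo(markup, default_size=(640, 360)):
    """Render 'id' or 'id width height' markup as a Vimeo iframe embed."""
    match = VIMEO_RE.match(markup.strip())
    if match is None:
        raise ValueError("expected 'id [width height]', got %r" % markup)
    width, height = default_size
    if match.group('w'):
        width, height = int(match.group('w')), int(match.group('h'))
    return (
        '<span class="videobox"><iframe '
        'src="//player.vimeo.com/video/%s?title=0&byline=0&portrait=0" '
        'width="%d" height="%d" frameborder="0" '
        'webkitAllowFullScreen mozallowfullscreen allowFullScreen>'
        '</iframe></span>' % (match.group('id'), width, height)
    )
```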
<h2>Tests with different sizes</h2>
<p><span class="videobox">
<iframe
src="//player.vimeo.com/video/21789576?title=0&byline=0&portrait=0"
width="320" height="180" frameborder="0"
webkitAllowFullScreen mozallowfullscreen allowFullScreen>
</iframe>
</span></p>
<p><span class="videobox">
<iframe
src="//player.vimeo.com/video/21789576?title=0&byline=0&portrait=0"
width="480" height="270" frameborder="0"
webkitAllowFullScreen mozallowfullscreen allowFullScreen>
</iframe>
</span></p>
<p><span class="videobox">
<iframe
src="//player.vimeo.com/video/21789576?title=0&byline=0&portrait=0"
width="640" height="360" frameborder="0"
webkitAllowFullScreen mozallowfullscreen allowFullScreen>
</iframe>
</span></p>morphDemo2014-03-06T10:10:00+01:002014-03-06T10:10:00+01:00Sven Kreisstag:www.svenkreiss.com,2014-03-06:/blog/morph-demo/<p>Demo of a new horizontal morphing algorithm.</p><p><img class="img-thumbnail float-right" src="/images/kdtreemorph_preview.png" width="350" title="preview of kd-tree morphing" alt="preview of kd-tree morphing">
<a href="/files/morphDemo.html">This</a> is an interactive demo of a new morphing algorithm with special properties motivated by physics. It uses KD trees and kernel density estimates that are computed in real time in this demo. All visualization is done using <code>d3.js</code> with custom JavaScript code for the KD trees and kernel densities.</p>
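<p>For background on what the demo computes: a kernel density estimate is simply an average of kernel bumps centered on the data points. A minimal Python sketch of a Gaussian KDE (conceptual only; the demo's own implementation is in JavaScript):</p>

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps, one per sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

f = gaussian_kde([0.0, 1.0, 1.2], bandwidth=0.5)
# A density integrates to one; check numerically on a wide grid.
total = sum(f(-5.0 + 0.01 * i) * 0.01 for i in range(2001))
```

<p>A KD tree can then speed up such evaluations by restricting the sum to samples near the query point.</p>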
<p>Link: <a href="/files/morphDemo.html">morphDemo.html</a></p>Chasing the Higgs - New York Times2014-03-01T02:00:00+01:002014-03-01T02:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2014-03-01:/blog/chasing-the-higgs-nyt/<p>My part of the Higgs discovery story in the New York Times.</p><p>My part of the Higgs discovery story in the New York Times.</p>
<p><img src="/images/nyt_science_front_page.jpeg" width="200" title="Science Times section front page" alt="Science Times section front page">
<img src="/images/nyt_science_my_part.jpeg" width="350" title="Chasing the Higgs, my part" alt="Chasing the Higgs, my part"></p>
<p>That "joyful expletive" was "Holly shit" (not correcting the typo) sent from his phone.<br />
Read the full story on the <a href="http://www.nytimes.com/2013/03/05/science/chasing-the-higgs-boson-how-2-teams-of-rivals-at-CERN-searched-for-physics-most-elusive-particle.html?view=Opening_the_Box">New York Times website</a> from March 5, 2013.</p>A Nobel Prize Party: Cheese, Bubbles, and a Boson - The New Yorker2013-10-10T08:00:00+02:002013-10-10T08:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2013-10-10:/blog/nobel-prize-party-new-yorker/<p>I got mentioned in the New Yorker article: A Nobel Prize Party: Cheese, Bubbles, and a Boson.</p><p>A funny and only approximately accurate article about how we celebrated the Nobel Prize for Peter Higgs and François Englert at NYU in the New Yorker:</p>
<p><img alt="New Yorker article about the party at the NYU Physics Department" src="/images/new_yorker_nobel_prize_party.png"></p>
<blockquote>
<p>Sven Kreiss, Cranmer’s graduate student, was the first to see the statistical evidence needed to claim the discovery in June, 2012. He told me in a strong German accent that they don’t host a lot of parties in the physics lounge. “We are very serious here,” he said.</p>
<p>Kreiss has a little goatee and was wearing a black T-shirt with an unzipped grey sweatshirt. He remained stoic as he recalled the moment, at CERN, the research center in Geneva, when he saw their research cross the finish line, confirming the particle’s existence. Kreiss was working on ATLAS (A Toroidal L.H.C. Apparatus), one of seven experiments being conducted at the Large Hadron Collider, and was on one of two detector teams going after the Higgs boson. “It’s a graph,” Kreiss said of what he saw at the time. “It has some lines. The line, it goes down like this”—he swooped his hand down—“and if the line goes down far enough, then you say you’ve discovered a new particle.” He shrugged.</p>
<p>...</p>
<p>Kreiss didn’t immediately think that the finding was Nobel-worthy. “It was combined with a lot of exhaustion,” he said. “You’re tired, you think about this, you go out and come back in. Actually, I had a good night’s sleep for the first time in a while. And then, in the morning, I came back and e-mailed this to my professor. It was his birthday, so I said, ‘Happy birthday.’ ” That was June 25, 2012. Cranmer can remember how excited he was to receive the note. “I wrote back, ‘Holy shit,’ ” he said. “But I misspelled ‘holy.’ Too many ‘L’s.”</p>
</blockquote>
<p>The full story is here: <a href="http://www.newyorker.com/tech/elements/a-nobel-prize-party-cheese-bubbles-and-a-boson">http://www.newyorker.com/tech/elements/a-nobel-prize-party-cheese-bubbles-and-a-boson</a></p>My robot Number4 in the Leipziger Volkszeitung2004-01-01T00:00:00+01:002004-01-01T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2004-01-01:/blog/number4-lvz/<p>Student from Brandis (me) teaches robot how to walk.</p><p>Newspaper article about the competition Jugend Forscht. In 2004 this project got me the first place in the regional and state competition. The photo was taken at the national competition in Saarbruecken.</p>
<p><img src="/images/number4/lvz.jpg" title="number4 in LVZ" alt="number4 in LVZ"></p>