Sven Kreisshttps://www.svenkreiss.com/2020-04-03T00:00:00+02:00Achievements in Two Years at EPFL2020-04-03T00:00:00+02:002020-04-03T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2020-04-03:/blog/epfl-two-years/<p>Summary of achievements in my first two years back in academia at EPFL.</p><p><img class="image-process-crisp" src="/images/twoyearsepfl_crowd_pifpaf.png" alt="PifPaf crowd" /></p>
<p>Two years ago, I returned from New York City to Geneva. I had left my
industry job and returned to academia as a postdoc in Computer Vision at
EPFL. It has been an exciting two years for me in Alexandre Alahi's lab
for <a href="https://www.epfl.ch/labs/vita/">Visual Intelligence for Transportation</a>.
When entering academia from industry, I had no research
or publication momentum. As a starting point,
I tried to make connections to my previous work in physics,
particularly in statistical and physical modeling.
This led me to focus on <em>Composite Fields</em> for human pose estimation.</p>
<h2>Past</h2>
<p>In an extremely short time, we created three
exciting projects with novel scientific contributions and state-of-the-art
results on challenging tasks like human pose estimation. These papers were
presented at ICRA, CVPR and ICCV, the top-tier conferences
in robotics and computer vision.</p>
<ul>
<li><strong>Crowd-Robot Interaction</strong>: Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning.<br />
Presented at ICRA2019 in Montreal. Led by Changan Chen and Yuejiang Liu.<br />
<span style="white-space: nowrap"><a href="https://doi.org/10.1109/ICRA.2019.8794134"><i class="fa fa-file"></i> paper</a></span>
<span style="white-space: nowrap"><a href="https://github.com/vita-epfl/crowdnav"><i class="fa fa-github"></i> code</a></span></li>
<li><strong>PifPaf</strong>: Composite Fields for Human Pose Estimation.<br />
Presented at CVPR2019 in Long Beach.<br />
<span style="white-space: nowrap"><a href="http://openaccess.thecvf.com/content_CVPR_2019/html/Kreiss_PifPaf_Composite_Fields_for_Human_Pose_Estimation_CVPR_2019_paper.html"><i class="fa fa-file"></i> paper</a></span>
<span style="white-space: nowrap"><a href="https://github.com/vita-epfl/openpifpaf"><i class="fa fa-github"></i> code</a></span></li>
<li><strong>MonoLoco</strong>: Monocular 3D Pedestrian Localization and Uncertainty Estimation.<br />
Presented at ICCV2019 in Seoul. Led by Lorenzo Bertoni.<br />
<span style="white-space: nowrap"><a href="http://openaccess.thecvf.com/content_ICCV_2019/html/Bertoni_MonoLoco_Monocular_3D_Pedestrian_Localization_and_Uncertainty_Estimation_ICCV_2019_paper.html"><i class="fa fa-file"></i> paper</a></span>
<span style="white-space: nowrap"><a href="https://github.com/vita-epfl/monoloco"><i class="fa fa-github"></i> code</a></span></li>
</ul>
<p>I also presented an overview of our
work at the Autonomous Driving workshop at ICML2019 in Long Beach, at KAIST
in South Korea, and at the European and Swiss transportation
conferences hEART (Budapest) and STRC (Ascona).</p>
<p><a href="https://github.com/vita-epfl/openpifpaf">PifPaf</a> has also attracted interest from
industry. Setting up a commercialization
pipeline with EPFL's Technology Transfer Office was another exciting first.</p>
<p>At EPFL, it is also possible to gain professor-style teaching experience
as a postdoc through co-teaching. Alex and I are currently co-teaching
<em>Deep Learning for Autonomous Vehicles</em> for the second time to master's students.
The course project ends with a race of human-robot pairs: the student runs in front,
and the robot has to track and follow as fast as possible.
It was <a href="https://www.epfl.ch/labs/vita/teaching/">quite an event last year</a>
(<a href="https://www.youtube.com/watch?v=3AnXPqoIfvU">video</a>).
Last semester, we
gave an introductory course in machine learning tailored for engineering students
in the bachelor program.</p>
<h2>Current</h2>
<p>I have three papers under review that extend my
previous work on human pose estimation, as well as joint work on pedestrian localization and metric learning. All of these are very exciting, and I am looking forward to sharing more soon.</p>
<p>Another new process for me was applying for grants. I successfully obtained
a <em>Spark Grant</em> from the Swiss National Science Foundation for the
development of <em>Deep Social Force</em>, which funds the project for one year -- my
first time being a PI on a grant.
I also drafted our lab's contribution to the
<a href="https://www.epfl.ch/research/services/fund-research/funding-opportunities/research-funding/interdisciplinary-seed-fund/isf-granted-projects/">EPFL Interdisciplinary Seed Fund</a>
project to use
<em>Artificial Intelligence for understanding and overcoming motor deficits</em>
in collaboration with two labs at the hospital in Lausanne
that work on Parkinson's disease and spinal cord injury.</p>
<p>I am also looking forward to ICML this year. It is my first time
co-organizing a workshop, the workshop on <em>AI for Autonomous Driving</em>, which
had 400 attendees last year. Luckily, I am surrounded by an experienced team
and I am looking forward to learning. Holding this workshop entirely online will
be a first for everyone, though.</p>
<p>After two years back in academia, I am still not bored. Not at all.
<strong>Thanks VITA team!</strong> Onwards.</p>Artisanal S2 Cells2018-11-28T00:00:00+01:002018-11-28T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-11-28:/blog/artisanal-s2/<p>S2 cells are a powerful tool to index and compute over geospatial data. This short post shows how to create S2 cell ids by hand.</p><p>S2 cells are a great representation for locations on earth and are the storage
format for many popular web services, including Google Maps, Foursquare and Pokémon Go.
I worked on the <a href="https://s2sphere.readthedocs.io">s2sphere</a> Python implementation when I was at Sidewalk Labs. I also wrote some
background notes on the <a href="https://www.sidewalklabs.com/blog/s2-cells-and-space-filling-curves-keys-to-building-better-digital-map-tools-for-cities/">Sidewalk blog</a>.
Recently, I was asked to explain
how to verify manually that an S2 cell id is correct. By hand. From scratch.
Here is my answer for the simplest ids.</p>
<p>Get a feel for the cells and faces of the cube by using the web tools
at <a href="https://s2.sidewalklabs.com">s2.sidewalklabs.com</a>:</p>
<p><img style="width:58.5%" src="/images/s2cell_regioncoverer.png" alt="S2Cell region coverer with example cell tokens" />
<img style="width:39.7%" src="/images/s2cell_globe.png" alt="S2Cell globe" /></p>
<p>The above images show the location of face 0 on earth. Below, the unfolded
cube shows how the space-filling curve spans and connects the faces:</p>
<p><img style="max-height: 15em; display:block; margin:1em auto 2em;" src="/images/s2cell_faces.png" alt="S2Cell cube faces" /></p>
<p>Tokens are the hex form of the cell id with trailing zeros stripped. To recover
the integer, use
<a href="https://s2sphere.readthedocs.io/en/latest/api.html#s2sphere.CellId.from_token"><code>from_token()</code></a>
and print it as binary. The example tokens converted to binary cell ids are:</p>
<div class="highlight"><pre><span></span><code>token "04": 0000010000000000000000000000000000000000000000000000000000000000
token "0c": 0000110000000000000000000000000000000000000000000000000000000000
token "14": 0001010000000000000000000000000000000000000000000000000000000000
token "1c": 0001110000000000000000000000000000000000000000000000000000000000
</code></pre></div>
<p>The binary format from left to right: three bits for the face (here face 0), two bits encoding the level-1 cell on that face, and a terminating 1 bit. This agrees with the docstring of the
<a href="https://s2sphere.readthedocs.io/en/latest/api.html#s2sphere.CellId">CellId</a> class :)</p>
<p>For comparison, the face 1, level 1 cell ids are:</p>
<div class="highlight"><pre><span></span><code><span class="mf">0010010000000000000000000000000000000000000000000000000000000000</span>
<span class="mf">0010110000000000000000000000000000000000000000000000000000000000</span>
<span class="mf">0011010000000000000000000000000000000000000000000000000000000000</span>
<span class="mf">0011110000000000000000000000000000000000000000000000000000000000</span>
</code></pre></div>
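<p>This hand conversion can be sketched in a few lines of plain Python (no <code>s2sphere</code> required); the helper names are made up for illustration:</p>

```python
def token_to_cell_id(token):
    """A token is the hex cell id with trailing zeros stripped; pad it back."""
    return int(token.ljust(16, "0"), 16)


def face_and_position(token):
    """Read off the face (first 3 bits) and level-1 child (next 2 bits)."""
    bits = format(token_to_cell_id(token), "064b")
    return int(bits[:3], 2), int(bits[3:5], 2)


print(format(token_to_cell_id("04"), "064b"))
# 0000010000000000000000000000000000000000000000000000000000000000
print(face_and_position("1c"))  # (0, 3): face 0, last level-1 cell
print(face_and_position("24"))  # (1, 0): face 1, first level-1 cell
```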
<p>Et voilà. Those are eight hand-crafted S2 cell ids.</p>Python at a University Lab2018-11-23T00:00:00+01:002018-11-23T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-11-23:/blog/lab-python/<p>University research labs have a different structure than corporate research labs and their compute setups. In this post, I list useful Python resources for starting in a university machine learning lab.</p><p><img class="img-thumbnail float-right" width=200 src="/images/python-powered-university.png" alt="Python powered University" /></p>
<p><em>I am a post-doc at <a href="https://vita.epfl.ch/">VITA lab at EPFL university</a>.</em></p>
<p>Good software engineering practices are important to me. They are foundational
for open and reproducible science. You -- a current or future
member of a machine learning lab -- are not necessarily expected to be a computer scientist
or software engineer, but you must be familiar with the tools.
Expert-level familiar. You will use these tools most of the day,
and you should not be clumsy with or afraid of them.
Below is an ambitious list, so don't worry if you are not familiar with most of it yet.</p>
<p><strong>Python Core:</strong>
The best way to get started and a great resource for advanced users is
the <a href="https://docs.python.org/3/tutorial/index.html">tutorial on the official Python webpage</a>.
Here are my notes more specific to ML.</p>
<ul>
<li>always use Python 3 now, not Python 2</li>
<li>read Python code of great projects outside of ML (almost no ML projects are written well, see closing remarks):
start with <a href="https://github.com/requests/requests"><code>requests</code></a></li>
<li>Python core and standard libraries (always available, so use them):<ul>
<li><a href="https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions">list comprehensions</a></li>
<li><a href="https://docs.python.org/3/tutorial/classes.html">classes</a>:
when your function becomes too long it might want to be a class</li>
<li><a href="https://docs.python.org/3/tutorial/datastructures.html#sets">sets</a>:
avoid testing whether an element is in a list, test whether it is in a set</li>
<li><a href="https://docs.python.org/3/library/argparse.html">argparse</a>, use a standard library to parse command line arguments</li>
<li><a href="https://docs.python.org/3/library/logging.html#module-logging">logging</a></li>
<li><a href="https://docs.python.org/3/library/collections.html#collections.defaultdict"><code>defaultdict</code></a>
for sparse accumulators</li>
<li>generators and <code>yield</code>, see <a href="https://docs.python.org/3/howto/functional.html">functional programming howto</a></li>
<li>all <a href="https://docs.python.org/3/library/itertools.html"><code>itertools</code></a>
(e.g. <a href="https://docs.python.org/3/library/itertools.html#itertools.chain.from_iterable"><code>itertools.chain.from_iterable</code></a>)</li>
<li><a href="https://docs.python.org/3/library/functools.html#functools.lru_cache"><code>functools.lru_cache</code></a> is a powerful tool</li>
<li><a href="https://docs.python.org/3/library/functions.html#property"><code>property</code> decorator</a> is great for adding caches</li>
<li>more advanced but useful: context managers, decorators</li>
</ul>
</li>
<li>understand stack traces (the basics: the line above is where the line below originated), <a href="https://en.wikipedia.org/wiki/Stack_trace">Wikipedia</a></li>
<li>scientific foundational packages:
<a href="http://www.numpy.org/"><code>numpy</code></a>,
<a href="https://scipy.org/getting-started.html"><code>scipy</code></a>,
<a href="https://scikit-learn.org/stable/"><code>scikit-learn</code></a>
and depending on your focus <a href="https://cython.org/"><code>cython</code></a></li>
</ul>
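<p>A short sketch to make a few of these concrete (set membership, a <code>defaultdict</code> accumulator, <code>itertools.chain.from_iterable</code> and <code>functools.lru_cache</code>); the data is made up:</p>

```python
import itertools
from collections import defaultdict
from functools import lru_cache

words = ["cat", "dog", "cat", "bird", "dog", "cat"]

# sets: constant-time membership tests instead of scanning a list
vocabulary = set(words)
print("cat" in vocabulary)  # True

# defaultdict as a sparse accumulator
counts = defaultdict(int)
for word in words:
    counts[word] += 1
print(counts["cat"])  # 3

# flatten one level of nesting
nested = [[1, 2], [3], [4, 5]]
print(list(itertools.chain.from_iterable(nested)))  # [1, 2, 3, 4, 5]

# lru_cache memoizes pure functions
@lru_cache(maxsize=None)
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)
print(fibonacci(30))  # 832040
```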
<p><strong>Packaging:</strong>
As soon as you need to re-use code in a second file, you will want to import
one of your own files. You should use relative imports, and to use those
properly you need to package your code. Packaging in Python used to be a mess,
but much progress has been made in recent years.
When you search for help related
to packaging, filter for results from the last year.</p>
<ul>
<li><code>pip</code>, I have never used <code>conda</code></li>
<li>relative imports: see <a href="https://docs.python.org/3/tutorial/modules.html">tutorial section on Modules</a></li>
<li><code>setup.py</code>: see <a href="https://setuptools.readthedocs.io/en/latest/setuptools.html#basic-use"><code>setuptools</code> basic use</a></li>
<li><a href="https://docs.python.org/3/tutorial/venv.html"><code>venv</code></a>: I never do anything outside a virtual environment</li>
<li><a href="https://github.com/jupyter/notebook">Jupyter notebooks</a>:
great for demos and log books. <em>Never</em> for writing code.</li>
<li><a href="https://mybinder.org/">mybinder.org</a> to make your Jupyter demos interactive</li>
</ul>
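<p>Relative imports only work from inside a package. The following self-contained sketch (the package and module names are invented) writes a two-module package to a temporary directory and imports across it:</p>

```python
import os
import sys
import tempfile

# layout:  mylab/__init__.py, mylab/metrics.py, mylab/train.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mylab")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "metrics.py"), "w") as f:
    f.write("def accuracy(correct, total):\n    return correct / total\n")
with open(os.path.join(pkg, "train.py"), "w") as f:
    # the leading dot refers to the containing package "mylab"
    f.write("from .metrics import accuracy\n"
            "def report():\n"
            "    return accuracy(9, 10)\n")

sys.path.insert(0, root)
from mylab.train import report  # the relative import resolves inside the package
print(report())  # 0.9
```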
<p><strong>Style / Unit Tests / Continuous Testing:</strong>
Standard practices for software engineers are useful for ML projects, too.
Beyond testing code, unit tests in combination with continuous integration give
a robust and reproducible starting point for anyone picking up your project
(including yourself in one year).</p>
<ul>
<li><a href="https://www.python.org/dev/peps/pep-0008/">PEP8</a> is the <em>Style Guide for Python</em> and contains explanations. Don't skip <a href="https://www.python.org/dev/peps/pep-0008/#a-foolish-consistency-is-the-hobgoblin-of-little-minds"><em>A Foolish Consistency is the Hobgoblin of Little Minds</em></a>. Still, I follow almost all of PEP8 to the letter. Additional:<ul>
<li>no abbreviations for variables, classes, functions, etc.</li>
<li>do not iterate over indices in Python: don't do <code>for i in range(len(mylist))</code></li>
</ul>
</li>
<li><a href="https://www.pylint.org/"><code>pylint</code></a> generally provides good advice beyond checking for PEP8</li>
<li><a href="https://docs.pytest.org/en/latest/"><code>pytest</code></a> to run all your tests</li>
<li><a href="https://circleci.com/">CircleCI</a> and <a href="https://travis-ci.org/">TravisCI</a> can automatically run your tests on every commit to git</li>
</ul>
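<p>For the last point, an illustration of the alternatives to iterating over indices (the variable names are just examples):</p>

```python
names = ["anna", "ben", "cara"]
scores = [0.9, 0.7, 0.8]

# avoid: for i in range(len(names)): use(names[i], scores[i])

# iterate over elements directly, pairing lists with zip
pairs = [(name, score) for name, score in zip(names, scores)]
print(pairs)  # [('anna', 0.9), ('ben', 0.7), ('cara', 0.8)]

# when the position is genuinely needed, use enumerate
indexed = [(i, name) for i, name in enumerate(names)]
print(indexed)  # [(0, 'anna'), (1, 'ben'), (2, 'cara')]
```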
<p><strong>Red Flags:</strong>
When browsing open source code, this is when I get worried.</p>
<ul>
<li>no eval metrics/scripts for that implementation</li>
<li>have to change the <code>PATH</code> or <code>PYTHONPATH</code> variables</li>
<li>have to copy or symlink folders with code</li>
</ul>
<p><strong>Closing:</strong>
Infrastructure at large companies is different from a university lab and
your laptop: containers, distributed storage, build systems, custom DNS,
mono-repositories. A small piece from the middle of that stack is open-sourced
and duct-taped to work without the rest of the infrastructure.
Don't blindly follow what seems like a standard practice of the best software
engineers in the world. It might be an artifact of porting to the
open source world.</p>
<p>Feedback is welcome on Twitter @svenkreiss.</p>AncientML 12018-03-05T00:00:00+01:002018-03-05T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-03-05:/blog/ancientml-2018-03/<p>AncientML is a series of paper reading notes. This first edition covers the first mention of AI and the Mathematical Theory of Communication.</p><p><img class="img-thumbnail float-right" src="/images/ancientml-logo.png" width="300" alt="AncientML Logo" /></p>
<p>AncientML is a series of paper reading notes. The purpose is to review
outstanding contributions that were formative for machine learning
as an academic field.</p>
<div style="clear:both"></div>
<h2>A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955 <a href='#mccarthy2006proposal' id='ref-mccarthy2006proposal-1'>(McCarthy et al., 2006)</a>, <a href="https://www.aaai.org/ojs/index.php/aimagazine/article/download/1904/1802">PDF</a></h2>
<ul>
<li>The paper/event that gets credited with the foundation of the field of Artificial Intelligence research.</li>
<li>The paper is three pages long and the authors include Claude Shannon.</li>
<li>scale of the proposed project: 2 months, 10 men</li>
<li>focused on language, abstraction and concepts</li>
<li>identifies seven areas to improve: Automatic Computers, How Can a Computer be
Programmed to Use a Language, Neuron Nets, Theory of the Size of a Calculation,
Self-Improvement, Abstractions, Randomness and Creativity</li>
<li>"the major obstacle is not lack of machine capacity, but our inability to write programs"</li>
<li>There is a Wikipedia article on the <a href="https://en.wikipedia.org/wiki/Dartmouth_workshop">Dartmouth workshop</a>.</li>
<li>102 pages of Ray Solomonoff's
<a href="http://raysolomonoff.com/dartmouth/notebook/notebook.html">handwritten notes</a>,
including some doodles on page 3.</li>
</ul>
<h2>The Mathematical Theory of Communication <a href='#shannon1951mathematical' id='ref-shannon1951mathematical-1'>(Shannon et al., 1951)</a>, <a href="http://pubman.mpdl.mpg.de/pubman/item/escidoc:2383164/component/escidoc:2383163/Shannon_Weaver_1949_Mathematical.pdf">PDF</a></h2>
<ul>
<li>Central paper for many fields. 90 pages (skip the part by Weaver).</li>
<li><em>The Idea Factory</em> <a href='#ideafactory2012gertner' id='ref-ideafactory2012gertner-1'>(Gertner, 2012)</a> is a book about Bell Labs around that time.</li>
<li><a href='#khinchin1957mathematical' id='ref-khinchin1957mathematical-1'>Khinchin (1957)</a> is a book that discusses this paper.</li>
<li>p.49: <em>information</em> is not attached to a particular message but to the amount of
freedom of choice</li>
<li>p.49: "decomposition of choice" is a beautiful requirement for <span class="math">\(H\)</span>, and leads with
the other two requirements to a unique form for <span class="math">\(H\)</span></li>
<li>p.50: simple example to visualize the connection between probability of a message and information is shown in the figure below</li>
<li>p.53: origin for terms of the form <span class="math">\(p_i\log{}p_i\)</span></li>
<li>p.56: relative entropy, maximum possible compression, redundancy</li>
<li>p.70: capacity of a noisy channel; includes a <code>max()</code> over all possible information sources</li>
</ul><hr>
<h2>Bibliography</h2>
<p id='ideafactory2012gertner'>Jon Gertner.
<em><span class="bibtex-protected">The Idea Factory: Bell Labs and the great age of American innovation</span></em>.
Penguin Press, New York, 2012.
ISBN 978-0143122791. <a class="cite-backref" href="#ref-ideafactory2012gertner-1" title="Jump back to reference 1">↩</a></p>
<p id='khinchin1957mathematical'>A Khinchin.
<em>Mathematical foundations of information theory</em>.
Dover Publications, New York, 1957.
ISBN 978-0486604343. <a class="cite-backref" href="#ref-khinchin1957mathematical-1" title="Jump back to reference 1">↩</a></p>
<p id='mccarthy2006proposal'>John McCarthy, Marvin L Minsky, Nathaniel Rochester, and Claude E Shannon.
<span class="bibtex-protected">A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, August 31, 1955</span>.
<em>AI magazine</em>, 27(4):12, 2006. <a class="cite-backref" href="#ref-mccarthy2006proposal-1" title="Jump back to reference 1">↩</a></p>
<p id='shannon1951mathematical'>Claude E Shannon, Warren Weaver, and Arthur W Burks.
<span class="bibtex-protected">The Mathematical Theory of Communication</span>.
<em>The University of Illinois Press</em>, 1951. <a class="cite-backref" href="#ref-shannon1951mathematical-1" title="Jump back to reference 1">↩</a></p>
pelican-jsmath Plugin2018-02-18T00:00:00+01:002018-02-18T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-02-18:/blog/pelican-jsmath/<p>An <span class="math">\(\alpha\omega\epsilon s \sigma m \epsilon\)</span> Pelican plugin to render math in JavaScript libraries like KaTeX. This plugin makes sure that equations are preserved in the Markdown and Restructured Text parsers and get reproduced properly in HTML for a JavaScript renderer to process.</p><p>The new plugin is <span class="math">\(\alpha\omega\epsilon s \sigma m \epsilon\)</span>, particularly
in combination with KaTeX which is used here. It has good support for big
equations: </p>
<div class="math">$$E=mc^2$$</div>
<p>
A vector <span class="math">\(\vec{a}\)</span> looks beautiful. Writing
order of magnitudes with <span class="math">\(\mathcal{O}(n)\)</span> is pretty. There was a related
<a href="https://github.com/getpelican/pelican-plugins/issues/625">Pelican issue</a>
for support for KaTeX.</p>
<p>The plugin is packaged and can be installed with <code>pip install pelican-jsmath</code>
(with a dash) and then added to Pelican in <code>pelicanconf.py</code> by adding
<code>'pelican_jsmath'</code> (with an underscore) to your <code>PLUGINS</code> list. See the
<a href="https://github.com/svenkreiss/pelican-jsmath">Readme</a> for more details.</p>
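<p>In <code>pelicanconf.py</code>, this looks like the following (a minimal sketch; your <code>PLUGINS</code> list may contain other entries):</p>

```python
# pelicanconf.py (sketch): note the underscore in the module name,
# even though the pip package name uses a dash
PLUGINS = [
    'pelican_jsmath',
]
```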
<h2>Packaging Pelican Plugins</h2>
<p>It is great that Pelican supports plugins installed via <code>pip</code> from outside the
plugins directory. It gives the plugin author and user more control over the
plugin version. That is why I wanted to document the steps I took to make
<code>pelican-jsmath</code> a Python package.</p>
<p>The simplest <code>setup.py</code> file:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">setuptools</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="n">setup</span><span class="p">(</span>
<span class="n">name</span><span class="o">=</span><span class="s1">'pelican-jsmath'</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s1">'0.1.0'</span><span class="p">,</span>
<span class="n">description</span><span class="o">=</span><span class="s1">'Pelican Plugin that passes math to JavaScript.'</span><span class="p">,</span>
<span class="n">url</span><span class="o">=</span><span class="s1">'http://github.com/svenkreiss/pelican-jsmath'</span><span class="p">,</span>
<span class="n">author</span><span class="o">=</span><span class="s1">'Sven Kreiss'</span><span class="p">,</span>
<span class="n">author_email</span><span class="o">=</span><span class="s1">'me@svenkreiss.com'</span><span class="p">,</span>
<span class="n">license</span><span class="o">=</span><span class="s1">'AGPL-3.0'</span><span class="p">,</span>
<span class="n">packages</span><span class="o">=</span><span class="p">[</span><span class="s1">'pelican_jsmath'</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div>
<p>If you are converting a plugin from the pelican-plugins repository, move your files
into a folder, here <code>pelican_jsmath</code>, and add a <code>setup.py</code> file. That's it.
You can submit it to pypi if you want, but you can also tell people to install
directly from Github using
<code>pip install https://github.com/svenkreiss/pelican-jsmath/zipball/master</code>.</p>
<p>With packaged plugins, you can manage your dependencies in your
<code>requirements.txt</code> as usual.</p>
<h2>Testing Packaged Plugins</h2>
<p>This Pelican plugin includes a plugin for the Python Markdown parser that
modifies the HTML output. It is good to test that this plugin produces valid HTML.
The repository includes an example Pelican site which is regenerated on every
commit and validated with
<a href="https://github.com/svenkreiss/html5validator">html5validator</a>.</p>
<h2>Sample</h2>
<p><img class="image-process-crisp top" alt="pelican-jsmath sample" src="/images/pelican_jsmath_sample.png" /></p>Pelican 20182018-02-10T00:00:00+01:002018-02-10T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2018-02-10:/blog/pelican-2018/<p>Updates to Pelican and this blog. This is a summary of theme changes, a list of my favorite plugins, and a summary of plugins that I updated to improve this website. It also contains a short discussion of the pelican-plugins repository and its potential consequences for the lacking popularity of individual plugins.</p><p><img class="image-process-crisp top" alt="screenshot of this blog" src="/images/pelican_screenshot_Feb2018.png" /></p>
<p>Inspired by Fred Wilson's post about <a href="http://avc.com/2018/01/owning-yourself/">Owning Yourself</a>,
I revived my Pelican blog.
All my website's
<a href="https://github.com/svenkreiss/svenkreiss.github.io/tree/pelican">configuration and source files</a>
are public as well as
<a href="https://github.com/svenkreiss/pure">my modifications to the pure theme</a>.
Reasons for Pelican compared to hosted solutions:</p>
<ul>
<li>content under version control in git</li>
<li>my usual text editor for content creation</li>
<li>fully owning content and its exact presentation</li>
<li>complete customizability, therefore I want Python, therefore Pelican</li>
</ul>
<p>A cost is that I have to contribute some changes myself:</p>
<ul>
<li><a href="https://github.com/svenkreiss/pure">customized Pure theme</a>: print mode, less author mentions, responsive resizing for mobile, using the pygments <code>friendly</code> style for code highlighting</li>
<li><code>pelican_jsmath</code>: using KaTeX: nothing existed to combine it with Pelican and so I created
<a href="https://github.com/svenkreiss/pelican-jsmath">pelican-jsmath</a>.
The new plugin is <span class="math">\(\alpha\omega\epsilon s \sigma m \epsilon\)</span> and described in a separate
<a href="https://www.svenkreiss.com/blog/pelican-jsmath/">blog post</a>.</li>
<li><code>pelican-cite</code>: create a nicely formatted Bibliography from a bibtex file. I created <a href="https://github.com/cmacmackin/pelican-cite/pull/5">PR#5</a> so that it also works on draft pages.</li>
<li><code>image_process</code>: Responsive images (smaller images on smaller devices) are
especially important for the <a href="/projects.html">projects</a> page and on article index page.
The plugin works on all generated files just before they are written, so you can use it everywhere in your theme as well.</li>
<li><code>pelican-advance-embed-tweet</code>: submitted <a href="https://github.com/fundor333/pelican-advance-embed-tweet/pull/2">PR#2</a> to remove the align attribute from <code>&lt;blockquote&gt;</code>, which is not HTML5; you can instead set <code>TWITTER_ALIGN = 'center'</code> in your <code>pelicanconf</code> to center the embedded tweet.</li>
<li><code>gravatar</code>: request higher resolution Gravatar images by adding <code>?s=140</code> to the image url in the theme</li>
<li><code>related_posts</code>: newly added to this blog</li>
<li><code>representative_image</code>: automatically extract an image from an article and use it in article list with <code>image_process</code> to thumbnail</li>
<li><code>pelican_dynamic</code>: adds options to add per article custom <code>css</code> and <code>js</code></li>
</ul>
<p>Open and in-progress issues:</p>
<ul>
<li>hack of the day: make your default status <code>draft</code> (generally a good idea), but then set the default status in <code>publishconf.py</code> to <code>hidden</code>. When trying to publish, <code>hidden</code> will create an error and the file will be skipped, so the article won't exist on the web at all until its status is set to <code>draft</code>. Problem: the <code>OUTPUTDIR</code> in the Makefile has to be split so that two different directories are used for <code>make devserver</code> and <code>make publish</code> (filed <a href="https://github.com/getpelican/pelican/issues/2284">issue#2284</a> against Pelican; see <a href="https://github.com/svenkreiss/svenkreiss.github.io/blob/pelican/Makefile">Makefile</a>).</li>
<li>Have to wrap Tweet embeds in <code>&lt;div&gt;</code> to avoid Markdown's <code>&lt;p&gt;</code> tag, because the embedded tweet includes a <code>&lt;blockquote&gt;</code> which cannot appear inside a <code>&lt;p&gt;</code> tag.</li>
</ul>
<p>Always validate html with <a href="https://github.com/svenkreiss/html5validator">html5validator</a>
and check links with the <a href="https://validator.w3.org/checklink">W3C Link Checker</a>.</p>
<h2>Pelican Plugins</h2>
<p>Most Pelican Plugins are distributed through the central
<a href="https://github.com/getpelican/pelican-plugins">pelican-plugins</a> repository.
The contributed plugins receive fairly little recognition: <code>render_math</code> has 65 stars and
<code>image_process</code> has 10 stars. Issues against individual plugins are filed
in the central repository rather than against the individual plugins, and updates to
plugins have to wait for inclusion in the central repository before they reach
users.</p>
<!-- <div>@svenkreiss/status/960716731059785730</div> -->
<p>Python provides its standard mechanism with <code>pip</code>, <code>setuptools</code>, <code>requirements.txt</code>, etc.
to manage dependencies. Plugins can be written to support <code>pip</code> and Pelican
does support importable plugins. This also allows unit tests and continuous
integration to ensure the quality of the plugin.
This is the method I chose for <a href="https://github.com/svenkreiss/pelican-jsmath">pelican-jsmath</a>.</p>Stream Processing in pysparkling2017-03-11T00:00:00+01:002017-03-11T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2017-03-11:/blog/streamprocessing-in-pysparkling/<p>pysparkling now supports stream processing with discrete streams, called DStream. This post shows a simple example that uses this new API.</p><p><code>pysparkling</code> is a native Python implementation of PySpark. Stream processing
is considered to be one of the most important features of Spark. PySpark
provides a Python interface to Spark’s
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.streaming.html">StreamingContext</a>
and supports consuming from updating HDFS folders and TCP sockets and provides
interfaces to Kafka, Kinesis, Flume and MQTT. Initial support for stream
processing from folders and TCP sockets is in <code>pysparkling 0.4.0</code> which you can
now install with:</p>
<div class="highlight"><pre><span></span><code>pip install --upgrade pysparkling
</code></pre></div>
<h2>Counting Example</h2>
<p>In the normal batch processing way, you can count elements with:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">pysparkling</span> <span class="kn">import</span> <span class="n">Context</span>
<span class="o">>>></span> <span class="n">Context</span><span class="p">()</span><span class="o">.</span><span class="n">parallelize</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="mi">3</span>
</code></pre></div>
<p>This is similar for stream processing. Incoming data is batched every
0.1 seconds (the batch interval — the second parameter to <code>StreamingContext</code>) and
elements are counted in 0.2 second windows, i.e. two batch intervals, which
returns the count of the first batch, the count of the first and second batch
and the count of the second and third batch:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">pysparkling</span>
<span class="o">>>></span> <span class="n">sc</span> <span class="o">=</span> <span class="n">pysparkling</span><span class="o">.</span><span class="n">Context</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">ssc</span> <span class="o">=</span> <span class="n">pysparkling</span><span class="o">.</span><span class="n">streaming</span><span class="o">.</span><span class="n">StreamingContext</span><span class="p">(</span><span class="n">sc</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
<span class="o">>>></span> <span class="p">(</span>
<span class="o">...</span> <span class="n">ssc</span>
<span class="o">...</span> <span class="o">.</span><span class="n">queueStream</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]])</span>
<span class="o">...</span> <span class="o">.</span><span class="n">countByWindow</span><span class="p">(</span><span class="mf">0.2</span><span class="p">)</span>
<span class="o">...</span> <span class="o">.</span><span class="n">foreachRDD</span><span class="p">(</span><span class="k">lambda</span> <span class="n">rdd</span><span class="p">:</span> <span class="nb">print</span><span class="p">(</span><span class="n">rdd</span><span class="o">.</span><span class="n">collect</span><span class="p">()))</span>
<span class="o">...</span> <span class="p">)</span>
<span class="o">>>></span> <span class="n">ssc</span><span class="o">.</span><span class="n">start</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">ssc</span><span class="o">.</span><span class="n">awaitTermination</span><span class="p">(</span><span class="mf">0.35</span><span class="p">)</span>
<span class="p">[</span><span class="mi">3</span><span class="p">]</span>
<span class="p">[</span><span class="mi">7</span><span class="p">]</span>
<span class="p">[</span><span class="mi">6</span><span class="p">]</span>
</code></pre></div>
<p>Other new features apart from the <code>pysparkling.streaming</code> module are an
improved <code>pysparkling.fileio</code> module, methods for reading binary files
(<a href="http://pysparkling.trivial.io/en/latest/api_context.html#pysparkling.Context.binaryFiles">binaryFiles()</a> and
<a href="http://pysparkling.trivial.io/en/latest/api_context.html#pysparkling.Context.binaryRecords">binaryRecords()</a>)
and more inline examples in the documentation.</p>
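<p><code>binaryRecords()</code> splits a file into fixed-length records. The underlying chunking can be sketched with the standard library alone (the <code>read_binary_records</code> helper below is hypothetical, not pysparkling's implementation):</p>

```python
import io
import struct

def read_binary_records(stream, record_length):
    # Yield fixed-length byte records, mirroring what binaryRecords()
    # does for a file; a trailing partial record is dropped.
    while True:
        record = stream.read(record_length)
        if len(record) < record_length:
            return
        yield record

# Three 8-byte records, each holding two big-endian 32-bit integers.
data = b''.join(struct.pack('>ii', a, b) for a, b in [(1, 2), (3, 4), (5, 6)])
records = [struct.unpack('>ii', r)
           for r in read_binary_records(io.BytesIO(data), 8)]
print(records)  # [(1, 2), (3, 4), (5, 6)]
```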
<p>Head over to the <a href="http://pysparkling.trivial.io/en/latest/api_rdd.html">RDD (batch datasets)</a> and
<a href="http://pysparkling.trivial.io/en/latest/api_streaming.html#dstream">DStream (discrete stream)</a>
documentation to learn more!</p>
<p><a href="http://pysparkling.trivial.io"><img alt="API documentation at pysparkling.trivial.io" src="/images/pysparkling_streaming_doc.png"></a></p>Decouple2016-12-31T00:00:00+01:002016-12-31T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2016-12-31:/blog/decouple/<p>Decoupling theoretical uncertainties from measurements of the Higgs boson.</p><p>The paper <em>Decoupling theoretical uncertainties from measurements of the Higgs boson</em> <a href='#decouple' id='ref-decouple-1'>(Cranmer et al., 2015)</a> by Kyle Cranmer, David Lopez-Val, Tilman Plehn and me is now published in <a href="http://journals.aps.org/prd/abstract/10.1103/PhysRevD.91.054032">Phys Rev D91</a> and also available at <a href="https://arxiv.org/abs/1401.0080">arXiv:1401.0080</a>.</p>
<blockquote>
<p>We develop a technique to present Higgs coupling measurements, which decouple the poorly defined theoretical uncertainties associated to inclusive and exclusive cross section predictions. The technique simplifies the combination of multiple measurements and can be used in a more general setting. We illustrate the approach with toy LHC Higgs coupling measurements and a collection of new physics models.</p>
</blockquote>
<p>We share all of the software involved. The model files and a Mathematica notebook are published on <a href="http://figshare.com/articles/Supplementary_Material_for_A_Novel_Approach_to_Higgs_Coupling_Measurements_/888607">figshare</a>, and all the code is on <a href="http://github.com/svenkreiss/decouple">github</a>, where it can be used for new models or to reproduce every plot in the paper simply by typing <code>make</code>. There is even a little <a href="http://github.com/svenkreiss/decoupledDemo">demo project</a> that pulls three pre-made decoupled models from a webpage, recouples them and produces two plots of combined benchmark coupling models.</p>
<p><img src="/images/decouple.png" alt="Decouple part of figure 4" /></p><hr>
<h2>Bibliography</h2>
<p id='decouple'>Kyle Cranmer, Sven Kreiss, David Lopez-Val, and Tilman Plehn.
<span class="bibtex-protected">Decoupling Theoretical Uncertainties from Measurements of the Higgs Boson</span>.
<em>Phys. Rev.</em>, D91(5):054032, 2015.
<a href="https://arxiv.org/abs/1401.0080">arXiv:1401.0080</a>, <a href="https://doi.org/10.1103/PhysRevD.91.054032">doi:10.1103/PhysRevD.91.054032</a>. <a class="cite-backref" href="#ref-decouple-1" title="Jump back to reference 1">↩</a></p>
Databench v0.42016-09-22T00:00:00+02:002016-09-22T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2016-09-22:/blog/databench-v04/<p>New release of Databench that switches the backend from Flask to Tornado, fully supports Python 2 and 3, transpiles ES6 to legacy JavaScript and runs unit tests and coverage on every commit.</p><p><img class="image-process-crisp" src="/images/databench_examples.png" alt="screenshot of index page for examples" /></p>
<p>Databench v0.4 is <a href="https://github.com/svenkreiss/databench/releases/tag/v0.4.0">released</a>.
It is a major change from the v0.3 branch. All <a href="http://databench.trivial.io/">documentation</a>,
<a href="https://github.com/svenkreiss/databench_examples">examples</a> and
<a href="http://databench-examples.trivial.io/">demos</a> are updated.
Install the new version with</p>
<div class="highlight"><pre><span></span><code>pip install --upgrade databench
</code></pre></div>
<p>Here are the highlights:</p>
<ul>
<li>Migrated from <strong>Flask to Tornado</strong>, and with that switched from Jinja2 templates to Tornado templates.</li>
<li>With that new backend, <strong>Python 2.7, 3.4 and 3.5</strong> are supported.</li>
<li>The previous version had many dependencies; a major goal of the refactor was to reduce them. This version only depends on <strong>tornado, pyyaml and pyzmq</strong>. Markdown and docutils, which support <em>md</em> and <em>rst</em> readme files, are optional.</li>
<li>A <strong>Datastore</strong> was added. This concept encourages a consistent pattern for state that works with multiple threads and languages (see the new part of the documentation on <a href="http://databench.trivial.io/en/latest/quickstart.html#data-flow">data flow</a>).</li>
<li>Front-end code in <strong>ES6</strong> that is transpiled to legacy JavaScript. For analysis code, there is also built-in support for <a href="http://databench.trivial.io/en/latest/frontend.html#node-modules">node_modules</a>.</li>
<li><strong>Unit tests run automatically</strong> on every commit, always for Python 2.7, 3.4 and 3.5, and the <strong>documentation is built and updated</strong> at the same time. Test coverage is tracked continuously and currently stands at 95%.</li>
</ul>
<p>If you want to jump right in, start with the
<a href="http://databench.trivial.io/">documentation</a> and have a look at some
<a href="https://github.com/svenkreiss/databench_examples">examples</a>.</p>word2vec on Databricks2015-12-22T00:00:00+01:002015-12-22T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2015-12-22:/blog/word2vec-on-databricks/<p>Running word2vec on Databricks. A full example of using gensim and distributed maps with Spark to run this Python analysis on Databricks.</p><p>Word2vec is an interesting approach to convert a word into a feature vector
(<a href="https://code.google.com/p/word2vec/">original C code</a> by Mikolov et al).
One of the observations in the original paper was that words with similar
meaning have a smaller cosine distance than dissimilar words. Here is a
histogram of the pairwise cosine distances of about 500 media topics
(derived from <a href="http://cv.iptc.org/newscodes/mediatopic/">IPTC news codes</a>):</p>
<p><img class="image-proces-crisp" src="/images/word2vec_angle.png" alt="distribution of Cosine distances of word vectors" /></p>
<p>Cosine distance is defined as <code>1 - cos(angle(vector1, vector2))</code>. Most of the vector
pairs have angles only slightly smaller than 90°, which makes sense as more
topics are unrelated to each other than related. The closest 5% of vector
pairs are still separated by angles of up to 73°. The smallest angular separation
is 18°, between <em>breaststroke</em> and <em>backstroke</em>, and the second smallest
is 27°, between <em>triple_jump</em> and <em>pole_vault</em>.</p>
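<p>For reference, the definition in code with toy two-dimensional vectors (standard library only; a sketch, not the analysis code behind the plot above):</p>

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity of two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def angle_degrees(u, v):
    # recover the angular separation from the cosine distance
    return math.degrees(math.acos(1.0 - cosine_distance(u, v)))

print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal vectors
print(angle_degrees([1.0, 0.0], [1.0, 1.0]))    # roughly 45 degrees
```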
<p>To visualize these topics below, the 300-dimensional word vectors are embedded
in two dimensions using t-SNE. Edges between the topics with the smallest 5% in
cosine distance in the original space are drawn in orange.</p>
<p><img class="image-proces-crisp" src="/images/word2vec_tsne.png" alt="tsne of word vectors" /></p>
<p>Similar topics are indeed close together. However, one could argue that <em>imports</em>
is the opposite of <em>exports</em> and that the two therefore should not be close together; but they
are (at the bottom). Similarly, <em>employment</em> is close to <em>unemployment</em>. This is
not how a person would think about “similarity” in this context, but it makes
sense given the skip-gram training of the word vectors: a neural network tries
to predict a word (here a topic) given a window of surrounding words. These
topics would indeed appear in news articles with similar words surrounding
them. It is important to keep this subtlety in mind when building tools on
top of word2vec.</p>
<h2>Using word2vec on Databricks</h2>
<p>Spark and MLlib come with a built-in implementation of word2vec. However, we
also want to apply word2vec in stand-alone Python and therefore chose the
<code>gensim</code> implementation.</p>
<p>We use Databricks to process a large number of documents (not to train
word2vec, but to apply it). We create a "Mapper Tool", distributed as a
Python <code>egg</code>, that converts text to word vectors. The tool reads
in previously created word vectors from a compressed binary file that is larger
than 1GB, which takes about a minute.</p>
<p>There are two ingredients that we need: a large binary input file available at
all worker nodes and a way to cache the word vectors in memory across map
operations.</p>
<div class="highlight"><pre><span></span><code><span class="n">dbutils</span><span class="o">.</span><span class="n">fs</span><span class="o">.</span><span class="n">mount</span><span class="p">(</span><span class="s1">'s3n://your_bucket/some_folder'</span><span class="p">,</span> <span class="s1">'/mnt/some_folder'</span><span class="p">)</span>
</code></pre></div>
<p>The default scheme is <code>dbfs:/</code>, not <code>file:/</code>, which means that this S3 folder
is now available in <code>dbfs</code>. <code>dbutils</code> can copy data from <code>dbfs</code> to the local
file system, but only on the driver. On worker instances, <code>dbutils</code> is not
available. However, <code>dbfs</code> is mounted using FUSE at <code>file:/dbfs</code>
(<a href="https://forums.databricks.com/answers/2966/view.html">Databricks Forum post</a>)
and we can use the local file path <code>/dbfs/mnt/some_folder/word2vec_file.bin.gz</code>
on the driver and the workers.</p>
<h2>Mapper Tool</h2>
<p>The tool is a wrapper around the word2vec implementation in the Python package
<a href="https://radimrehurek.com/gensim/models/word2vec.html">gensim</a>,
<code>gensim.models.Word2Vec</code>. We want an in-memory cache that persists across
map operations. Python class variables are not serialized when an
instance of the class is serialized, so we can use one as a process-wide cache.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">gensim.models.word2vec</span> <span class="kn">import</span> <span class="n">Word2Vec</span>
<span class="k">class</span> <span class="nc">Tool</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="n">cache</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filename</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fn</span> <span class="o">=</span> <span class="n">filename</span>
<span class="nd">@property</span>
<span class="k">def</span> <span class="nf">word2vec</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">fn</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">Tool</span><span class="o">.</span><span class="n">cache</span><span class="p">:</span>
<span class="n">Tool</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">fn</span><span class="p">]</span> <span class="o">=</span> \
<span class="n">Word2Vec</span><span class="o">.</span><span class="n">load_word2vec_format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fn</span><span class="p">,</span> <span class="n">binary</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">return</span> <span class="n">Tool</span><span class="o">.</span><span class="n">cache</span><span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">fn</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">map</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">word</span><span class="p">):</span>
<span class="k">if</span> <span class="n">word</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">word2vec</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">None</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">word2vec</span><span class="p">[</span><span class="n">word</span><span class="p">]</span>
</code></pre></div>
<p>You can use this tool on the driver or in a map function that gets shipped to
the workers. The call to <code>load_word2vec_format()</code> is expensive, but in this
design it is executed only once in each process.</p>
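<p>The class-variable trick can be demonstrated without gensim: only the instance's own attributes are part of its serialized state, so cached data stays behind in the process. A minimal sketch with a hypothetical <code>Memo</code> class:</p>

```python
class Memo(object):
    cache = {}  # class variable: shared by every instance in the process

    def __init__(self, key):
        self.key = key

    def value(self):
        if self.key not in Memo.cache:
            Memo.cache[self.key] = 'expensive result for %s' % self.key
        return Memo.cache[self.key]

m = Memo('a')
m.value()          # the first call populates the process-wide cache
print(vars(m))     # {'key': 'a'}: only this instance state would be serialized
print(Memo('a').value())  # a fresh instance reuses the cached value
```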
<p>Example application:</p>
<div class="highlight"><pre><span></span><code><span class="n">filename</span> <span class="o">=</span> <span class="s1">'/dbfs/mnt/some_folder/word2vec_file.bin.gz'</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'Some'</span><span class="p">,</span> <span class="s1">'sentence'</span><span class="p">,</span> <span class="s1">'as'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">]</span>
<span class="n">sc</span><span class="o">.</span><span class="n">parallelize</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">Tool</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span><span class="o">.</span><span class="n">map</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span>
</code></pre></div>
<p>There are of course other ways to accomplish this, but I wanted to share the
method that works well for us.</p>
<h2>Summary</h2>
<p>This post gave an introduction to word2vec and showed how to distribute a large
input file to worker nodes on Databricks. It also showed how to create a Mapper
Tool that can cache input data across map jobs in memory.</p>
<p>During this work, I submitted two pull requests to <code>gensim</code>
<a href="https://github.com/piskvorky/gensim/pull/545">#545</a> and
<a href="https://github.com/piskvorky/gensim/pull/555">#555</a> which are merged into the
master branch. With the next release, <code>load_word2vec_format()</code> will be faster.</p>Parallel Processing with pysparkling2015-12-04T00:00:00+01:002015-12-04T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2015-12-04:/blog/parallel-processing-with-pysparkling/<p>Benchmarks for the latest parallel features in pysparkling. It shows good scaling for processing with multiple CPU cores. The example contains only a simple computation which shows that hyperthreading is not very effective in this case.</p><p><code>pysparkling</code> is a pure Python implementation of Apache Spark's RDD interface.
That means you can do <code>pip install pysparkling</code> and start running Spark code in
Python. Its main use is in low-latency applications where Spark operations
are applied to small datasets. However, <code>pysparkling</code> also supports the
parallelization of <code>map</code> operations through <code>multiprocessing</code>, <code>ipcluster</code> and
<code>concurrent.futures</code>. This feature is still in development, but this post
explores what is already possible. Fixes for bottlenecks found while
writing this post are included in version 0.3.10.</p>
<h2>Benchmark</h2>
<p>I wanted a CPU-bound benchmark to measure the overhead of object
serialization relative to actual computation. The benchmark function
is a Monte Carlo simulation that estimates the number <span class="math">\(\pi\)</span>. It generates
two uniformly distributed random numbers x and y, each between 0 and 1, and
checks whether <span class="math">\(x^2 + y^2 < 1\)</span>. The fraction of tries that satisfy this
condition approximates <span class="math">\(\pi/4\)</span>.</p>
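<p>The described computation looks roughly like this (a sketch of the idea, not the exact benchmark code from the test suite):</p>

```python
import random

def approximate_pi(n):
    # The fraction of uniform random points in the unit square that
    # fall inside the quarter circle approximates pi/4.
    hits = sum(
        1 for _ in range(n)
        if random.random() ** 2 + random.random() ** 2 < 1.0
    )
    return 4.0 * hits / n

print(approximate_pi(100000))  # close to 3.14
```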
<p>To understand the process better, I instrumented the job execution with timers.
Cumulative times for steps like data deserialization on the worker
nodes are aggregated in the <code>Context._stats</code> variable.</p>
<p>A few problems became apparent:</p>
<ul>
<li>The function that is applied in a map operation is the same for all
partitions of the dataset. In the previous implementation, this function
was serialized separately for every chunk of the data.</li>
<li>Through a nested dependency, all partitions of the data were sent to all
the workers. Now only the partition that a given worker processes is sent to it.</li>
<li>Another slowdown was that core pysparkling functions were not pickle-able.
That is not a problem for <code>cloudpickle</code>, but serializing and deserializing
non-pickle-able functions takes longer. The <code>map()</code> and <code>wholeTextFiles()</code>
methods have pickle-able helpers now.</li>
</ul>
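<p>The last point can be illustrated in isolation: the standard <code>pickle</code> module serializes functions by reference, so a lambda cannot be shipped, while an equivalent callable built from importable pieces can. A small sketch (not pysparkling's actual helpers):</p>

```python
import pickle
from functools import partial
from operator import add

# A lambda cannot be pickled by the stdlib pickle module:
try:
    pickle.dumps(lambda x: x + 5)
except Exception as e:
    print('lambda failed:', type(e).__name__)

# A pickle-able equivalent built from importable functions:
add5 = pickle.loads(pickle.dumps(partial(add, 5)))
print(add5(3))  # 8
```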
<h2>Results</h2>
<p>The test was run on a 4-core Intel i5 processor and this is the result:</p>
<p><img class="image-process-crisp" src="/images/pysparkling_4cores.png" alt="Speedup with parallel processing on a 4-core Intel i5." /></p>
<p>Achieving a 3x speedup with four cores is a good result for a real-world
benchmark. The new <code>Context._stats</code> variable gives more insight into where time is
actually spent. The numbers below are normalized with respect to the time
spent executing the map function. The results for this CPU-bound
benchmark with four processes are:</p>
<ul>
<li>map exec: 100.0%</li>
<li>driver deserialize data: 0.0%</li>
<li>map cache init: 0.2%</li>
<li>map deserialize data: 0.0%</li>
<li>map deserialize function: 2.1%</li>
</ul>
<p>Most of the time is spent in the actual map where it should be. The time it
takes to deserialize the map function is 2.1% of the time it takes to execute
it. The benchmark itself is run as a unit test in
<a href="https://github.com/svenkreiss/pysparkling/blob/master/tests/test_multiprocessing.py#L136">tests/test_multiprocessing.py</a>
and the plots can be recreated with <code>ipython tests/multiprocessing_performance_plot.py</code>.</p>
<p>The test was also run on a 4-core Intel i7 processor with Hyperthreading. The
performance is slightly better than with the i5, but the doubled thread
count does not double the performance.</p>
<p><img class="image-process-crisp" src="/images/pysparkling_4cores_hyperthreading.png" alt="Speedup with parallel processing on a 4-core Intel i7." /></p>
<p>As a first pass at multiprocessing with pysparkling, this is a good result.
Please check out the project on <a href="https://github.com/svenkreiss/pysparkling">Github</a>,
install it with <code>pip install pysparkling</code> and send feedback.</p>pysparkling Talks2015-08-16T00:00:00+02:002015-08-16T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-08-16:/blog/pysparkling-talks/<p>A collection of talks and links on pysparkling at PyGothan and Hack-and-Tell.</p><p><img class="image-process-crisp" src="/images/pysparkling_slide.png" alt="A slide on the basics of pysparkling from the PyGotham talk." /></p>
<ul>
<li>PyGotham (25min):
<a href="http://www.svenkreiss.com/files/pysparkling_at_pygotham_2015.pdf">slides</a>,
<a href="https://www.youtube.com/watch?v=KWxu5xuRtwo">video</a></li>
<li>Hack-and-Tell (5min):
<a href="http://www.svenkreiss.com/files/pysparkling_hack_and_tell.pdf">slides</a></li>
</ul>
<p>Links</p>
<ul>
<li>Documentation:
<a href="http://pysparkling.trivial.io/">pysparkling.trivial.io</a></li>
<li>Github:
<a href="https://github.com/svenkreiss/pysparkling">svenkreiss/pysparkling</a></li>
</ul>pysparkling2015-05-29T00:00:00+02:002015-05-29T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-05-29:/blog/pysparkling-initial/<p>A pure Python implementation of Apache Spark's RDD interfaces. pysparkling does not depend on Java and has a small execution overhead. It can be a fast test runner for Spark applications.</p><p><code>pysparkling</code> is a native Python implementation of the interface provided by
Spark’s RDDs. In Spark, RDDs are Resilient Distributed Datasets. An RDD
instance provides convenient access to the partitions of data that are
distributed on a cluster. New RDDs are created by applying transformations
like <code>map()</code> and <code>reduce()</code> to existing RDDs.</p>
<p><code>pysparkling</code> provides the same functionality, but without the dependency on
the Java Virtual Machine, Spark and Hadoop. The original motivation came from
implementing a processing pipeline that is common in machine learning: processing
a large number of documents in parallel to train a classification algorithm
(using Apache Spark) and then using that trained classification algorithm in an
API endpoint where it is applied to a single document at a time. That single
document, however, has to be preprocessed with the same transformations that were
applied during training. This is the task for <code>pysparkling</code>.</p>
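<p>As a sketch of that pattern, one shared (hypothetical) <code>preprocess</code> function serves both stages: on the cluster it is passed to <code>map()</code>, and in the API endpoint it is called directly on one document:</p>

```python
def preprocess(text):
    # Shared transformation: must be identical at training and serving time.
    return [w.lower() for w in text.split() if w.isalpha()]

# Training time: applied to many documents, e.g. rdd.map(preprocess) on a cluster.
corpus = ['Good morning', 'Bad weather today']
training_tokens = [preprocess(doc) for doc in corpus]

# Serving time: the very same function applied to a single incoming document.
print(preprocess('Good weather'))  # ['good', 'weather']
```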
<p>Removing the dependency on the JVM, Spark and Hadoop comes at a cost:</p>
<ul>
<li>Hadoop file io is gone, but its core functionality is reimplemented in
<code>pysparkling.fileio</code>. This by itself is very handy: you can read the
contents of files on <code>s3://</code>, <code>http://</code> and <code>file://</code>, optionally with
gzip or bz2 compression, just by specifying a file name. The name can
include multiple comma-separated files and the wildcards <code>?</code> and <code>*</code>.</li>
<li>Managed resource allocation on clusters is gone (no YARN). Parallel
execution with <code>multiprocessing</code> is supported though.</li>
</ul>
<p>It also comes with some advanced features:</p>
<ul>
<li>Parallelization with any object that has a map() method. That includes
<code>multiprocessing.Pool</code> and <code>concurrent.futures.ProcessPoolExecutor</code>.</li>
<li>It provides lazy and distributed execution. For example, when creating
an RDD from 50,000 text files with
<code>myrdd = Context().textFile('s3://mybucket/alldata/*.gz')</code> and only accessing
one record with <code>myrdd.takeSample(1)</code>, <code>pysparkling</code> will only download a
single file from S3 and not all 50,000.</li>
<h2>Quickstart</h2>
<p>Install pysparkling with <code>pip install pysparkling</code>. As a first example,
count the number of occurrences of every word in <code>README.rst</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pysparkling</span> <span class="kn">import</span> <span class="n">Context</span>
<span class="n">counts</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">Context</span><span class="p">()</span><span class="o">.</span><span class="n">textFile</span><span class="p">(</span><span class="s1">'README.rst'</span><span class="p">)</span>
<span class="o">.</span><span class="n">flatMap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">line</span><span class="p">:</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">' '</span><span class="p">))</span>
<span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">word</span><span class="p">:</span> <span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">:</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span>
<span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">counts</span><span class="o">.</span><span class="n">collect</span><span class="p">())</span>
</code></pre></div>
<p>More examples including how to explore the Common Crawl dataset and the dataset
of the Human Biome Project are in this
<a href="https://github.com/svenkreiss/pysparkling/blob/master/docs/demo.ipynb">IPython Notebook</a>.</p>
<h2>Further Reading</h2>
<p>Get an overview of the API and more details from
<a href="http://pysparkling.trivial.io">pysparkling's documentation</a>.
If you like this project, <a href="https://github.com/svenkreiss/pysparkling">star it on Github</a>,
tweet about it and follow me, @svenkreiss, on Twitter.</p>Wildcardians on Twitter2015-04-27T00:00:00+02:002015-04-27T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-04-27:/blog/wildcardians-on-twitter/<p>A short project to visualize the social Twitter graph of people at Wildcard. The backend is particularly efficient in the number of API calls. The visualization is interactive in d3.js.</p><p>Last Wednesday was Hack Day at Wildcard. This is a small extension of what we had done there:</p>
<p><img class="image-process-crisp" src="/images/wildcardians_on_twitter.png" alt="social graph of people at Wildcard" /></p>
<p>This graph was built from one API call to Twitter per person at Wildcard,
so only 23 API calls. People at Wildcard are represented by blue dots with their
Twitter handles next to them. The size of a dot is related to the number of followers.
Orange dots are tweets. Black dots are other Twitter handles that were mentioned.</p>
<p>The real visualization is interactive. You can hover over every tweet to read
its content and over every mentioned Twitter handle to reveal it.
That way, I discovered a few interesting Twitter accounts that are mentioned
in the tweets.</p>
<p>The backend is a Python script. The front-end is a Databench analysis with a
d3.js visualization.</p>Collaborative Statistical Modeling2015-04-11T00:00:00+02:002015-04-11T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2015-04-11:/blog/collaborative-statistical-modeling/<p>A poster about the collaborative statistical modeling work that Kyle Cranmer and I did and that was used to discover the Higgs boson. We presented this at the opening of the Center for Data Science at NYU.</p><p>This poster was created by <a href="http://theoryandpractice.org/">Kyle Cranmer</a> and me.
It is about the tools we built that were part of the discovery of the Higgs
boson. It is from a while ago, but it deserves more exposure: the work behind it
is applicable outside of physics yet largely unknown.</p>
<p><img class="image-process-crisp" src="/images/nyu_cds_open_poster.png" alt="Poster on collaborative statistical modeling at the opening of the NYU Data Science center" /></p>
<p><a href="/files/nyu_cds_open_poster.pdf">PDF</a> of the poster.</p>
<p>The Higgs group of the ATLAS experiment (one of the two large experiments at
CERN) has a few hundred members working in seven subgroups. The final
statistical test to claim the discovery is done with a combined statistical
model that takes input models from all subgroups, plus models from detector
performance groups and theoretical models from outside the Higgs group. It is
based on statistical methods and technical innovation that deserve more
attention. Outside of particle physics, this topic is gaining
interest, but people are largely unaware of the experience and technology
built up in particle physics.</p>
<p>The important part is the separation of model and method. Collaborative
statistical modeling at ATLAS concerns the way models are built,
investigated and debugged. The methods (inference,
generation, confidence intervals, credibility intervals, posterior
probabilities, hypothesis tests, ...) are implemented by tools that take a
model as input. Any method, no matter whether Frequentist or Bayesian, can
be applied to any model.</p>
<p>Links:</p>
<ul>
<li><a href="https://cds.nyu.edu/projects/collaborative-statistical-modeling/">Web page</a>
about our poster at the opening of the NYU Center for Data Science.</li>
<li><a href="/files/nyu_cds_open_poster.pdf">PDF</a> of the poster.</li>
</ul>PhD Thesis2014-08-17T00:00:00+02:002014-08-17T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2014-08-17:/blog/phd-thesis/<p>Finished my PhD thesis: Higgs Boson Discovery and First Property Measurements using the ATLAS Detector. It summarizes my work over a few years on Higgs Physics with ATLAS and on collaborative statistical modeling.</p><p><em>Higgs Boson Discovery and First Property Measurements using the ATLAS Detector</em></p>
<p>Last May, I finished my PhD in Particle Physics. I had a great time studying physics and doing research at some of the best places: the University of Edinburgh, Scotland, for my bachelor's and master's degrees and New York University for my PhD including a year at CERN in Geneva, Switzerland. I also had great advisors: <a href="http://www.thphys.uni-heidelberg.de/~plehn/">Tilman Plehn</a>, <a href="http://www.physics.carleton.ca/people/faculty-members/thomas-gregoire">Thomas Gregoire</a> and <a href="http://theoryandpractice.org/">Kyle Cranmer</a>.</p>
<p><img class="img-thumbnail float-right" src="/images/phd_higgs_overview.png" width="300" title="famous Higgs overview plot" alt="famous Higgs overview plot">
For my PhD, I was working on the discovery of the Higgs boson. CERN was an amazing place during that time, with the best particle physicists from all over the world working together. I made substantial contributions to the discovery in the <a href="http://atlas.ch/">ATLAS collaboration</a>. I was the first person to combine two search channels and to see the 5σ discovery threshold being breached (<a href="/blog/chasing-the-higgs-nyt/">blog post on the New York Times article</a>). I created the plot on the right that was published in the <a href="http://www.sciencedirect.com/science/article/pii/S037026931200857X">ATLAS discovery paper</a>, <a href="http://science.sciencemag.org/content/sci/338/6114/1576.full.pdf">Science</a> and many other places. I also worked on measuring the Higgs boson mass and the coupling strengths to other particles. A large part of my time was dedicated to statistical modeling and the development of analysis tools, some of which are now part of <a href="https://root.cern.ch">CERN's ROOT data analysis tool</a> and its statistics extension <a href="https://twiki.cern.ch/twiki/bin/view/RooStats/WebHome">RooStats</a>.</p>
<h3>Download Thesis: <a href="/files/phd_thesis.pdf"><i class="fa fa-book fa-lg"></i> PDF</a></h3>Databench2014-06-03T00:00:00+02:002014-06-03T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2014-06-03:/blog/databench-initial/<p>Databench is a data analysis tool using Flask, Socket.IO and d3.js with optional parallelization with Redis Queue and visualization with mpld3.</p><blockquote>
<p>Databench is a data analysis tool using <a href="http://flask.pocoo.org/">Flask</a>, <a href="https://socket.io/">Socket.IO</a> and <a href="https://d3js.org/">d3.js</a> with optional parallelization with <a href="http://python-rq.org/">Redis Queue</a> and visualization with <a href="http://mpld3.github.io/">mpld3</a>. Check out the <a href="http://databench-examples.trivial.io">live demos</a>.</p>
</blockquote>
<p><a href="http://databench-examples.trivial.io"><img class="image-process-crisp top" alt="matplotlib d3 demo" src="/images/mpld3pi_demo_noframe.png" /></a></p>
<p>Seriously, check out the <a href="http://databench-examples.trivial.io">live demos</a>.</p>
<p>All source code is available on GitHub:</p>
<ul>
<li><a href="https://github.com/svenkreiss/databench">github.com/svenkreiss/databench</a></li>
<li><a href="https://github.com/svenkreiss/databench_examples">github.com/svenkreiss/databench_examples</a></li>
<li><a href="https://github.com/svenkreiss/databench_examples_viewer">github.com/svenkreiss/databench_examples_viewer</a></li>
</ul>
<h2>Motivation</h2>
<p>I like Python for data analysis. However, its frontend options for visualization are limited. <code>d3.js</code> is a great JavaScript library, and the web browser is a powerful user interface. <code>Databench</code> makes Python communicate with the web frontend with minimal effort.</p>
<p>The frontend can be interactive (real-time communication goes both ways between <code>Python</code> and <code>JavaScript</code>/<code>d3.js</code>) and can contain explanatory text and documentation.</p>
<p>To run Databench, you need to install it with <code>pip</code>:</p>
<div class="highlight"><pre><span></span><code>pip install git+https://github.com/svenkreiss/databench.git
</code></pre></div>
<p>(preferably inside a <code>virtualenv</code>). Then you create an <code>analyses</code> folder, run <code>databench</code> on the command line</p>
<div class="highlight"><pre><span></span><code><span class="o">(</span>venv<span class="o">)</span>analysisfolder$ databench
Registering analysis simplepi as blueprint <span class="k">in</span> flask.
Registering analysis slowpi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3pi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3PointLabel as blueprint <span class="k">in</span> flask.
Registering analysis mpld3Drag as blueprint <span class="k">in</span> flask.
Connecting socket.io to simplepi.
Connecting socket.io to slowpi.
Connecting socket.io to mpld3pi.
Connecting socket.io to mpld3PointLabel.
Connecting socket.io to mpld3Drag.
--- databench ---
* Running on http://0.0.0.0:5000/
* Restarting with reloader
Registering analysis simplepi as blueprint <span class="k">in</span> flask.
Registering analysis slowpi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3pi as blueprint <span class="k">in</span> flask.
Registering analysis mpld3PointLabel as blueprint <span class="k">in</span> flask.
Registering analysis mpld3Drag as blueprint <span class="k">in</span> flask.
Connecting socket.io to simplepi.
Connecting socket.io to slowpi.
Connecting socket.io to mpld3pi.
Connecting socket.io to mpld3PointLabel.
Connecting socket.io to mpld3Drag.
--- databench ---
</code></pre></div>
<p>and point your web-browser to <code>http://localhost:5000/</code>.</p>
<h2>Example Analysis: <code>simplepi</code></h2>
<p>Create a project-folder with this structure:</p>
<div class="highlight"><pre><span></span><code><span class="p p-Indicator">-</span> <span class="l l-Scalar l-Scalar-Plain">analyses</span>
<span class="l l-Scalar l-Scalar-Plain">- templates</span>
<span class="l l-Scalar l-Scalar-Plain">- simplepi.html</span>
<span class="l l-Scalar l-Scalar-Plain">- __init__.py</span>
<span class="l l-Scalar l-Scalar-Plain">- simplepi.py</span>
</code></pre></div>
<p>On the command line, all that is necessary is to run <code>databench</code>; it prints the URL (usually <code>http://localhost:5000</code>) that you can open in a web browser.</p>
<p>This is the backend in <code>simplepi.py</code> <em>(updated June 10, 2014)</em>:</p>
<div class="highlight"><pre><span></span><code><span class="sd">"""Calculating \\(\\pi\\) the simple way."""</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">from</span> <span class="nn">time</span> <span class="kn">import</span> <span class="n">sleep</span>
<span class="kn">from</span> <span class="nn">random</span> <span class="kn">import</span> <span class="n">random</span>
<span class="kn">import</span> <span class="nn">databench</span>
<span class="n">simplepi</span> <span class="o">=</span> <span class="n">databench</span><span class="o">.</span><span class="n">Analysis</span><span class="p">(</span><span class="s1">'simplepi'</span><span class="p">,</span> <span class="vm">__name__</span><span class="p">)</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="vm">__doc__</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">thumbnail</span> <span class="o">=</span> <span class="s1">'simplepi.png'</span>
<span class="nd">@simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">on</span><span class="p">(</span><span class="s1">'connect'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">onconnect</span><span class="p">():</span>
<span class="sd">"""Run as soon as a browser connects to this."""</span>
<span class="n">inside</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10000</span><span class="p">):</span>
<span class="n">sleep</span><span class="p">(</span><span class="mf">0.001</span><span class="p">)</span>
<span class="n">r1</span> <span class="o">=</span> <span class="n">random</span><span class="p">()</span>
<span class="n">r2</span> <span class="o">=</span> <span class="n">random</span><span class="p">()</span>
<span class="k">if</span> <span class="n">r1</span><span class="o">*</span><span class="n">r1</span> <span class="o">+</span> <span class="n">r2</span><span class="o">*</span><span class="n">r2</span> <span class="o"><</span> <span class="mf">1.0</span><span class="p">:</span>
<span class="n">inside</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">%</span><span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">draws</span> <span class="o">=</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'log'</span><span class="p">,</span> <span class="p">{</span><span class="s1">'draws'</span><span class="p">:</span><span class="n">draws</span><span class="p">,</span> <span class="s1">'inside'</span><span class="p">:</span><span class="n">inside</span><span class="p">})</span>
<span class="n">p</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">inside</span><span class="p">)</span><span class="o">/</span><span class="n">draws</span>
<span class="n">uncertainty</span> <span class="o">=</span> <span class="mf">4.0</span><span class="o">*</span><span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">draws</span><span class="o">*</span><span class="n">p</span><span class="o">*</span><span class="p">(</span><span class="mf">1.0</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span> <span class="o">/</span> <span class="n">draws</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'status'</span><span class="p">,</span> <span class="p">{</span>
<span class="s1">'pi-estimate'</span><span class="p">:</span> <span class="mf">4.0</span><span class="o">*</span><span class="n">inside</span><span class="o">/</span><span class="n">draws</span><span class="p">,</span>
<span class="s1">'pi-uncertainty'</span><span class="p">:</span> <span class="n">uncertainty</span>
<span class="p">})</span>
<span class="n">simplepi</span><span class="o">.</span><span class="n">signals</span><span class="o">.</span><span class="n">emit</span><span class="p">(</span><span class="s1">'log'</span><span class="p">,</span> <span class="p">{</span><span class="s1">'action'</span><span class="p">:</span> <span class="s1">'done'</span><span class="p">})</span>
</code></pre></div>
<p>The analysis waits for the <code>connect</code> signal and then starts the computation. It provides the frontend with live updates through <code>signals.emit()</code>, where some of the <code>emit()</code> messages are for the <code>log</code> window and others are <code>status</code> updates.</p>
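<p>The <code>pi-uncertainty</code> sent with each <code>status</code> update follows from binomial counting statistics: out of \(n\) draws, the number of points inside the quarter circle has standard deviation \(\sqrt{n\,p\,(1-p)}\) with \(p = \mathrm{inside}/n\), so the estimate \(\hat\pi = 4\,\mathrm{inside}/n\) carries the uncertainty $$\sigma_{\hat\pi} = \frac{4\sqrt{n\,p\,(1-p)}}{n},$$ which is exactly the expression computed in the backend.</p>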
<p>The frontend now has to listen to the signals that are emitted by the backend and act on them. The frontend <code>simplepi.html</code> is a <code>jinja2</code> template with math rendered with <a href="https://www.mathjax.org/">MathJax</a> using <code>\( ... \)</code> for inline math and <code>$$ ... $$</code> for display math <em>(updated June 10, 2014)</em>:</p>
<div class="highlight"><pre><span></span><code>{% extends "base.html" %}
{% block title %}simplepi{% endblock %}
{% block content %}
<span class="p"><</span><span class="nt">h1</span><span class="p">></span>
simplepi
<span class="p"><</span><span class="nt">small</span><span class="p">><</span><span class="nt">i</span><span class="p">></span>π = <span class="p"><</span><span class="nt">span</span> <span class="na">id</span><span class="o">=</span><span class="s">"pi"</span><span class="p">></span>0.0 ± 1.0<span class="p"></</span><span class="nt">span</span><span class="p">></</span><span class="nt">i</span><span class="p">></</span><span class="nt">small</span><span class="p">></span>
<span class="p"></</span><span class="nt">h1</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span>This little demo uses two random numbers \(r_1\) and \(r_2\) and
then does a comparison $$r_1^2 + r_2^2 <span class="ni">&le;</span> 1.0$$ to figure out whether
the generated point is inside the first quadrant of the unit circle.<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="p"><</span><span class="nt">pre</span> <span class="na">id</span><span class="o">=</span><span class="s">"log"</span><span class="p">></</span><span class="nt">pre</span><span class="p">></span>
{% endblock %}
{% block footerscripts %}
<span class="p"><</span><span class="nt">script</span><span class="p">></span>
<span class="kd">var</span> <span class="nx">databench</span> <span class="o">=</span> <span class="nx">Databench</span><span class="p">(</span><span class="s1">'simplepi'</span><span class="p">);</span>
<span class="nx">databench</span><span class="p">.</span><span class="nx">genericElements</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">$</span><span class="p">(</span><span class="s1">'#log'</span><span class="p">));</span>
<span class="nx">databench</span><span class="p">.</span><span class="nx">signals</span><span class="p">.</span><span class="nx">on</span><span class="p">(</span><span class="s1">'status'</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">msg</span><span class="p">)</span> <span class="p">{</span>
<span class="nx">$</span><span class="p">(</span><span class="s1">'#pi'</span><span class="p">).</span><span class="nx">text</span><span class="p">(</span>
<span class="nx">msg</span><span class="p">[</span><span class="s1">'pi-estimate'</span><span class="p">].</span><span class="nx">toFixed</span><span class="p">(</span><span class="mf">3</span><span class="p">)</span><span class="o">+</span><span class="s1">' ± '</span><span class="o">+</span>
<span class="nx">msg</span><span class="p">[</span><span class="s1">'pi-uncertainty'</span><span class="p">].</span><span class="nx">toFixed</span><span class="p">(</span><span class="mf">3</span><span class="p">)</span>
<span class="p">);</span>
<span class="p">});</span>
<span class="p"></</span><span class="nt">script</span><span class="p">></span>
{% endblock %}
</code></pre></div>
<p>You may want to extend the Databench <code>base</code> template, which provides the header, the footer and some standard libraries, but you can also write your own. The <code>block content</code> is the HTML part of the frontend with fields for the results and an explanation of the algorithm. The <code>block footerscripts</code> provides the frontend logic. It wires the <code>log</code> signals to the <code>#log</code> field with <code>databench.genericElements.log($('#log'))</code>. It also listens for <code>status</code> signals; when a <code>status</code> signal is received, the callback function is executed with <code>msg</code> containing a JSON representation of the dictionary that the backend sent when emitting <code>status</code>.</p>
<p>And last, to make Databench aware of this analysis, add it to the <code>__init__.py</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">simplepi</span>
</code></pre></div>
<p>This is all that is necessary to create an analysis in Databench. Now you can run <code>databench</code> in the project-folder and visit <code>http://localhost:5000</code> to run and see the output of the analysis.</p>
<h2>Plotting with <code>matplotlib</code></h2>
<p>If you like Python, but are not too familiar with <code>d3.js</code>, you can use <a href="http://mpld3.github.io/">mpld3</a> to embed your python plots on the web. The <code>mpld3</code> website has a nice gallery of examples that should all work in Databench. Two of them -- one with a standard plugin and one with a custom plugin -- are <code>mpld3PointLabel</code> and <code>mpld3Drag</code> which are both included in the <a href="http://databench-examples.trivial.io">live demos</a> and the <a href="https://github.com/svenkreiss/databench_examples">databench_examples</a> repository.</p>
<h2>Parallelization</h2>
<p>Examples with parallel processing cannot be included in the <a href="http://databench-examples.trivial.io">live demos</a> but are included in the <a href="https://github.com/svenkreiss/databench_examples">databench_examples</a> repository.</p>
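<p>The basic pattern is to split the Monte Carlo draws into independent jobs. Here is a minimal Python sketch of such a split; the function below is the kind of pure worker that Redis Queue can enqueue (the names are illustrative and not part of Databench's API), run in-process here for simplicity:</p>

```python
import random

def sample_pi(draws, seed):
    """Worker: count random points that fall inside the unit quarter circle.

    A pure function like this is what Redis Queue would enqueue on a worker.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(draws):
        r1, r2 = rng.random(), rng.random()
        if r1 * r1 + r2 * r2 < 1.0:
            inside += 1
    return inside

# Split the total draws into four independent jobs. With Redis Queue this
# would be q.enqueue(sample_pi, draws, seed) per job; here we run in-process.
jobs = [(100000, seed) for seed in range(4)]
results = [sample_pi(draws, seed) for draws, seed in jobs]

total_draws = sum(draws for draws, _ in jobs)
pi_estimate = 4.0 * sum(results) / total_draws
```

<p>Merging the partial counts is a single sum, which is why the parallelization can live entirely on the analysis side.</p>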
<p>The <code>slowpi</code> example contains a demo implementation of parallelization with <a href="http://python-rq.org/">Redis Queue</a>. The parallelization is implemented entirely on the analysis side, without Databench knowing about it. Other parallelization techniques like <a href="http://www.celeryproject.org/">Celery</a> and <a href="http://www.rabbitmq.com/">RabbitMQ</a> probably work as well but are not tested yet.</p>dvds-js version 0.1.02014-04-25T00:00:00+02:002014-04-25T00:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2014-04-25:/blog/dvds-js-v0.1.0/<p>Distributed Versioned Data Structures in JavaScript. Like git in js.</p><h2>This article and dvds-js are outdated :(</h2>
<script src="//cdnjs.cloudflare.com/ajax/libs/d3/3.4.11/d3.min.js" charset="utf-8"></script>
<script src="http://requirejs.org/docs/release/2.1.2/minified/require.js"></script>
<script>
require.config({
paths: {
'crypto-js.SHA3': 'http://crypto-js.googlecode.com/svn/tags/3.1.2/build/rollups/sha3',
'dvds': 'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min',
'dvds.visualize': 'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min',
},
shim: {
'crypto-js.SHA3': {
exports: 'CryptoJS'
}
}
});
</script>
<blockquote>
<p>Distributed Versioned Data Structures in JavaScript. Like git in js.
Check out the code on <a href="http://github.com/svenkreiss/dvds-js">github.com/svenkreiss/dvds-js</a>.</p>
</blockquote>
<p>The aim of <code>dvds-js</code> is to have a container (or repository) for data structures in JavaScript that you can <code>fork()</code>, serialize and send over the wire, <code>commit()</code> to and then stream back and <code>merge()</code> with full conflict resolution. Here, <em>data structures</em> means anything that can be serialized with JSON.</p>
<p>This post is about the first development release, version 0.1.0.</p>
<h2>Example</h2>
<p>A repository <code>a</code> is created holding an array with the two names <code>Paul</code> and <code>Adam</code>. Then this repository is forked and the fork is called <code>b</code>. Both <code>a</code> and <code>b</code> are then modified. To demonstrate streaming capabilities, repository <code>b</code> is stringified before and after the manipulation. At the end <code>b</code> is merged into <code>a</code> and the result is shown below.</p>
<div class="highlight"><pre><span></span><code><span class="nx">require</span><span class="p">([</span><span class="s1">'dvds'</span><span class="p">,</span> <span class="s1">'dvds.visualize'</span><span class="p">],</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">a</span> <span class="o">=</span> <span class="ow">new</span> <span class="nx">dvds</span><span class="p">.</span><span class="nx">Repository</span><span class="p">([</span><span class="s1">'Paul'</span><span class="p">,</span><span class="s1">'Adam'</span><span class="p">]);</span>
<span class="kd">var</span> <span class="nx">b</span> <span class="o">=</span> <span class="nx">a</span><span class="p">.</span><span class="nx">fork</span><span class="p">();</span>
<span class="kd">var</span> <span class="nx">bString</span> <span class="o">=</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">b</span><span class="p">);</span>
<span class="c1">// send bString to a different machine and make it a repository again</span>
<span class="kd">var</span> <span class="nx">bStreamed</span> <span class="o">=</span> <span class="nx">dvds</span><span class="p">.</span><span class="nx">Repository</span><span class="p">.</span><span class="nx">parseJSON</span><span class="p">(</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">bString</span><span class="p">)</span> <span class="p">);</span>
<span class="nx">bStreamed</span><span class="p">.</span><span class="nx">data</span><span class="p">[</span><span class="mf">0</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Karl'</span><span class="p">;</span>
<span class="nx">bStreamed</span><span class="p">.</span><span class="nx">data</span><span class="p">[</span><span class="mf">1</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Peter'</span><span class="p">;</span>
<span class="c1">// convert to a string again to send back</span>
<span class="kd">var</span> <span class="nx">bStreamedString</span> <span class="o">=</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">bStreamed</span><span class="p">);</span>
<span class="c1">// meanwhile on a</span>
<span class="nx">a</span><span class="p">.</span><span class="nx">data</span><span class="p">[</span><span class="mf">0</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Paula'</span><span class="p">;</span>
<span class="c1">// receive the modified b repository</span>
<span class="kd">var</span> <span class="nx">bReceived</span> <span class="o">=</span> <span class="nx">dvds</span><span class="p">.</span><span class="nx">Repository</span><span class="p">.</span><span class="nx">parseJSON</span><span class="p">(</span> <span class="nb">JSON</span><span class="p">.</span><span class="nx">parse</span><span class="p">(</span><span class="nx">bStreamedString</span><span class="p">)</span> <span class="p">);</span>
<span class="nx">a</span><span class="p">.</span><span class="nx">merge</span><span class="p">(</span><span class="nx">bReceived</span><span class="p">);</span>
<span class="c1">// update html output</span>
<span class="nx">$</span><span class="p">(</span><span class="s2">"#test1Out"</span><span class="p">).</span><span class="nx">text</span><span class="p">(</span><span class="nb">JSON</span><span class="p">.</span><span class="nx">stringify</span><span class="p">(</span><span class="nx">a</span><span class="p">.</span><span class="nx">data</span><span class="p">));</span>
<span class="c1">// visualize</span>
<span class="nx">dvds</span><span class="p">.</span><span class="nx">visualize</span><span class="p">.</span><span class="nx">CommitGraph</span><span class="p">(</span><span class="nx">d3</span><span class="p">.</span><span class="nx">select</span><span class="p">(</span><span class="s1">'#test1Graph'</span><span class="p">))(</span><span class="nx">a</span><span class="p">);</span>
<span class="nx">dvds</span><span class="p">.</span><span class="nx">visualize</span><span class="p">.</span><span class="nx">CommitGraph</span><span class="p">(</span><span class="nx">d3</span><span class="p">.</span><span class="nx">select</span><span class="p">(</span><span class="s1">'#test2Graph'</span><span class="p">))(</span><span class="nx">bReceived</span><span class="p">);</span>
<span class="p">});</span>
</code></pre></div>
<p><strong>Live output</strong>: <span id="test1Out">?</span></p>
<p>Edit on <a href="http://jsfiddle.net/3Ruat/11/">http://jsfiddle.net/3Ruat/11/</a>.</p>
<h3>Graph of Commits</h3>
<p>Repositories start with commit 0, shown on the left, and develop towards the right, with the latest commit on the far right. The second graph shows the merge of <code>a</code> and <code>b</code> as the last commit. This is a live visualization of the two repositories in the example.</p>
<p>Repository <code>b</code>:</p>
<p><svg height="150" width="600" id="test2Graph"></svg></p>
<p>Repository <code>a</code> merged with <code>b</code>:</p>
<p><svg height="150" width="600" id="test1Graph"></svg></p>
<h2>Features</h2>
<ul>
<li>special merge algorithms for nested arrays and objects (e.g. arrays inside of objects inside of arrays inside of an object)</li>
<li>the commit hash is built over the commit's data but also over the entire parent tree, which means that a commit id validates the entire parent tree</li>
<li>a repository exposes the <code>data</code> member that behaves like a normal js variable (e.g. can be used in <code>angular.js</code> directly)</li>
<li>visualization (currently only <code>CommitGraph</code>) is factored into its own submodule <code>visualize</code></li>
<li>unit tests run with <code>Jasmine</code> and <code>Karma</code>, <code>jscs</code> is used to check code style, <code>uglify</code> is used to build min version and automation is done with <code>grunt</code></li>
</ul>
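<p>The chained commit hash in the list above is the same Merkle-style construction that git uses. As a language-neutral illustration, here is a toy Python sketch (not dvds-js's actual implementation, which hashes with CryptoJS's SHA3):</p>

```python
import hashlib
import json

def commit_id(data, parent_ids):
    """A commit id hashes the commit's own data together with its parents'
    ids, so changing any ancestor changes every descendant id."""
    payload = json.dumps({'data': data, 'parents': sorted(parent_ids)},
                         sort_keys=True)
    return hashlib.sha3_256(payload.encode()).hexdigest()

root = commit_id(['Paul', 'Adam'], [])
child = commit_id(['Paula', 'Adam'], [root])

# Tampering with the root produces a different root id, which in turn
# changes the id of every commit built on top of it.
tampered = commit_id(['Paul', 'Eve'], [])
assert child != commit_id(['Paula', 'Adam'], [tampered])
```

<p>This is why a single commit id can validate the entire parent tree.</p>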
<h2>Setup</h2>
<p><code>dvds-js</code> is an <a href="http://requirejs.org/docs/whyamd.html#amd">AMD library</a>. You can load it using <code>require-js</code> in the browser as in the example above. The setup looks something like this:</p>
<div class="highlight"><pre><span></span><code><span class="o"><</span><span class="nx">script</span> <span class="nx">src</span><span class="o">=</span><span class="s2">"http://s3.amazonaws.com/flaskApp_static/static/d3/d3.v3.min.js"</span> <span class="nx">charset</span><span class="o">=</span><span class="s2">"utf-8"</span><span class="o">><</span><span class="err">/script></span>
<span class="o"><</span><span class="nx">script</span> <span class="nx">src</span><span class="o">=</span><span class="s2">"http://requirejs.org/docs/release/2.1.2/minified/require.js"</span><span class="o">><</span><span class="err">/script></span>
<span class="o"><</span><span class="nx">script</span><span class="o">></span>
<span class="nx">require</span><span class="p">.</span><span class="nx">config</span><span class="p">({</span>
<span class="nx">paths</span><span class="o">:</span> <span class="p">{</span>
<span class="s1">'crypto-js.SHA3'</span><span class="o">:</span> <span class="s1">'http://crypto-js.googlecode.com/svn/tags/3.1.2/build/rollups/sha3'</span><span class="p">,</span>
<span class="s1">'dvds'</span><span class="o">:</span> <span class="s1">'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min'</span><span class="p">,</span>
<span class="s1">'dvds.visualize'</span><span class="o">:</span> <span class="s1">'http://svenkreiss.github.io/dvds-js/lib/dvds-0.1.0/dvds.min'</span><span class="p">,</span>
<span class="p">},</span>
<span class="nx">shim</span><span class="o">:</span> <span class="p">{</span>
<span class="s1">'crypto-js.SHA3'</span><span class="o">:</span> <span class="p">{</span>
<span class="nx">exports</span><span class="o">:</span> <span class="s1">'CryptoJS'</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">});</span>
<span class="o"><</span><span class="err">/script></span>
</code></pre></div>
<p>This includes <code>d3.js</code> for visualizations; <code>CryptoJS</code> is needed to calculate unique identifiers for commits.
In <code>node.js</code>, this setup is not necessary and you would simply use <code>require()</code>.</p>
<h2>Appendix: Static image of commit graphs</h2>
<p><img src="/images/dvds-js-v010-commitgraphs.png" width="500" title="Commit graphs of dvds-js example." alt="Commit graphs of dvds-js example."></p>
<script>
require(['dvds', 'dvds.visualize'], function() {
var a = new dvds.Repository(['Paul', 'Adam']);
var b = a.fork();
var bString = JSON.stringify(b);
// send bString to a different machine and make it a repository again
var bStreamed = dvds.Repository.parseJSON(JSON.parse(bString));
bStreamed.data[0] = 'Karl';
bStreamed.data[1] = 'Peter';
// convert to a string again to send back
var bStreamedString = JSON.stringify(bStreamed);
// meanwhile on a
a.data[0] = 'Paula';
// receive the modified b repository
var bReceived = dvds.Repository.parseJSON(JSON.parse(bStreamedString));
a.merge(bReceived);
// update html output
$("#test1Out").text(JSON.stringify(a.data));
// visualize
dvds.visualize.CommitGraph(d3.select('#test1Graph'))(a);
dvds.visualize.CommitGraph(d3.select('#test2Graph'))(bReceived);
});
</script>Vimeo liquid tag for Pelican2014-03-07T04:41:00+01:002014-03-07T04:41:00+01:00Sven Kreisstag:www.svenkreiss.com,2014-03-07:/blog/pelican-vimeo/<p>Extend liquid tags plugin for Pelican to include a Vimeo tag.</p><p>Testing my implementation of the <code>vimeo</code> tag for <code>liquid_tags</code>. This is based on the <code>youtube</code> tag which in turn is based on the <a href="https://gist.github.com/jamieowen/2063748">jekyll / octopress youtube tag</a>.</p>
<p>The syntax is the same as for the <code>youtube</code> tag:</p>
<div class="highlight"><pre>
{% vimeo id [width height] %}
</pre></div>
<p><em>Update</em>: The code is now merged into the main pelican-plugins repository on github:
<a href="https://github.com/getpelican/pelican-plugins">https://github.com/getpelican/pelican-plugins</a></p>
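<p>For reference, a liquid tag of this kind boils down to a regular expression and an iframe template. A hedged Python sketch (illustrative only; the merged code in pelican-plugins differs in its details):</p>

```python
import re

# Matches the markup inside {% vimeo id [width height] %}.
VIMEO_RE = re.compile(r'^(?P<id>\d+)(?:\s+(?P<w>\d+)\s+(?P<h>\d+))?$')

def vimeo(markup, default_size=(640, 360)):
    """Render 'id' or 'id width height' markup as a Vimeo iframe embed."""
    match = VIMEO_RE.match(markup.strip())
    if match is None:
        raise ValueError("expected 'id [width height]', got %r" % markup)
    width, height = default_size
    if match.group('w'):
        width, height = int(match.group('w')), int(match.group('h'))
    return (
        '<span class="videobox"><iframe '
        'src="//player.vimeo.com/video/%s?title=0&byline=0&portrait=0" '
        'width="%d" height="%d" frameborder="0" '
        'webkitAllowFullScreen mozallowfullscreen allowFullScreen>'
        '</iframe></span>' % (match.group('id'), width, height)
    )
```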
<h2>Tests with different sizes</h2>
<p><span class="videobox">
<iframe
src="//player.vimeo.com/video/21789576?title=0&byline=0&portrait=0"
width="320" height="180" frameborder="0"
webkitAllowFullScreen mozallowfullscreen allowFullScreen>
</iframe>
</span></p>
<p><span class="videobox">
<iframe
src="//player.vimeo.com/video/21789576?title=0&byline=0&portrait=0"
width="480" height="270" frameborder="0"
webkitAllowFullScreen mozallowfullscreen allowFullScreen>
</iframe>
</span></p>
<p><span class="videobox">
<iframe
src="//player.vimeo.com/video/21789576?title=0&byline=0&portrait=0"
width="640" height="360" frameborder="0"
webkitAllowFullScreen mozallowfullscreen allowFullScreen>
</iframe>
</span></p>morphDemo2014-03-06T10:10:00+01:002014-03-06T10:10:00+01:00Sven Kreisstag:www.svenkreiss.com,2014-03-06:/blog/morph-demo/<p>Demo of a new horizontal morphing algorithm.</p><p><img class="img-thumbnail float-right" src="/images/kdtreemorph_preview.png" width="350" title="preview of kd-tree morphing" alt="preview of kd-tree morphing">
<a href="/files/morphDemo.html">This</a> is an interactive demo of a new morphing algorithm with special properties motivated by physics. It uses KD trees and kernel density estimates that are computed in real time in this demo. All visualization is done using <code>d3.js</code> with custom JavaScript code for the KD trees and kernel densities.</p>
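<p>For background on what the demo computes: a kernel density estimate is simply an average of kernel bumps centered on the data points. A minimal Python sketch of a Gaussian KDE (conceptual only; the demo's own implementation is in JavaScript):</p>

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function: the average of Gaussian bumps, one per sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density

f = gaussian_kde([0.0, 1.0, 1.2], bandwidth=0.5)
# A density integrates to one; check numerically on a wide grid.
total = sum(f(-5.0 + 0.01 * i) * 0.01 for i in range(2001))
```

<p>A KD tree can then speed up such evaluations by restricting the sum to samples near the query point.</p>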
<p>Link: <a href="/files/morphDemo.html">morphDemo.html</a></p>Chasing the Higgs - New York Times2014-03-01T02:00:00+01:002014-03-01T02:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2014-03-01:/blog/chasing-the-higgs-nyt/<p>My part of the Higgs discovery story in the New York Times.</p><p>My part of the Higgs discovery story in the New York Times.</p>
<p><img src="/images/nyt_science_front_page.jpeg" width="200" title="Science Times section front page" alt="Science Times section front page">
<img src="/images/nyt_science_my_part.jpeg" width="350" title="Chasing the Higgs, my part" alt="Chasing the Higgs, my part"></p>
<p>That "joyful expletive" was "Holly shit" (not correcting the typo) sent from his phone.<br />
Read the full story on the <a href="http://www.nytimes.com/2013/03/05/science/chasing-the-higgs-boson-how-2-teams-of-rivals-at-CERN-searched-for-physics-most-elusive-particle.html?view=Opening_the_Box">New York Times website</a> from March 5, 2013.</p>A Nobel Prize Party: Cheese, Bubbles, and a Boson - The New Yorker2013-10-10T08:00:00+02:002013-10-10T08:00:00+02:00Sven Kreisstag:www.svenkreiss.com,2013-10-10:/blog/nobel-prize-party-new-yorker/<p>I got mentioned in the New Yorker article: A Nobel Prize Party: Cheese, Bubbles, and a Boson.</p><p>A funny and only approximately accurate article about how we celebrated the Nobel Prize for Peter Higgs and François Englert at NYU in the New Yorker:</p>
<p><img alt="New Yorker article about the party at the NYU Physics Department" src="/images/new_yorker_nobel_prize_party.png"></p>
<blockquote>
<p>Sven Kreiss, Cranmer’s graduate student, was the first to see the statistical evidence needed to claim the discovery in June, 2012. He told me in a strong German accent that they don’t host a lot of parties in the physics lounge. “We are very serious here,” he said.</p>
<p>Kreiss has a little goatee and was wearing a black T-shirt with an unzipped grey sweatshirt. He remained stoic as he recalled the moment, at CERN, the research center in Geneva, when he saw their research cross the finish line, confirming the particle’s existence. Kreiss was working on ATLAS (A Toroidal L.H.C. Apparatus), one of seven experiments being conducted at the Large Hadron Collider, and was on one of two detector teams going after the Higgs boson. “It’s a graph,” Kreiss said of what he saw at the time. “It has some lines. The line, it goes down like this”—he swooped his hand down—“and if the line goes down far enough, then you say you’ve discovered a new particle.” He shrugged.</p>
<p>...</p>
<p>Kreiss didn’t immediately think that the finding was Nobel-worthy. “It was combined with a lot of exhaustion,” he said. “You’re tired, you think about this, you go out and come back in. Actually, I had a good night’s sleep for the first time in a while. And then, in the morning, I came back and e-mailed this to my professor. It was his birthday, so I said, ‘Happy birthday.’ ” That was June 25, 2012. Cranmer can remember how excited he was to receive the note. “I wrote back, ‘Holy shit,’ ” he said. “But I misspelled ‘holy.’ Too many ‘L’s.”</p>
</blockquote>
<p>The full story is here: <a href="http://www.newyorker.com/tech/elements/a-nobel-prize-party-cheese-bubbles-and-a-boson">http://www.newyorker.com/tech/elements/a-nobel-prize-party-cheese-bubbles-and-a-boson</a></p>My robot Number4 in the Leipziger Volkszeitung2004-01-01T00:00:00+01:002004-01-01T00:00:00+01:00Sven Kreisstag:www.svenkreiss.com,2004-01-01:/blog/number4-lvz/<p>Student from Brandis (me) teaches robot how to walk.</p><p>Newspaper article about the competition Jugend Forscht. In 2004 this project got me the first place in the regional and state competition. The photo was taken at the national competition in Saarbruecken.</p>
<p><img src="/images/number4/lvz.jpg" title="number4 in LVZ" alt="number4 in LVZ"></p>