Python at a University Lab

I am a post-doc at the VITA lab at EPFL.

Good software engineering practices are important to me. They are foundational for open and reproducible science. You -- a current or future member of a Machine Learning lab -- are not necessarily expected to be a computer scientist or software engineer, but the tools must be familiar. Expert-level familiar. You will use these tools most of the day. You should not be clumsy with or afraid of these tools. Below is an ambitious list, so don't worry if you are not familiar with most of it yet.

Python Core: The best way to get started and a great resource for advanced users is the tutorial on the official Python webpage. Here are my notes more specific to ML.

always use Python3 now, not Python2
read Python code of great projects outside of ML (almost no ML projects are written well, see closing remarks): start with requests
Python core and standard libraries (always available, so use them):
- list comprehensions
- classes: when your function becomes too long it might want to be a class
- sets: avoid testing whether an element is in a list (a linear scan); test whether it is in a set (a hash lookup)
- argparse: use the standard library to parse command-line arguments
- logging
- defaultdict for sparse accumulators
- generators and yield: see the Functional Programming HOWTO in the official Python documentation
- all itertools (e.g. itertools.chain.from_iterable)
- functools.lru_cache memoizes the results of a function with a single decorator
- the property decorator is great for adding caches: compute on first access, then return the stored value
- more advanced but useful: context managers, decorators
understand stack traces (the basics: in a Python traceback, each line is the caller of the line below it, and the error was raised at the bottom); see the Wikipedia article on stack traces
foundational scientific packages: numpy, scipy, scikit-learn and, depending on your focus, cython
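To make the standard library items above concrete, here is a minimal sketch that touches several of them in one place. All function, class and argument names (count_known_tokens, Dataset, --learning-rate, ...) are hypothetical and chosen only for illustration; the library calls themselves are standard.

```python
import argparse
import logging
from collections import defaultdict
from functools import lru_cache
from itertools import chain

LOG = logging.getLogger(__name__)


def read_tokens(lines):
    """Generator: yields one token at a time instead of building a full list."""
    for line in lines:
        yield from line.split()


def count_known_tokens(sentences, vocabulary):
    """Count how often each vocabulary token appears in the sentences."""
    known = set(vocabulary)    # membership test on a set is a hash lookup
    counts = defaultdict(int)  # sparse accumulator: missing keys start at 0
    # chain.from_iterable flattens the per-sentence token lists lazily
    for token in chain.from_iterable(s.split() for s in sentences):
        if token in known:
            counts[token] += 1
    return dict(counts)


@lru_cache(maxsize=None)
def fibonacci(n):
    """Recursive definition; lru_cache stores every result already computed."""
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)


class Dataset:
    """Hypothetical container showing a property used as a cache."""

    def __init__(self, values):
        self.values = list(values)
        self._mean = None

    @property
    def mean(self):
        if self._mean is None:  # computed on first access only
            LOG.debug("computing mean of %d values", len(self.values))
            self._mean = sum(self.values) / len(self.values)
        return self._mean


def parse_arguments(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Hypothetical training script.")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--debug", action="store_true")
    return parser.parse_args(argv)
```

In a script, you would typically call parse_arguments() once and then logging.basicConfig(level=logging.DEBUG if args.debug else logging.INFO) before doing any work.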

Packaging: As soon as you need to re-use code in a second file, you will want to import one of your own files. You should use relative imports, and to use those properly you need to package your code. Packaging in Python used to be a mess, but much progress has been made in recent years. When you search for help related to packaging, filter for results from within the last year.

pip: I have never used conda
relative imports: see tutorial section on Modules
setup.py: see setuptools basic use
venv: I never do anything outside a virtual environment
Jupyter notebooks: Great for demos and log books. Never for developing code.
mybinder.org to make your Jupyter demos interactive
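As a rough illustration of why relative imports need a package, the sketch below writes a tiny package to a temporary directory and imports it. The package name labtools and its modules are hypothetical. In a real project the files would live in your repository next to a setup.py, and you would run pip install --editable . inside your virtual environment; the sys.path line is only a stand-in for that step so that the sketch runs on its own.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical package layout: labtools/{__init__,metrics,evaluate}.py
FILES = {
    "labtools/__init__.py": "",
    "labtools/metrics.py": (
        "def accuracy(predictions, targets):\n"
        "    matches = sum(p == t for p, t in zip(predictions, targets))\n"
        "    return matches / len(targets)\n"
    ),
    "labtools/evaluate.py": (
        "from .metrics import accuracy  # relative import inside the package\n"
        "\n"
        "def report(predictions, targets):\n"
        "    return 'accuracy: {:.2f}'.format(accuracy(predictions, targets))\n"
    ),
}


def build_and_import(root):
    """Write the package files below root and import labtools.evaluate."""
    for relative_path, source in FILES.items():
        path = Path(root) / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)
    # Stand-in for `pip install --editable .`; do not do this in real code.
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    return importlib.import_module("labtools.evaluate")


with tempfile.TemporaryDirectory() as root:
    evaluate = build_and_import(root)
    summary = evaluate.report([1, 0, 1, 1], [1, 1, 1, 1])
    sys.path.remove(root)
```

The line from .metrics import accuracy only works because evaluate.py is imported as part of the labtools package. Running the same file directly as a script would fail with an ImportError.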

Style / Unit Tests / Continuous Testing: Standard practices for software engineers are useful for ML projects, too. They help. Beyond testing code, unit tests in combination with continuous integration give a robust and reproducible starting point for anyone picking up your project (including yourself in one year).

PEP8 is the Style Guide for Python and contains explanations. Don't skip the section "A Foolish Consistency is the Hobgoblin of Little Minds". Still, I follow almost all of PEP8 to the letter. Additional rules:
- no abbreviations for variables, classes, functions, etc.
- do not iterate over indices in Python: avoid for i in range(len(mylist)); iterate over the elements directly, or use enumerate() when you also need the index
pylint generally provides good advice beyond checking for PEP8
pytest to run all your tests
CircleCI and TravisCI can automatically run your tests on every commit to git
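A small sketch of the iteration advice together with a pytest-style test. The function names and the example data are hypothetical. pytest collects functions whose names start with test_ from files named test_*.py, so the test needs no extra boilerplate.

```python
def pair_scores(names, scores):
    """Iterate over two lists in parallel with zip, not with indices."""
    return ["{}: {:.1f}".format(name, score) for name, score in zip(names, scores)]


def numbered(names):
    """Use enumerate when the position is needed alongside the element."""
    return ["{}. {}".format(position, name)
            for position, name in enumerate(names, start=1)]


def test_pair_scores():
    assert pair_scores(["resnet", "vgg"], [0.9, 0.8]) == ["resnet: 0.9", "vgg: 0.8"]


def test_numbered():
    assert numbered(["train", "evaluate"]) == ["1. train", "2. evaluate"]
```

Running pytest in the project directory executes both tests, and a continuous integration service can run the same command on every commit.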

Red Flags: When browsing open source code, these are the things that worry me.

no evaluation metrics or scripts for the implementation
you have to change the PATH or PYTHONPATH variables
you have to copy or symlink folders with code

Closing: Infrastructure at large companies is different from a university lab and your laptop: containers, distributed storage, build systems, custom DNS, monorepos. Often, a small piece from the middle of that stack is open-sourced and duct-taped to make it work without the rest of the infrastructure. Don't blindly follow what seems like a standard practice of the best software engineers in the world. It might be an artifact of porting the code to the open source world.

Feedback is welcome on Twitter @svenkreiss.


Sven Kreiss
