I am a post-doc at the VITA lab at EPFL.
Good software engineering practices are important to me. They are foundational for open and reproducible science. You -- a current or future member of a Machine Learning lab -- are not necessarily expected to be a computer scientist or software engineer, but the tools must be familiar. Expert-level familiar. You will use these tools most of the day. You should not be clumsy with or afraid of these tools. Below is an ambitious list, so don't worry if you are not familiar with most of it yet.
Python Core: The best way to get started and a great resource for advanced users is the tutorial on the official Python webpage. Here are my notes more specific to ML.
- always use Python 3, not Python 2
- read Python code of great projects outside of ML (almost no ML projects are written well, see closing remarks):
- Python core and standard libraries (always available, so use them):
- list comprehensions
- classes: when your function becomes too long it might want to be a class
- sets: avoid testing whether an element is in a list, test whether it is in a set
- `argparse`: use the standard library to parse command line arguments
- `defaultdict` for sparse accumulators
- generators and `yield`, see functional programming howto
- `functools.lru_cache` is a powerful tool
- the `property` decorator is great for adding caches
- more advanced but useful: context managers, decorators
- understand stack traces (the basics: the line above is where the line below originated), Wikipedia
- scientific foundational packages:
- `scikit-learn` and depending on your focus
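Several of the standard library tools from the list above fit into one small sketch. The label-counting task and the names `parse_arguments`, `read_labels`, `Dataset` and `--min-count` are made up for illustration; only the standard library calls are real.

```python
import argparse
import functools
from collections import defaultdict


def parse_arguments(arguments=None):
    """Parse command line arguments with the standard library."""
    parser = argparse.ArgumentParser(description="Count labels.")
    parser.add_argument("--min-count", type=int, default=1,
                        help="only report labels seen at least this often")
    return parser.parse_args(arguments)


def read_labels(lines):
    """Generator: yield one cleaned label at a time, skipping blank lines."""
    for line in lines:
        label = line.strip()
        if label:
            yield label


@functools.lru_cache(maxsize=None)
def fibonacci(n):
    """Cached recursion: each value is computed only once."""
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)


class Dataset:
    """Hypothetical class with a property that caches an expensive result."""

    def __init__(self, labels):
        self.labels = list(labels)
        self._counts = None  # filled on first access

    @property
    def counts(self):
        if self._counts is None:
            counts = defaultdict(int)  # missing keys start at 0
            for label in self.labels:
                counts[label] += 1
            self._counts = dict(counts)
        return self._counts


# Passing a list here stands in for real command line input.
args = parse_arguments(["--min-count", "2"])
dataset = Dataset(read_labels(["cat\n", "dog\n", "\n", "cat\n"]))
ignored_labels = {"dog", "bird"}  # set: constant-time membership test

frequent_labels = [
    label for label, count in dataset.counts.items()
    if count >= args.min_count and label not in ignored_labels
]
print(frequent_labels)  # ['cat']
print(fibonacci(30))    # 832040
```

In a real script you would call `parse_arguments()` without a list so that `argparse` reads `sys.argv`.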
Packaging: As soon as you need to re-use code in a second file, you will want to import one of your own files. You should use relative imports and to use those properly you need to package your code. Packaging in Python used to be a mess, but much progress has been made over recent years. When you search for help related to packaging, filter for results within the last year.
- `pip`, I have never used
- relative imports: see tutorial section on Modules
- `venv`: I never do anything outside a virtual environment
- Jupyter notebooks: Great for demos and log books. Never to code.
- mybinder.org to make your Jupyter demos interactive
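To make the relative import point concrete, here is a rough sketch. The package name `mylab`, the module `metrics.py` and the function `accuracy` are hypothetical. The files are written to a temporary directory only so that the example is self-contained; in a real project they would live in your repository and the package would be installed into your virtual environment with `pip`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout:
#   mylab/
#       __init__.py   <- uses a relative import: from .metrics import accuracy
#       metrics.py
root = Path(tempfile.mkdtemp())
package = root / "mylab"
package.mkdir()
(package / "__init__.py").write_text("from .metrics import accuracy\n")
(package / "metrics.py").write_text(
    "def accuracy(predictions, targets):\n"
    "    matches = sum(p == t for p, t in zip(predictions, targets))\n"
    "    return matches / len(targets)\n"
)

# Stand-in for installing the package; only needed because this sketch
# creates the package on the fly. Do not do this in real project code.
sys.path.insert(0, str(root))
mylab = importlib.import_module("mylab")

result = mylab.accuracy([1, 0, 1], [1, 1, 1])
print(result)
```

The leading dot in `from .metrics import accuracy` means "from the same package", which is why the import only works once `mylab` is treated as a package rather than as loose files.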
Style / Unit Tests / Continuous Testing: Standard practices for software engineers are useful for ML projects, too. They help. Beyond testing code, unit tests in combination with continuous integration give a robust and reproducible starting point for anyone picking up your project (including yourself in one year).
- PEP8 is the Style Guide for Python and contains explanations. Don't skip the section "A Foolish Consistency is the Hobgoblin of Little Minds". Still, I follow almost all of PEP8 to the letter. Additional:
- no abbreviations for variables, classes, functions, etc.
- do not iterate over indices in Python: don't do `for i in range(len(mylist))`
- `pylint` generally provides good advice beyond checking for PEP8
- `pytest` to run all your tests
- CircleCI and TravisCI can automatically run your tests on every commit to git
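A minimal sketch of what these points can look like in practice. The functions `normalize` and `describe` are made up for illustration. The tests are plain functions with `assert` statements, which is the style `pytest` collects from files named `test_*.py`.

```python
def normalize(values):
    """Scale values so that they sum to one."""
    total = sum(values)
    return [value / total for value in values]


def describe(names, scores):
    """Pair names with scores without iterating over indices."""
    return [
        f"{rank}: {name} ({score:.2f})"
        for rank, (name, score) in enumerate(zip(names, scores), start=1)
    ]


def test_normalize_sums_to_one():
    assert abs(sum(normalize([1.0, 3.0])) - 1.0) < 1e-9


def test_describe_uses_names_and_scores():
    assert describe(["cat", "dog"], [0.25, 0.75]) == [
        "1: cat (0.25)",
        "2: dog (0.75)",
    ]
```

`enumerate` and `zip` cover most cases where `range(len(mylist))` is tempting. Assuming the code is saved as, say, `test_example.py`, running `pytest` in that folder should pick up both tests.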
Red Flags: When browsing open source code, this is when I get worried.
- no eval metrics/scripts for that implementation
- have to change the
- have to copy or symlink folders with code
Closing: Infrastructures at large companies are different from a university lab and your laptop: containers, distributed storage, build systems, custom DNS, mono repositories. A small piece in the middle of the stack is open source and has been duct-taped to make it work without the rest of the infrastructure. Don't follow blindly what seems like a standard practice of the best software engineers in the world. It might be an artifact of porting it to the open source world.
Feedback is welcome on Twitter @svenkreiss.