Friday, November 25, 2016

Python Import the world Anti-pattern!

I was recently bitten by an anti-pattern which seems to be way too common in Python.

Now, what exactly is that "import the world" anti-pattern?

This anti-pattern (a name I just made up) is characterized by importing lots of other modules at the top level of your module or package, or by doing too much work at import time.

Now, you may ask: why is it bad? All the code I see in the Python world seems to be structured like that...

It's pretty simple actually: in Python, everything is dynamic, so when you import a module for the first time you're actually making the Python interpreter load that file and run its bytecode, which in turn creates classes and functions (and does anything else that sits in the global scope of your module or in a class body).

-- sure, even worse would be at import time going on to connect to some database or do other nasty stuff you wouldn't be expecting by just importing a module -- or who knows, going on to register to another service just because you imported some module! Importing code should be mostly free of side effects, besides, you know, generating those classes, methods and putting the module on sys.modules.
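To make that concrete, here is a minimal sketch showing that an import executes the module's top-level code. The module name noisy_module and its contents are made up for illustration; the file is written to a temporary directory so the example is self-contained.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical module source: everything at the top level runs on import.
MODULE_SOURCE = '''
SIDE_EFFECTS = []
SIDE_EFFECTS.append("top-level code ran")   # executed at import time

def helper():
    SIDE_EFFECTS.append("helper ran")        # executed only when called
'''


def import_noisy_module():
    """Write the hypothetical module to a temp dir and import it."""
    tmp_dir = tempfile.mkdtemp()
    with open(os.path.join(tmp_dir, "noisy_module.py"), "w") as stream:
        stream.write(MODULE_SOURCE)
    sys.path.insert(0, tmp_dir)
    try:
        return importlib.import_module("noisy_module")
    finally:
        sys.path.remove(tmp_dir)


noisy = import_noisy_module()
print(noisy.SIDE_EFFECTS)             # ['top-level code ran']
print("noisy_module" in sys.modules)  # True
```

The import alone already ran the append, while helper() did nothing because nobody called it. Replace that append with a database connection or a config file parse and you get exactly the kind of surprise described above.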

Ok, ok, I deviated from the main topic: why is it so bad having all those imports at the top-level of your module?

The reason is simple: nobody wants a gazillion dependencies just because they imported part of some module for some simple operation.

It's slow, so much so that any command line application that has passed the toy stage -- and is concerned about the user experience -- has to hack around it... really, just ask the Mercurial guys.

It wastes memory (why load all those modules if they won't be used anyway?).

It adds dependencies which wouldn't be needed in the first place (like, you have a library which needs a lapack implementation, which needs some ancient incantations to be compiled, and which I don't care about because I won't be using the functions that need it in the first place).

It makes testing just a part of your code much slower (i.e.: you'll load 500 modules just for a small unit-test which touches just a small portion of your code).

It makes testing with pytest-xdist much slower (because it'll import all the code in every one of its workers instead of loading just what would be needed for a given worker).

So, please, just don't.

Ok, but does that really happen in practice?

Well, let me show you the examples I have stumbled upon in the last few days:

1. conda: Conda is a super-nice command line application to manage virtual environments and get dependencies. But let's take a look under the hood:

The main thing you use in conda is the command line, so, let's say you want to play with the "conda_env.cli.main" module. How long does a simple "from conda_env.cli import main" take, and how much does it load?

Let's see:

>>> import sys
>>> sys.path.append('C:\\Program Files\\Brainwy\\PyVmMonitor 1.0.1\\public_api')
>>> import pyvmmonitor
>>> pyvmmonitor.connect()
>>> len(sys.modules)
123
>>> @pyvmmonitor.profile_method
... def check():
...     from conda_env.cli import main
...
>>> check()
>>> len(sys.modules)
451

And it generates the following call graph:



Now, wait, what just happened? Haven't you just imported a module? Yes... I have, and in turn it has taken 0.3 seconds, loaded its configuration file under the hood (and made up some kind of global state?), parsed yaml, and imported lots of other things in turn (which I wish never happened) -- and it'd be even worse if you did a "conda env" command because it imports lots of stuff, parses the arguments and then decides to call a new command line with subprocess with "conda-env" and goes on to do everything again (see https://github.com/conda/conda/blob/85e52ebfe88c3e68f7cc5db699a8f4c450400c4b/conda/cli/conda_argparse.py#L150).
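If you don't have a profiler at hand, a rough version of the same measurement can be done with the standard library alone. This is just a sketch: measure_import is a helper name I'm making up here, and "json" is only a stand-in for whatever module you want to check. Run it in a fresh interpreter, since modules that are already in sys.modules won't be loaded again.

```python
import importlib
import sys
import time


def measure_import(module_name):
    """Import module_name; return (seconds taken, names of newly loaded modules)."""
    before = set(sys.modules)
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    loaded = sorted(set(sys.modules) - before)
    return elapsed, loaded


# "json" is a stand-in; use e.g. "conda_env.cli.main" if it is installed.
elapsed, loaded = measure_import("json")
print("%.3fs, %d new modules" % (elapsed, len(loaded)))
```

On Python 3.7 or newer, running "python -X importtime -c "import some_module"" gives a per-module breakdown of the same information.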

2. mock: Again, this is a pretty nice library (so much so that it was added to the standard library in Python 3), but still, what do you expect from "import mock"?

Let's see:

>>> import sys
>>> sys.path.append('C:\\Program Files\\Brainwy\\PyVmMonitor 1.0.1\\public_api')
>>> import pyvmmonitor
>>> pyvmmonitor.connect()
>>> len(sys.modules)
123
>>> @pyvmmonitor.profile_method
... def check():
...     import mock
...
>>> check()
>>> len(sys.modules)
291

And it generates the following call graph:



Ok, now there are fewer deps, but the time is roughly the same. Why? Because to define its version, instead of doing:

__version__ = '2.0.0'

it did:

from pbr.version import VersionInfo

_v = VersionInfo('mock').semantic_version()
__version__ = _v.release_string()
version_info = _v.version_tuple()

And that went on to inspect lots of things system wide, including importing setuptools, which in turn parsed auxiliary files, etc... definitely not what I'd expect when importing some library which does mocks (really, setuptools is for setup time, not run time).

Now, how can this be solved?

Well, the most standard way is not putting the imports in the top-level scope. Just use local imports and try to keep your public API simple -- simple APIs are much better than complex APIs ;)

Some examples: if you have a library which wants to export some functions (say, loads and dumps) from a package's __init__, don't import the real implementations there just to put them in the namespace. Instead, define loads and dumps as functions in __init__ and have them do local imports, so that the actual implementation is loaded lazily the first time they're called (or do import it, but then make sure that the modules that contain loads and dumps don't have global imports themselves).
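A minimal sketch of that idea follows. The package name mypackage is hypothetical, and the standard json module stands in for the real (possibly heavy) implementation module.

```python
# Sketch of a hypothetical mypackage/__init__.py.
# The standard "json" module stands in for the package's real
# (possibly heavy) implementation module.


def loads(data):
    """Parse a string; the implementation is imported on first call."""
    import json  # local import: nothing is loaded until loads() is used
    return json.loads(data)


def dumps(obj):
    """Serialize an object; the implementation is imported on first call."""
    import json  # a repeated local import is little more than a sys.modules lookup
    return json.dumps(obj)


print(loads(dumps({"answer": 42})))  # {'answer': 42}
```

Users still write "from mypackage import loads", but "import mypackage" itself stays cheap: the cost of loading the implementation is paid only by the code paths that actually call it.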

Classes may be the trickier case if you want them to be the bases for others to implement (because in this case you really need the class at the proper place for users of your API). Here, you can explicitly import that class to make it available in your __init__, but then make sure that the module which defines that class imports only what's needed to define it, not what's needed to use it.

Please, don't try to use tricks such as https://github.com/bwesterb/py-demandimport for your own library... it just complicates the lives of everyone that wants to use it (and Python has the tools for you to work with that problem without having to resort to something which will change how Python imports behave globally). Also, don't try to load some global state behind the scenes (explicit is way better than implicit).

Maybe the ideal would be having Python itself do all imports lazily (but that's probably impossible right now) or having some support for a statement such as from xxx lazy_import yyy, so that you could just shove everything at your top level, but until then, you can resort to using local imports. Note: you could still create your own version of the lazy import to put things in the global scope, but as it's not standard, IDEs and refactoring tools may not always recognize it, so I'd advise against it too, given that local imports do work properly (although if you want to do some global registry of some kind, register entries as strings to be lazily loaded when needed instead of importing modules/classes to fill up your registry).

Really, this is not new: https://files.bemusement.org/talks/OSDC2008-FastPython ;)


3 comments:

Bruno said...

Wholeheartedly agree!

That reminded me: some two weeks ago or so I was trying to optimize test execution time in one of our projects (souring). A simple test which only created the main application model was taking 8 seconds, so I was trying to find what was taking so long. After taking a look at the call graph and noticing that most of the calls were to __import__, I decided to test this:

python -c "import souring.model._tests.test_module"

(just import the test, not even involving pytest)

How long did that take? 7.6 seconds. :(

To be fair, I think my computer needs to be formatted because I get the feeling it is too slow for most simple operations, but even then, ~95% of the time being spent importing modules is a pain.

Roberto Liffredo said...

Last year I was playing quite a bit with IronPython, and IIRC, that "trick" about versioning was a major reason why mock was not running there.
It was really frustrating...

Fabio Zadrozny said...

As a note, this was reported at: https://github.com/testing-cabal/mock/issues/385