Subtleties of Python

In our pro­fes­sion, at­ten­tion to de­tail is of ut­most im­por­tance. Any good soft­ware en­gi­neer un­der­stands that de­tails mat­ter a lot, as they make the dif­fer­ence be­tween a work­ing unit or a dis­as­ter 1.

This is why clean code is not just about for­mat­ting or ar­rang­ing the code. It’s nei­ther a foible. It is in­stead, pay­ing at­ten­tion ex­act­ly to those de­tails that will make a big dif­fer­ence in pro­duc­tion.

Let’s see some ex­am­ples of this in Python.

Dis­claimer: The fol­low­ing cas­es be­ing pre­sent­ed are just ex­am­ples. They are not meant to be in­ter­pret­ed as pat­tern­s, in the sense of “ev­ery time you find X, ap­ply Y”.

Be­low there are two short cas­es of prob­lem­at­ic code I have found in whilst work­ing on past code bases.

Mutable Objects and Attributes

This is one of the main rea­sons why I like func­tion­al pro­gram­ming: im­mutabil­i­ty.

In func­tion­al soft­ware, we would­n’t be talk­ing about mu­ta­tion, be­cause such a thing is sim­ply not al­lowed. But I di­gress. The point is that in most of the soft­ware de­fined with ob­ject­s, they mu­tate, they change their in­ter­nal state or rep­re­sen­ta­tion. It’s true that we could have im­mutable ob­ject­s, but that’s usu­al­ly not the case.

Con­sid­er the fol­low­ing code (spoil­er: there is some­thing re­al­ly wrong with it).

sub­tleties0.py (Source)

"""Mutable objects as class attributes."""
import logging

logger = logging.getLogger(__name__)


class Query:
    PARAMETERS = {"limit": 100, "offset": 0}

    def run_query(self, query, limit=None, offset=None):
        if limit is not None:
            self.PARAMETERS.update(limit=limit)
        if offset is not None:
            self.PARAMETERS.update(offset=offset)
        return self._run(query, **self.PARAMETERS)

    @staticmethod
    def _run(query, limit, offset):
        logger.info("running %s [%s, %s]", query, limit, offset)

When try­ing to use dic­tio­nar­ies to pass key­word ar­gu­ments, it might be tempt­ing to mu­tate them in or­der to adapt it to the sig­na­ture of the func­tion we want to cal­l. But be­fore mu­tat­ing an ob­jec­t, we should think on it’s scope and the con­se­quences this mu­ta­tion will bring.

In this case, the dic­tio­nary that the method mu­tates be­longs to the class. Chang­ing it, will change the class. Af­ter it has been mu­tat­ed, the de­fault val­ues will re­main from the last time it was up­dat­ed. More­over, since this dic­tio­nary be­longs to the class, all in­stances (in­clud­ing new ones), will car­ry on with this:

>>> q = Query()
>>> q.run_query("select 1")
running select 1 [100, 0]

>>> q.run_query("select 1", limit=50)
running select 1 [50, 0]

>>> q.run_query("select 1")
running select 1 [50, 0]

>>> q.PARAMETERS
{'limit': 50, 'offset': 0}

>>> new_query = Query()
>>> new_query.PARAMETERS
{'limit': 50, 'offset': 0}

As we can ob­serve, this is not what we wan­t, and it’s ex­treme­ly frag­ile.

Here are some gen­er­al rec­om­men­da­tion­s:

  1. Don’t change mu­­ta­ble ob­­jects passed by pa­ram­e­ter to func­­tion­s. Cre­ate new copies of the ob­­jects when­ev­er pos­si­ble, and re­­turn them ac­­cord­ing­­ly.

  2. Don’t mu­­tate class at­tributes.

  3. Try not to set mu­­ta­ble ob­­jects as class at­tributes.

Need­less to say, there are ex­cep­tions to these rules, af­ter all prag­ma­tism beats pu­ri­ty. Here are some ob­vi­ous ex­cep­tions to these rules 2: :

  • For item 1, we al­ways have to con­sid­er the trade-off mem­o­ry/speed. If the ob­ject it’s too big (per­haps a large dic­tio­nary), run­ning copy­.deep­copy­() on it will be slow, and it will take a lot of mem­o­ry, so it’s prob­a­bly faster to just mod­i­fy it in place.

  • The ob­vi­ous ex­­cep­­tion for rule [2] is when us­ing de­scrip­­tors, but be­­cause it’s like­­ly an sce­­nario on which we are coun­t­ing on that side ef­fec­t. Oth­­er than that, there should­n’t be any rea­­son to go on such a dan­ger­ous path.

  • Rule [3] should­n’t be a prob­lem if the at­tributes are read­­-on­­ly, and we’re sure they are nev­er go­ing to be mu­­tat­ed. In that case, set­t­ing dic­­tio­­nar­ies and lists as class at­tributes, might be fine, but I’m on­­ly wor­ried about the fact that even though you can be sure that right now there is no method mu­­tat­ing them, you can’t as­­sure no­­body will break that rule in the fu­­ture.

Iterators

The it­er­a­tor pro­to­col in Python is a great fea­ture. It al­lows us to treat an en­tire set of ob­jects by their be­hav­iour re­gard­less of their in­ter­nal rep­re­sen­ta­tion.

For in­stance we can write a code like:

for i in myiterable: ...

And we don’t know exactly what myiterable is. It might be a list, a tuple, a dictionary, a string, and it will still work.

We can al­so re­ly on all meth­ods that use this pro­to­col,

mylist.extend(myiterable)

Of course mylist has to be a list, but again myiterable can be anything that can be iterated over.

Un­for­tu­nate­ly, this amaz­ing fea­ture it’s al­so the cause of some sub­tle headaches.

A long time ago, I re­mem­ber hear­ing a com­plaint from a co-­work­er say­ing that a part of the code was par­tic­u­lar­ly slow. The code in ques­tion was a func­tion that was sup­posed to process and then move files to a giv­en tar­get di­rec­to­ry. You can imag­ine some­thing like this:

def process_files(files_to_process, target_directory):
    for file_ in files_to_process:
        # ...
        shutil.copy2(file_, target_directory)

Now what happens? Like in the introduction, we are using the iterator protocol, so we rely on the fact that we don’t exactly know what files_to_process is exactly (a tuple, a list, etc.)

There is a subtle issue. Strings are also iterable. If you pass a single file, let’s say /home/ubuntu/foo, each character will be iterated over, starting with / (the root directory), following with h, etc. That’s why the programs was slow. It was copying the entire file system!

The so­lu­tion is to use a bet­ter in­ter­face, that dis­al­low these er­rors en­tire­ly:

def process_files(*files_to_process, target_directory):
    for file_ in files_to_process:
        # ...
        shutil.copy2(file_, target_directory)

In this example the signature of the function exposes a much nicer interface, in the sense that it can allow one or multiple files as arguments, without the issue of the previous example. This change also makes target_directory to be keyword-only, which is more explicit as well.

Closing words

I hope you en­joyed the con­tent, and that you got an idea of how crit­i­cal some de­tails can be.

The re­al val­ue of clean code is to avoid dis­as­ter­s. Top­ics like these ones (and more), are cov­ered in my lat­est ti­tle, should the read­er be in­ter­est­ed.

1

I’m not ex­ag­ger­at­ing when I say dis­as­ter. There are many fa­mous ex­am­ples of crash­es in soft­ware as a re­sult of edge cas­es, or er­rors that can be nar­rowed down to a sin­gle line of code.

2

The list is by no means ex­haus­tive.