Friday, January 17, 2025

Using (or really misusing) Path.resolve() in Python

 

I just stumbled upon some Python code that calls a plain `Path(...).resolve()` after receiving a path in an API, and it reminded me of the many bugs I've tripped over because of it. So I decided to share why I hope this function stops being so pervasive at some point: it's usually the wrong thing to call (maybe because when Python went from os.path to pathlib no real replacement was added, so now you need to use both).

The main source of those bugs is that `Path.resolve()` doesn't just make the path absolute: it also resolves symlinks (and substs on Windows).

This means that if the user crafted a structure such as

/myproject

/myproject/package.json

/myproject/src -> symlink to sources in /allmysources/temp

If you resolve `/myproject/src/myfile.py` to `/allmysources/temp/myfile.py` and then walk up the parents looking for `package.json`, it'll never be found, because `/myproject` is no longer among the parents of the resolved path!
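To make that concrete, here is a minimal sketch that recreates the layout above in a temporary directory and searches upwards for `package.json` from both the path as given and the resolved path. The `find_upwards` and `demo` helpers are hypothetical names made up for this illustration, and creating symlinks may need extra privileges on Windows.

```python
import os
import tempfile
from pathlib import Path


def find_upwards(start: Path, name: str):
    """Return the first `name` found in `start` or any of its parents, else None."""
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.exists():
            return candidate
    return None


def demo():
    # realpath() here only makes sure the temp dir itself has no symlinks
    # in it (as can happen with /tmp on macOS), so the comparison is fair.
    root = Path(os.path.realpath(tempfile.mkdtemp()))
    project = root / "myproject"
    sources = root / "allmysources" / "temp"
    project.mkdir()
    sources.mkdir(parents=True)
    (project / "package.json").write_text("{}")
    (sources / "myfile.py").write_text("")
    # myproject/src -> allmysources/temp
    (project / "src").symlink_to(sources, target_is_directory=True)

    as_given = project / "src" / "myfile.py"
    resolved = as_given.resolve()  # now points inside allmysources/temp

    found_unresolved = find_upwards(as_given.parent, "package.json")
    found_resolved = find_upwards(resolved.parent, "package.json")
    return found_unresolved, found_resolved, project
```

Searching from the path as given reaches `myproject/package.json`; searching from the resolved path walks up through `allmysources` instead and never passes through `myproject`.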

I was even looking at the Python docs and found this nugget in the part about `Path.parent`:

If you want to walk an arbitrary filesystem path upwards, it is recommended to first call Path.resolve() so as to resolve symlinks and eliminate ".." components.

And that's exactly why you shouldn't use it: if you do, the parents you walk are those of the symlink target, not of the path you were given, so they'll be completely off!

So, what should you do?

Unfortunately pathlib is not your friend here: you need to use `os.path.normpath(os.path.abspath(...))` to remove `..` occurrences from the string and make it absolute (`abspath()` already normalizes, so the extra `normpath()` is redundant but harmless), and after that, sure, use pathlib for the remainder of your program. Possibly you even want to get the real case in the filesystem if you're on Windows (which is very annoying, as you end up needing a bunch of `listdir()` calls to get the case stored in the filesystem).
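As a rough sketch of that boundary step (the function name `normalize_boundary_path` is just an illustrative choice, not an existing API):

```python
import os
from pathlib import Path


def normalize_boundary_path(raw: str) -> Path:
    # abspath() makes the path absolute against the current cwd, and
    # normpath() collapses "." and ".." as plain string operations.
    # Neither touches the filesystem, so symlinks are left as they are.
    return Path(os.path.normpath(os.path.abspath(raw)))
```

For instance, `normalize_boundary_path("a/../b")` gives `b` under the current working directory, whether or not `a` is a symlink.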

Also keep in mind that you usually only need to do that on boundaries, i.e. when you receive a relative path as a command line argument, for instance, not when you receive an API call from another program: chances are the cwd of that program and your own are different, and thus calls to `absolute()` or `resolve()` there are actually bugs.
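For paths that arrive through an API, one option is to validate instead of fixing them up. This is only a sketch, and `require_absolute_normalized` is a hypothetical helper name:

```python
import os


def require_absolute_normalized(raw: str) -> str:
    # A relative path can't be interpreted safely here: the caller's cwd
    # is probably not ours, so refuse it instead of guessing.
    if not os.path.isabs(raw):
        raise ValueError(f"expected an absolute path, got: {raw!r}")
    # Reject paths that still contain "..", "." or doubled separators.
    if os.path.normpath(raw) != raw:
        raise ValueError(f"expected a normalized path, got: {raw!r}")
    return raw
```

Failing loudly here pushes the normalization back to the program that actually knows what the relative path was relative to.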

But isn't there any use-case for `Path.resolve()`?

As far as I know, the only case where it's valid is if you do want to create a cache of sorts where any representation of that file is considered the same (i.e.: you really want the canonical representation of the file). 

Say, in an IDE you opened your `subst` in `x:`, and then the debugger reports both `c:\project\foo.py` and `x:\foo.py`. You want to always show the version that's opened in your IDE, so you need to build that mapping so that regardless of the version that's accessed you open the version being seen in your IDE. And oh my god, I've seen so many IDEs get that wrong; the effect is usually the same file opened under different names in the IDE. I know that in the pydevd debugger this is especially tricky because the file name comes from a .pyc file, which depends on how it was generated, and the cached version may not match the version the user is currently seeing. The debugger goes to great lengths to keep an internal canonical representation plus a different version which should be what the user sees, but it's not perfect, and IDEs which actually open the file afterwards don't do the due diligence of mapping it properly. It's even worse when they fail because the path is the same but the drive is uppercase while the path they hold internally has the drive in lowercase.
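A very rough sketch of such a mapping could look like the following. The `OpenFileRegistry` class and its methods are hypothetical, invented for this example (this is not how pydevd or any particular IDE does it); the point is only that `resolve()` is used for the lookup key while the path shown to the user stays untouched.

```python
import os
from pathlib import Path


class OpenFileRegistry:
    """Maps any spelling of a file to the path the user actually opened."""

    def __init__(self):
        self._opened_by_key = {}

    @staticmethod
    def _key(path) -> str:
        # resolve() collapses symlinks/substs; normcase() also folds the
        # case on Windows (so "C:" and "c:" end up as the same key).
        return os.path.normcase(str(Path(path).resolve()))

    def mark_opened(self, path) -> None:
        # Keep the first spelling the user opened as the one to display.
        self._opened_by_key.setdefault(self._key(path), str(path))

    def display_path(self, path) -> str:
        # Fall back to the given path when the file isn't open yet.
        return self._opened_by_key.get(self._key(path), str(path))
```

With this, a file opened through a symlinked directory is still reported under the symlinked name when the debugger later refers to it by its real location.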

Maybe even then it'd be better to use the inode of the file for a reference (gotten from `Path.stat()`)...
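A minimal sketch of the inode idea, assuming the files exist (`stat()` raises otherwise); `file_identity` and `is_same_file` are names made up for this example, and the standard library's `os.path.samefile()` performs a similar comparison:

```python
from pathlib import Path


def file_identity(path):
    # stat() follows symlinks, so every name that leads to the same file
    # yields the same (device, inode) pair.
    st = Path(path).stat()
    return (st.st_dev, st.st_ino)


def is_same_file(a, b) -> bool:
    return file_identity(a) == file_identity(b)
```

The identity pair can be used as a dictionary key for the cache, without ever rewriting the path the user gave you.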

Anyways, to sum up: in its current form, I'd recommend to ALMOST NEVER use `resolve()`, as it's usually just a source of bugs. Keep to `Path(os.path.normpath(os.path.abspath(...)))` on program boundaries, as that's just saner in general (and when you pass files among APIs, make sure paths are already absolute and normalized, and fail if they aren't).

p.s.: just as a disclaimer, I've also seen a very small subset of users on Linux (just one so far really, but maybe there are others out there who want that behavior) who say they want the symlinks resolved: the structure they created is always final, and the symlinks are just a convenience so that they can freely cd into them without having to cd into the real structure (and as such they don't want the symlinks to affect the structure). I don't think this is the most common behavior, and if it is, programs probably need a flag to decide whether to use it or not (for instance, I know vite from javascript has a `resolve.preserveSymlinks` setting so that the user can decide). Although it's also true that most users don't even care, because they don't put their source code in a symlink or Windows subst 😊