PyDev adventures

Friday, January 17, 2025

Using (or really misusing) Path.resolve() in Python

I just stumbled upon a code in Python that uses a simple `Path(...).resolve()` after receiving a path in an API and suddenly reminded myself on the many the bugs I've tripped over due to it, so, decided to share why I hope at some point this function stops being so pervasive as it's usually the wrong thing to call.. (maybe because when Python went from os.path to pathlib they added no real replacement and now you need to use both).

The main reason of bugs for that is that when you call Path.resolve() it'll actually resolve symlinks (and substs on Windows).

This means that if the user crafted some structure as

/myproject
/myproject/package.json
/myproject/src -> symlink to sources in /allmysources/temp

If you resolve `/myproject/src/myfile.py` to `/allmysources/temp/myfile.py` and actually want to get to the `package.json`, it'll never be found!

I was even looking at the python docs and found this nugget when talking about `Path.parent`:

If you want to walk an arbitrary filesystem path upwards, it is recommended to first call Path.resolve() so as to resolve symlinks and eliminate ".." components.

And that's completely why you shouldn't use it because if you do, then the parents will be completely off!

So, what should you do?

Unfortunately pathlib is not your friend here, you need to use `os.path.normpath(os.path.abspath(...))` to remove `..` occurrences from the string and make it absolute (and then after, sure, use pathlib for the remainder of your program) -- possibly you even want to get the real case in the filesystem if you're on Windows (which is very annoying as then you end up needing to do a bunch of listdir() calls to get the case stored in the filesystem).

-- but also, keep in mind you usually just need to do that on boundaries (i.e.: when you receive a relative path as an argument in the command line for instance, not when you receive an API call from another program -- chances are, the cwd of that program and your own are different and thus calls to `absolute()` or `resolve()` are actually bugs).

But isn't there any use-case for `Path.resolve()`?

As far as I know, the only case where it's valid is if you do want to create a cache of sorts where any representation of that file is considered the same (i.e.: you really want the canonical representation of the file).

Say, in an IDE you opened your `subst` in `x:` and then you have `c:\project\foo.py` and `x:\foo.py` in a debugger and you want to always show the version that's opened under your IDE, you need to build that mapping so that regardless of the version that's accessed you will open the version that's being seen in your IDE (and oh my god, I've seen so many IDEs get that wrong and the effect is usually you'll have the same file opened under different names in the IDE -- I know that in the pydevd debugger this is especially tricky because the name of the file is gotten from a .pyc file, which depends on how it was generated and the cached version may not match the version that the user is currently seeing -- the debugger goes through great lengths to have an internal canonical representation and a different version which should be what the user sees, but it's not perfect and IDEs which actually open the file afterwards don't make the due diligence to try to do the proper mapping afterwards -- even worse when they fail because the path is the same but the drive is uppercase and internally the path they have has the drive in lowercase).

Maybe even then it'd be better to use the inode of the file for a reference (gotten from `Path.stat()`)...

Anyways, to sum up, in its current format, I'd just recommend to ALMOST NEVER use `resolve()` as it's usually just the source of bugs, keep to `Path(os.path.normpath(os.path.abspath(...)))` on program boundaries as that's just saner in general (and when you want to pass files among APIs, make sure paths are already absolute and normalized and fail if they aren't).

p.s.: just as a disclaimer, I've also seen a very minor subset of users on Linux (just one so far really, but maybe there are others that want that behavior out there) that say they want the symlinks resolved as the created structure they have is always final and the symlinks are just a commodity to avoid having to cd into that structure so that they can freely cd into symlinks (and as such they don't want to affect the structure when they created symlinks), but I don't think this is the most common behavior (and if it is, then programs probably need to have a flag determining on whether to use it or not (for instance, I know vite from javascript has a resolve.preserveSymlinks setting so that the user can decide if they want it or not) -- although it's also true that most users don't even care, because they don't put their source code in a symlink or Windows subst 😊

Sunday, February 04, 2024

PyDev Debugger and sys.monitoring (PEP 669) -- i.e.: really fast debugging for Python 3.12!

The latest release of PyDev (12.0.0) is now available and it brings a really nice speed improvement for those who are already in Python 3.12! -- If you're a LiClipse user, it's now available in LiClipse 11.

The PyDev Debugger now uses sys.monitoring, which enables faster debugging (on my tests it can be up to 15 times faster than the version using sys.setttrace, depending on the use case -- kudos to Mark Shannon for PEP 669 😉).

It took me a while to cover the many scenarios that pydevd deals with, but the good thing is that most of the infrastructure available in pydevd didn't need changes (the debugger already had all the concepts needed as it already tried to trace only the frames which were actually needed, which definitely helped a lot as this is now pretty central in how to deal with the tracing using sys.monitoring)

Given that it's now out, I'll talk about how it works and some of the caveats when using it.

So, the first thing to note is that PEP 669 defines a bunch of callbacks which are now related to the code object (and not the frame as happened with sys.settrace) and when inside one of those callbacks the debugger can react to decide what should happen.

Some things that could happen could be pausing due to a breakpoint or a step instruction or deciding that the given code should not be traced again (by returning a DISABLE).

The reason it becomes faster than the previous approach is that the DISABLE is then considered by the Python interpreter which then bakes that DISABLE into the code when it's executing (up until sys.monitoring.restart_events() is called again). This does come with a big caveat though: if there are multiple threads running the program and one of those threads returns a DISABLE instruction then the related callback will actually be disabled for all the threads. Note that if DISABLE is not returned, the speed ends up being close to what was available with sys.settrace (which wasn't all that bad in pydevd already, but definitely a step down from what is now available).

This means that the debugger is really much faster when going for a breakpoint (because it can DISABLE many of the tracing instructions), but after a breakpoint is hit, if one thread is doing a step operation, then the speed reverts back to being close to the sys.settrace variant (the debugger can still DISABLE tracing for places it knows the user never wants to stop, but it cannot DISABLE the tracing for any code in any thread where the user may want to stop, because the thread which is stepping could be affected by a DISABLE in another thread which is not stepping, or at least the stepping must be considered for all threads even if a given thread hasn't really stopped and is not stepping).

Also, it's worth to mention that sys.monitoring has finer grained events vs. the ones available in sys.settrace, which is good as for instance, it's possible to get events from exceptions separate from events related to entering/returning from a function or lines, which helps the debugger in tracing only what's actually needed (the tracing debugger had to do lots of gymnastics to create a separate tracer just for exceptions when there would be no need to trace a given scope for lines, so, the new code ends up being both simpler and faster).

Note however that in pydevd I've done some other improvements and stepping should be more responsive even when using sys.settrace with older versions of Python!

p.s.: the attach to process is still not available for Python 3.12.

Note: back in the day other players which make use of pydevd such as JetBrains and Microsoft did step up to finance it, but at this point the only support it has is from backers in the community.

So, if you enjoy using pydevd please consider becoming a supporter... I've actually just setup sponsoring through GitHub sponsorships (https://github.com/sponsors/fabioz). Any takes on becoming one of the first backers there 😉?

Friday, June 23, 2023

robocorp.log: a library to answer what happened in a Python run.

It's been a while since I don't post, so, I decided to shed some details on what I'm working now (which I think is a really nice -- open source -- tool for the Python ecosystem -- in my view it's something close to a "Time travel debugger" for Python, although it's mostly labelled as automatic logging for Python 😄).

I'll start with a bit of history though: I'm working with Robocorp right now (Robocorp is a company which focuses on automation projects -- it leverages and provides many open source tools, especially in the Python ecosystem).

A few years back Robocorp approached me to do a language server for Robot Framework (which started as a testing framework done in Python and right now is also used for doing automations, which Robocorp recommends for many clients doing automations -- although many clients also prefer to just use Python directly).

Now, one of the strong points of Robot Framework is that it generates an output which provides information on everything that happened during its execution (Keyword calls -- which is how "methods" are called in Robot Framework, if statements, return values, etc), so, it's almost a "Time travel" debugger as it records and shows information on everything that happened in the execution.

Now, this brings me to what I'm working right now: a library which records what happens inside a Python process -- along with a UI which makes it possible to inspect it afterwards (its the Python log counterpart of Robot Framework).

For those curious I've created a repo with one example showing the output from Robot Framework and from the one generated by the Robocorp's Python Framework: https://github.com/fabioz/log_examples.

You can also see a live example (which solves the https://rpachallenge.com/) output of Robot Framework log.html as well as the output of Robocorp's Python Framework log.html.

The easiest way to use it right now is by using the tasks from Robocorp's Python Framework (see: https://github.com/robocorp/robo/tree/master/tasks as the log will be automatically generated for tasks run through it -- mostly, mark your entry points with @task (from robocorp.tasks import task) and run with `python -m robocorp.tasks run` and get your `log.html` in the `output` showing method calls, if statements, assigns, returns, etc that happened in your run).

One question I got a few times is how does it work... Well, after working quite a bit on pydevd and debugpy one of the things scratched right from the start was trying to use Python debugger infrastructure due to a simple fact: if the debugger infrastructure was used then no one could actually debug the code while using the logging framework.

In the end, inspiration ended up coming from PyTest. p.s.: thanks to Bruno Oliveira for a discussion back in the day about how assertions were rewritten using import hooks to provide nicer messages in assertion failures in PyTest.

robocorp.log uses the same approach as PyTest (import hooks + ast rewriting), but instead of rewriting asserts it rewrites the whole method to add callbacks on what's happening.

So, something as:

def method():

a = 1

roughly becomes something as:

def method():

report_method_start(..., 'method')

a = 1

report_assign(..., 'a', a)

report_method_end()

The real output is a bit more contrived as it needs to deal with exceptions, yields, making sure the stack is correct, but I hope you get the idea.

Then, the callbacks are converted into a kind of journal of what happens and that is then fed to the log.html (the idea is having it directly in a view in VSCode in the future so that you can see the log being created in real time -- right now the info is added to a bunch of "robolog" files and embedded into the final log.html).

Now, the approach does come with one caveat: import hooks need to be setup prior to importing a module, code imported before setting it up won't be traced -- it's one of the reasons why it's recommended to use robocorp.tasks to do the launching instead of bolting the logging manually as it makes sure things happen in the proper order.

The second caveat is related to the object __repr__. The framework is quite keen on getting the representation from objects at various times, so, if the __repr__ is too slow the execution may be much slower or even worse, if it has side-effects bad things will happen (thankfully most objects do the right thing as a __repr__ with side effects is a bug and the program would misbehave on debuggers too).

The third caveat is that right now it needs to be told what needs to be traced (by default full logging is available for all code which is considered user code and libraries need to be manually specified to be logged when it's called directly from a function which is considered user code -- this may change in the future, but it's the current state of affairs).

The final caveat is that right now it'll only trace what's happening in the main thread. Other threads won't appear in the log.

Well, I guess this post is already a bit bigger than I planned so I'll stop here. For those interested in testing or reporting enhancements/bugs see: https://github.com/robocorp/robo/.

Enjoy!

Thursday, August 11, 2022

PyDev debugger: Going from async to sync to async... oh, wait.

In Python asyncio land it's always a bit of a hassle when you have existing code which runs in sync mode which needs to be retrofitted to run in async, but it's usually doable -- in many cases, slapping async on the top of a bunch of definitions and adding the needed await statements where needed does the trick -- even though it's not always that easy.

Now, unfortunately a debugger has no such option. You see, a debugger needs to work on the boundaries of callbacks which are called from python (i.e.: it will usually do a busy wait from a line event from a callback registered in sys.settrace which is always called as a sync call).

Still, users still want to do some evaluation in the breakpoint context which would await... What now? Classic questions of how to go from async to sync say this is not possible.

This happens because to run something in asynchronous fashion an asyncio loop must be used to run it, but alas, the current loop is paused in the breakpoint and due to how asyncio is implemented in Python the asyncio loop is not reentrant, so, we can't just ask the loop to keep on processing at a certain point -- note that not all loops are equal, so, this is mostly an implementation detail on how CPython has implemented it, but unless we want to monkey-patch many things to make it reentrant, this would be a no-no... also, even if possible, it's not possible in asyncio to force a given coroutine to execute, rather we schedule it and asyncio decides when it'll run afterwards).

My initial naive attempt was just creating a new event loop, but again, CPython gets in the way because 2 event loops can't even coexist in the same thread. Then I thought about recreating the asyncio loop and got a bit further (up to being able to evaluate an asyncio.sleep coroutine), but after checking the asyncio AbstractEventLoop it became clear that the API is just too big to reimplement safely (it's not just about implementing the loop, it's also about implementing network I/O such as getnameinfo, create_connection, etc).

In the end the solution implemented for the debugger is that to support await constructs for evaluation, a new thread is created with a new event loop and that event loop in that new thread will execute the coroutine (with the context of the paused frame passed to that thread for the evaluation).

This is not perfect as there are some cons, for instance, evaluating the code in a thread can mean that some evaluations may not work because some frameworks such as qt consider the UI thread as special and won't work properly, checks for the current thread won't match the thread paused and probably a bunch of other things, but I guess it's a reasonable tradeoff vs not having it at all as it should work in the majority of cases.

Keep an eye open for the next release as it'll be possible to await coroutines in the debugger evaluation and watches ;)

p.s.: For VSCode users this will also be available in debugpy.

Thursday, March 10, 2022

PyDev 9.3.0 (debugger improvements / last version with Python 2.7 - 3.5 support)

PyDev 9.3.0 is now available.

The main changes in this release are related to the debugger, with improvements such as:

Major issue fixed issue where variable children sometimes wouldn't expand correctly.
Fixed some case where automatic connection to subprocesses wouldn't work.
Debugging with Pandas is much improved with the addition of some custom converters.
Opt-in support to show paused greenlets by setting GEVENT_SHOW_PAUSED_GREENLETS=1.
Support for newer versions of gevent.
Information on user settings paths is cached to fix issue with slow debugging using pipenv project.
A warning is shown if getting some attribute / getting its repr is slow.
Interactively inspect matplotlib plots when the QtAgg backend is used.
Support for PySide2.
Better error messages in case Python 3.11 frozen modules are being used.

Also noteworthy is that this will be the last release supporting older versions of Python including Python 2.7 up to Python 3.5. Newer releases will only support Python 3.6 onwards.