~sschwarzer/ftputil#32: 
Exception when calling FTPHost.close after synchronize_times (Windows only)

Running the code

host = ftputil.FTPHost(hostname, user, password)
host.synchronize_times()
host.close()

triggers an exception

Traceback (most recent call last):
  File "g:\test\test.py", line 13, in <module>
    host.synchronize_times()
  File "g:\test\ftputil\ftputil.py", line 896, in __exit__
    self.close()
  File "g:\test\ftputil\ftputil.py", line 253, in close
    self.stat_cache.clear()
  File "g:\test\ftputil\ftp_stat_cache.py", line 117, in clear
    self.resize(0)
  File "g:\test\ftputil\ftp_stat_cache.py", line 100, in resize
    self._cache.size = new_size
  File "g:\test\ftputil\lrucache.py", line 186, in __setattr__
    del self.__dict[lru.key]
KeyError: '/htdocs/new/_ftputil_sync_'

with Python 2.6.1. This happens on Windows, but not under Ubuntu Intrepid.

(originally ​reported by Peter Stirling)

Status
RESOLVED FIXED
Submitter
schwa (unverified)
Assigned to
No-one
Submitted
15 years ago
Updated
15 years ago
Labels
bug library

schwa (unverified) 15 years ago · edit

I can reproduce the bug with Windows XP 3 and both Python 2.5.4 and 2.6.1. I can also confirm that it doesn't happen on Ubuntu 8.10 (Intrepid).

schwa (unverified) 15 years ago · edit

Peter Stirling ​wrote:

After further inspection it appears that the windows python2.6 install doesn't include a dll of the compiled heapq module

/usr/lib/python2.5/lib-dynload/_heapq.so, on ubuntu python2.5

However when I tried to test the pure python version it decided to work, so there must be a bad dll somewhere.

To test this I made a copy of the pure python heapq.py as heapq26.py, and commented out the code that replaces pure-python with the c versions (line 313). I then altered ftputil/lrucache.py(line 43) to import from this new module, and when I run the test program I no longer get the exception.

schwa (unverified) 15 years ago · edit

Running lrucache's tests also seems(*) to reveal the bug.

On Ubuntu:

$ python lrucache.py
<<class '__main__.LRUCache'> (0 elements)>
<<class '__main__.LRUCache'> (25 elements)>
<<class '__main__.LRUCache'> (24 elements)>
<<class '__main__.LRUCache'> (10 elements)>
<<class '__main__.LRUCache'> (10 elements)>
10
40
41
42
43
44
45
47
48
49
46
<<class '__main__.LRUCache'> (10 elements)>
1239353317.44
40
41
42
43
44
45
47
48
49
46

On Windows:

S:\ftputil>python lrucache.py
<<class '__main__.LRUCache'> (0 elements)>
<<class '__main__.LRUCache'> (25 elements)>
<<class '__main__.LRUCache'> (24 elements)>
Traceback (most recent call last):
  File "lrucache.py", line 219, in <module>
    cache.size = 10
  File "lrucache.py", line 195, in __setattr__
    del self.__dict[lru.key]
KeyError: 46

(*) We don't know for sure if it's the same bug but I think it's highly likely.

schwa (unverified) 15 years ago · edit

Peter, I can't reproduce your workaround with the modified heapq module. When I run the tests in the modified lrucache module (i. e. no ftputil involved), on Windows I still get the traceback for the lrucache tests. Also the traceback for the original ftputil usage, on which this issue is based, remains.

For comparison with your patches I attached mine. (Please ignore the debug code further down in lrucache.py.) What have we done differently?

schwa (unverified) 15 years ago · edit

I set the bug priority to critical because the problem might - in other contexts - apply to a lot of users of ftputil under Windows.

schwa (unverified) 14 years ago · edit

I made an extra branch, issue32, to investigate the bug. There, I combined the test code into a single module which contains code to reproduce the bug. Just run the script to see if the bug occurs. I also added the outputs of this script when run under Windows XP and Ubuntu Linux.

My results so far:

  • When overriding time.time() in the test code with time.clock() by assigning time.time = time.clock at the start of the script the output on Windows and Linux is the same, i. e. the bug no longer shows.
  • When using time.time() but prepended with time.sleep(2.0) (see function time_)--but without the following print statement--the bug still shows up.
  • When using the normal time.time() without the sleep delay but with just a print statement (not even the print t currently in the code), the bug does not occur.

So there seem to be issues with the way the "time managements" in the two environments work. Maybe this is also related to my Windows investigations happening in a virtual machine (VirtualBox 2.2). This might also explain why the bug hasn't been reported earlier as most Windows users assumedly will run a native Windows and so won't come across the bug (that is, if it really is limited to the virtual environment).

Peter, does the Windows Vista you mentioned in ​another mail run in a virtual machine? If yes, can you run the test script in a native Windows, i. e. outside any virtual machine, and does the bug happen there?

schwa (unverified) 14 years ago · edit

By the way, the Windows installation is a Windows XP SP3 (officially patched as of 2009-05-01 but without Windows Explorer 8). The Ubuntu Linux is version 8.10 with a kernel 2.6.27-11. The native OS runs on an AMD64 platform with a single processor/single core and without hardware virtualization support (this processor doesn't have any).

schwa (unverified) 14 years ago · edit

I think I got to the root cause, described in ​this and ​this web page.

As one can check with the attached program time_test.py, time.time measures more precise under Linux, time.clock under Windows. The output of the program under Linux is:

$ python time_test.py time
[1241286942.7352209, 1241286942.7352271, 1241286942.73523, 1241286942.7352331,
1241286942.7352359, 1241286942.735239, 1241286942.7352419, 1241286942.735245,
1241286942.7352469, 1241286942.73525]

[6.198883056640625e-06, 2.86102294921875e-06, 3.0994415283203125e-06,
2.86102294921875e-06, 3.0994415283203125e-06, 2.86102294921875e-06,
3.0994415283203125e-06, 1.9073486328125e-06, 3.0994415283203125e-06]

2.90870666504e-05

$ python time_test.py clock
[0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

0.0

while on Windows it's the other way around:

>python time_test.py time
[1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999,
1241287016.0969999, 1241287016.0969999, 1241287016.0969999, 1241287016.0969999,
1241287016.0969999, 1241287016.0969999]

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

0.0

>python time_test.py clock
[1.0336509249080539e-05, 1.6482541775560862e-05, 2.1511113842681123e-05,
2.6539685909801387e-05, 3.1568257976921645e-05, 3.7434925388561956e-05,
4.2184132340842205e-05, 4.7212704407962462e-05, 5.2241276475082727e-05,
5.7549213657043007e-05]

[6.1460325264803225e-06, 5.0285720671202609e-06, 5.0285720671202643e-06,
5.0285720671202575e-06, 5.8666674116403117e-06, 4.7492069522802485e-06,
5.0285720671202575e-06, 5.0285720671202643e-06, 5.3079371819602801e-06]

4.7212704408e-05

Because the lrucache module uses time.time irrespective of the platform, on Windows it assigns exactly the same timestamp to multiple cache nodes. This seems to confuse the heapq and/or lrucache module. I'll have to investigate this in more detail to find out what's going on here. :-)

ftputiluser (unverified) 14 years ago · edit

While I was trying to figure out where the (non-existant) bug was in the heapq module, I learned quite a lot about heaps. A heap is a special kind of binary tree where the keys of each node are greater than that of its parent (this means that you can remove the lowest value in the whole tree by removing the root of the tree, which is a low cost operation).

However for this to work you need a stable sort key, which is where the unix-windows fight causes the problem.

From the python docs:

time.clock()

On Unix, return the current processor time as a floating point number expressed in seconds. The precision, and in fact the very definition of the meaning of “processor time”, depends on that of the C function of the same name, but in any case, this is the function to use for benchmarking Python or timing algorithms.

On Windows, this function returns wall-clock seconds elapsed since the first call to this function, as a floating point number, based on the Win32 function QueryPerformanceCounter(). The resolution is typically better than one microsecond.

AND

time.time()

Return the time as a floating point number expressed in seconds since the epoch, in UTC. Note that even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

Windows is one platform where you only get 1s precision from time.time() (there are others).

​A patch I made

This uses a synthetic key for all comparisons made within the heapq module, and with it the code runs identically on my windows box and ubuntu.

The patch could be simplified if you didn't want to retain the LRUCache.mtime(key) interface (returns the modification time of the cached item with that key)

Peter Stirling

schwa (unverified) 14 years ago · edit

Peter,

thanks for your patch! I've also thought about using just a counter. :-) Thanks to Python having long integers, there won't be any problems with integer overflow or "wrap-around" situations, and we won't have to deal with platform dependencies.

Though this approach should fix the problem, I wonder why lrucache gets so confused. Granted, if the time value provides no stable sort order, I'd expect lrucache to perhaps not delete the actually least recently used item from the cache. However, that doesn't explain the traceback for the cache's size change where it seems the code takes a node from the heap list self.__heap but no longer finds the node's key* in the dictionary self.__dict. I'd like to know why this happens before I declare the bug fixed.

Regarding mtime, I need this for the "maximum age" functionality described in the documentation.

  • Here, I mean the node's key attribute, not the "sort key" (which is supposed to be the last-access time) for the heap.

schwa (unverified) 14 years ago · edit

It seems the problem is in LRUCache.__delitem__. Running with a time function which assigns the same timestamp on every invocation, I find ...

node = self.__dict[key]

extracts the correct node by its key, in the case of the test program that's 16.

del self.__dict[key]

deletes this node via its key.

self.__heap.remove(node)

deletes a node by searching through the __heap list and comparing the value it finds there with the node. However, since this comparison is done by using Node's __cmp__ method, the remove method removes the very first item in the list, which not happens to be the intended 16 but 15.

So the cache object ends up with a heap list that doesn't match the dictionary.

As far as I can see, the resizing in __setattr__ seems to work correctly.

So increasing the comparison key with every new node, as we both thought of, should reliably fix the bug.

schwa (unverified) 14 years ago · edit

Fixed in changeset 855.

schwa (unverified) 14 years ago · edit

The fixed version I sent to the original author of lrucache consists of these changesets.

Register here or Log in to comment, or comment via email.