Opened 2 years ago

Last modified 10 months ago

#108 assigned defect

Broken symlinks generate needless lstat calls

Reported by: ftputiluser Owned by: schwa
Priority: major Milestone: 4.0.0
Component: Library Version:
Keywords: Cc:

Description

Hi,

I'm trying to walk() a directory on an FTP site that contains a large amount of broken symlinks to files in a different directory. The site in question is:

ftp://anonymous:anonymous@ftp.ebi.ac.uk/pub/databases/embl/new/

When walk('/pub/databases/embl/new/') runs, it sees a symlink to e.g. '../tsa/tsa_gfcx01_inv.dat.gz', and tries to follow it (even though followlinks is False). This starts an lstat of '/pub/databases/embl/tsa/' and the result of the directory listing is cached. If finds the file is missing and discards it.

However, when walk() hits the next broken symlink, it finds there's a cache miss for the file and does a new lstat of '/pub/databases/embl/tsa/' even though we just did it some milliseconds ago. This is then repeated for the hundreds of broken symlinks.

I have found two possible solutions:

1) Use the cache to detect that the directory has already been visited, and interpret a cache miss on the file as a broken link.

2) Change the semantics of walk(dir, followlinks=False) to not just discard directory symlinks, but also file symlinks. This is a small change in 'host.py':

--- host.py.orig	2017-01-27 16:20:04.633080980 +0100
+++ host.py	2017-01-27 16:20:22.027566063 +0100
@@ -905,7 +905,10 @@
             return
         dirs, nondirs = [], []
         for name in names:
-            if self.path.isdir(self.path.join(top, name)):
+            path = self.path.join(top, name)
+            if self.path.islink(path) and not followlinks:
+                continue
+            if self.path.isdir(path):
                 dirs.append(name)
             else:
                 nondirs.append(name)

Change History (9)

comment:1 Changed 2 years ago by schwa

Milestone: 3.3.2
Status: newassigned

Thank you for the ticket and especially the thorough explanation and the patch!

I'm currently reworking the testing code for ftputil. Likely work on this ticket will follow then.

Last edited 2 years ago by schwa (previous) (diff)

comment:2 Changed 2 years ago by ftputiluser

Thanks! I subclassed FTPHost to reimplement walk() with my patch, so this is no longer urgent.

I just realized the two solutions are not mutually exclusive. The cache version could benefit other code paths than walk().

Last edited 2 years ago by ftputiluser (previous) (diff)

comment:3 Changed 2 years ago by ftputiluser

It turns out we needed symlink support in the walk() method, so I had to track down the inefficiency in relation to broken symlinks. I'm posting a somewhat-ugly patch here just to give you an idea of the problem and solution, and as a sanity check on the solution I came up with.

To use the cache to detect broken links, we need to be able to use the cache to detect that a directory has been LIST'ed directly, not just that the path entered the cache as a result of a LIST of the parent directory. This distinction is not possible ATM, so I monkey-patched the cached StatResult? object to contain this info, as a minimal code change to show what is necessary. A proper solution would store this in a more official location.

Finally, I had to do some symlink resolving for file symlinks where the path is itself a symlink, to make sure we're detecting broken symlinks on the final destination, not just an intermediate step in a chain of file or directory symlinks. There's a possibility of endless looping there since I didn't make use of the visited_paths (to not over-complicate the patch).

There's a risk that this will report falsely broken links if the cache framework is disabled, or if the cache is pruned due to size overflow.

diff -r 40280d912474 ftputil/stat.py
--- a/ftputil/stat.py	Wed Jul 27 22:54:56 2016 +0200
+++ b/ftputil/stat.py	Mon Mar 06 13:58:33 2017 +0100
@@ -561,6 +561,10 @@
             loop_path = self._path.join(path, stat_result._st_name)
             self._lstat_cache[loop_path] = stat_result
             yield stat_result
+        # Mark the cache entry for 'path' so we know if this path has been dir()'ed directly. We will
+        # use this to detect broken symlinks.
+        if path != "/" and path in self._lstat_cache:
+            self._lstat_cache[path].dir_was_listed = True
 
     def _real_listdir(self, path):
         """
@@ -677,6 +681,21 @@
             dirname, _ = self._path.split(path)
             path = self._path.join(dirname, lstat_result._st_target)
             path = self._path.abspath(self._path.normpath(path))
+            while True:
+                dirname, filename = self._path.split(path)
+                dirname_lstat_result = self._real_lstat(dirname, _exception_for_missing_path)
+                if not stat.S_ISLNK(dirname_lstat_result.st_mode):
+                    break
+                parent_dirname, _ = self._path.split(dirname)
+                path = self._path.join(parent_dirname, dirname_lstat_result._st_target, filename)
+                path = self._path.abspath(self._path.normpath(path))
+            try:
+                dir_was_listed = getattr(self._lstat_cache[dirname], 'dir_was_listed', False)
+            except ftputil.error.CacheMissError:
+                dir_was_listed = False
+            if dir_was_listed and path not in self._lstat_cache:
+                print('WARNING: broken symlink %s -> %s' % (original_path, path))
+                return None
             # Check for cyclic structure.
             if path in visited_paths:
                 # We had seen this path already.
Last edited 10 months ago by ftputiluser (previous) (diff)

comment:4 Changed 2 years ago by schwa

Thanks again for an awesome update! :-)

Probably I'll find some time for ftputil work next month.

comment:5 Changed 2 years ago by ftputiluser

I had to patch the {walk()} method because it would still yield some broken symlinks. You have to check is the path is a dir AND if it's a file, else skip. See patch:

--- a/ftputil/host.py	Wed Jul 27 22:54:56 2016 +0200
+++ b/ftputil/host.py	Mon Mar 13 13:46:01 2017 +0100
@@ -905,10 +905,15 @@
             return
         dirs, nondirs = [], []
         for name in names:
-            if self.path.isdir(self.path.join(top, name)):
+            path = self.path.join(top, name)
+            if not followlinks and self.path.islink(path):
+                continue
+            if self.path.isdir(path):
                 dirs.append(name)
+            elif self.path.isfile(path):
+                nondirs.append(name)
             else:
-                nondirs.append(name)
+                pass
         if topdown:
             yield top, dirs, nondirs
         for name in dirs:

comment:6 Changed 20 months ago by schwa

Milestone: 3.3.24.0.0

comment:7 Changed 10 months ago by ftputiluser

Hello Stefan,

I'm working through our collection of local patches at work, in an attempt to reduce the list. Is there anything I can do to get these patches integrated?

Thanks, Erik

comment:8 in reply to:  7 Changed 10 months ago by schwa

Replying to ftputiluser:

I'm working through our collection of local patches at work, in an attempt to reduce the list. Is there anything I can do to get these patches integrated?

Good point. Which are the ftputil tickets you currently use patches for?

Generally, the chance of getting a patch integrated increases if you add unit tests that show how the old implementation fails and how the respective patch fixes the code and hence makes the test pass.

Can you please contact me via mail to sschwarzer (at) sschwarzer.net ?

comment:9 Changed 10 months ago by ftputiluser

Thanks, I'll have a look at the test suite to see if I can mock the erratic server behaviour.

By "local patches", I meant patches to all open source packages. I think this is our only outstanding patch for ftputil :-)

Note: See TracTickets for help on using tickets.