~sschwarzer/ftputil#108: 
Broken symlinks generate needless lstat calls

Hi,

I'm trying to walk() a directory on an FTP site that contains a large amount of broken symlinks to files in a different directory. The site in question is:

​ftp://anonymous:anonymous@ftp.ebi.ac.uk/pub/databases/embl/new/

When walk('/pub/databases/embl/new/') runs, it sees a symlink to e.g. '../tsa/tsa_gfcx01_inv.dat.gz', and tries to follow it (even though followlinks is False). This starts an lstat of '/pub/databases/embl/tsa/' and the result of the directory listing is cached. If finds the file is missing and discards it.

However, when walk() hits the next broken symlink, it finds there's a cache miss for the file and does a new lstat of '/pub/databases/embl/tsa/' even though we just did it some milliseconds ago. This is then repeated for the hundreds of broken symlinks.

I have found two possible solutions:

1) Use the cache to detect that the directory has already been visited, and interpret a cache miss on the file as a broken link.

2) Change the semantics of walk(dir, followlinks=False) to not just discard directory symlinks, but also file symlinks. This is a small change in 'host.py':

--- host.py.orig  2017-01-27 16:20:04.633080980 +0100
+++ host.py 2017-01-27 16:20:22.027566063 +0100
@@ -905,7 +905,10 @@
             return
         dirs, nondirs = [], []
         for name in names:
-            if self.path.isdir(self.path.join(top, name)):
+            path = self.path.join(top, name)
+            if self.path.islink(path) and not followlinks:
+                continue
+            if self.path.isdir(path):
                 dirs.append(name)
             else:
                 nondirs.append(name)
Status
RESOLVED DUPLICATE
Submitter
ftputiluser (unverified)
Assigned to
No-one
Submitted
7 years ago
Updated
7 years ago
Labels
bug library

schwa (unverified) 7 years ago · edit

Thank you for the ticket and especially the thorough explanation and the patch!

I'm currently reworking the testing code for ftputil. Likely work on this ticket will follow then.

ftputiluser (unverified) 7 years ago · edit

Thanks! I subclassed FTPHost to reimplement walk() with my patch, so this is no longer urgent.

I just realized the two solutions are not mutually exclusive. The cache version could benefit other code paths than walk().

ftputiluser (unverified) 7 years ago · edit

It turns out we needed symlink support in the walk() method, so I had to track down the inefficiency in relation to broken symlinks. I'm posting a somewhat-ugly patch here just to give you an idea of the problem and solution, and as a sanity check on the solution I came up with.

To use the cache to detect broken links, we need to be able to use the cache to detect that a directory has been LIST'ed directly, not just that the path entered the cache as a result of a LIST of the parent directory. This distinction is not possible ATM, so I monkey-patched the cached StatResult? object to contain this info, as a minimal code change to show what is necessary. A proper solution would store this in a more official location.

Finally, I had to do some symlink resolving for file symlinks where the path is itself a symlink, to make sure we're detecting broken symlinks on the final destination, not just an intermediate step in a chain of file or directory symlinks. There's a possibility of endless looping there since I didn't make use of the visited_paths (to not over-complicate the patch).

There's a risk that this will report falsely broken links if the cache framework is disabled, or if the cache is pruned due to size overflow.

diff -r 40280d912474 ftputil/stat.py
--- a/ftputil/stat.py   Wed Jul 27 22:54:56 2016 +0200
+++ b/ftputil/stat.py   Mon Mar 06 13:58:33 2017 +0100
@@ -561,6 +561,10 @@
             loop_path = self._path.join(path, stat_result._st_name)
             self._lstat_cache[loop_path] = stat_result
             yield stat_result
+        # Mark the cache entry for 'path' so we know if this path has been dir()'ed directly. We will
+        # use this to detect broken symlinks.
+        if path != "/" and path in self._lstat_cache:
+            self._lstat_cache[path].dir_was_listed = True
 
     def _real_listdir(self, path):
         """
@@ -677,6 +681,21 @@
             dirname, _ = self._path.split(path)
             path = self._path.join(dirname, lstat_result._st_target)
             path = self._path.abspath(self._path.normpath(path))
+            while True:
+                dirname, filename = self._path.split(path)
+                dirname_lstat_result = self._real_lstat(dirname, _exception_for_missing_path)
+                if not stat.S_ISLNK(dirname_lstat_result.st_mode):
+                    break
+                parent_dirname, _ = self._path.split(dirname)
+                path = self._path.join(parent_dirname, dirname_lstat_result._st_target, filename)
+                path = self._path.abspath(self._path.normpath(path))
+            try:
+                dir_was_listed = getattr(self._lstat_cache[dirname], 'dir_was_listed', False)
+            except ftputil.error.CacheMissError:
+                dir_was_listed = False
+            if dir_was_listed and path not in self._lstat_cache:
+                print('WARNING: broken symlink %s -> %s' % (original_path, path))
+                return None
             # Check for cyclic structure.
             if path in visited_paths:
                 # We had seen this path already.

schwa (unverified) 7 years ago · edit

Thanks again for an awesome update! :-)

Probably I'll find some time for ftputil work next month.

ftputiluser (unverified) 7 years ago · edit

I had to patch the {walk()} method because it would still yield some broken symlinks. You have to check is the path is a dir AND if it's a file, else skip. See patch:

--- a/ftputil/host.py Wed Jul 27 22:54:56 2016 +0200
+++ b/ftputil/host.py   Mon Mar 13 13:46:01 2017 +0100
@@ -905,10 +905,15 @@
             return
         dirs, nondirs = [], []
         for name in names:
-            if self.path.isdir(self.path.join(top, name)):
+            path = self.path.join(top, name)
+            if not followlinks and self.path.islink(path):
+                continue
+            if self.path.isdir(path):
                 dirs.append(name)
+            elif self.path.isfile(path):
+                nondirs.append(name)
             else:
-                nondirs.append(name)
+                pass
         if topdown:
             yield top, dirs, nondirs
         for name in dirs:

ftputiluser (unverified) 5 years ago · edit

Hello Stefan,

I'm working through our collection of local patches at work, in an attempt to reduce the list. Is there anything I can do to get these patches integrated?

Thanks, Erik

schwa (unverified) 5 years ago · edit

Replying to ftputiluser:

I'm working through our collection of local patches at work, in an attempt to reduce the list. Is there anything I can do to get these patches integrated?

Good point. Which are the ftputil tickets you currently use patches for?

Generally, the chance of getting a patch integrated increases if you add unit tests that show how the old implementation fails and how the respective patch fixes the code and hence makes the test pass.

Can you please contact me via mail to sschwarzer (at) sschwarzer.net ?

ftputiluser (unverified) 5 years ago · edit

Thanks, I'll have a look at the test suite to see if I can mock the erratic server behaviour.

By "local patches", I meant patches to all open source packages. I think this is our only outstanding patch for ftputil :-)

schwa (unverified) 4 years ago · edit

I had a look at some of the involved code and I think the problem is too hairy to approach for now.

I'm very hesistant to change the semantics of some code if this lets the semantics differ from Python's os.walk.

I fear I'll only work on this ticket if I decide to make a larger revision of the caching code and its interaction with ftputil.stat.

schwa (unverified) 4 years ago · edit

I added a ticket [#129](https://todo.sr.ht/~sschwarzer/ftputil/129 "#129: enhancement: Make cache more effective (supersedes #108) (assigned)"). Implementing these cache enhancements should help a lot with this ticket and would be more generic and hence probably help in more situations.

schwa (unverified) 4 years ago · edit

Removing the milestone because this very likely won't make it into 4.0.0 and I don't know in which version instead.

schwa (unverified) 4 years ago · edit

When I ran the script ticket108.py, it stopped after 76 minutes ("real" time) / 47 minutes ("user" time) with a ftputil.error.TemporaryError: 421 Timeout., so even without the timeout, the script would have run at least 76 minutes.

schwa (unverified) 4 years ago · edit

Superseded by ticket [#129](https://todo.sr.ht/~sschwarzer/ftputil/129 "#129: enhancement: Make cache more effective (supersedes #108) (assigned)"), therefore I close this ticket. It's not exactly a duplicate, but almost. ;-)

Register here or Log in to comment, or comment via email.