I'm attempting to download several files in this kind of (pseudo code) loop:
for fname in files:
    size = src_host.stat(fname).st_size
    with src_host.open(fname, 'rb') as src:
        while True:
            data = src.read(BLOCKSIZE)
            if not data:
                break
This works fine for a few iterations, but then all of a sudden:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 634, in chdir
    self._session.cwd(path)
  File "/usr/lib/python3.6/ftplib.py", line 631, in cwd
    return self.voidcmd(cmd)
  File "/usr/lib/python3.6/ftplib.py", line 277, in voidcmd
    self.putcmd(cmd)
  File "/usr/lib/python3.6/ftplib.py", line 199, in putcmd
    self.putline(line)
  File "/usr/lib/python3.6/ftplib.py", line 194, in putline
    self.sock.sendall(line.encode(self.encoding))
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test.py", line 200, in <module>
    main(sys.argv[1:])
  File "./test.py", line 122, in main
    size = src_host.stat(fname).st_size
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 937, in stat
    return self._stat._stat(path, _exception_for_missing_path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 846, in _stat
    self._real_stat, path, _exception_for_missing_path
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 801, in __call_with_parser_retry
    result = method(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 768, in _real_stat
    lstat_result = self._real_lstat(path, _exception_for_missing_path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 729, in _real_lstat
    for stat_result in self._stat_results_from_dir(dirname):
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 620, in _stat_results_from_dir
    lines = self._host_dir(path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 613, in _host_dir
    return self._host._dir(path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 889, in _dir
    _FTPHost_dir_command, path, descend_deeply=True
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 594, in _robust_ftp_command
    self._check_inaccessible_login_directory()
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 577, in _check_inaccessible_login_directory
    self.chdir(presumable_login_dir)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 634, in chdir
    self._session.cwd(path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/error.py", line 172, in __exit__
    raise FTPOSError(*exc_value.args, original_exception=exc_value)
ftputil.error.FTPOSError: [Errno 32] Broken pipe
Debugging info: ftputil 4.0.0, Python 3.6.9 (linux)
I performed a tcpdump while this was happening and noticed the following (10.103.3.23 is the server and 192.168.222.169 is the client):
14:34:29.480574 IP 10.103.3.23.21 > 192.168.222.169.41872: Flags [P.], seq 1430:1447, ack 547, win 258, options [nop,nop,TS val 2546511753 ecr 1147529281], length 17: FTP: 226 Transfer OK
14:34:29.480595 IP 192.168.222.169.41872 > 10.103.3.23.21: Flags [.], ack 1447, win 502, options [nop,nop,TS val 1147835876 ecr 2546511753], length 0
14:34:30.048785 IP 192.168.222.169.41872 > 10.103.3.23.21: Flags [F.], seq 547, ack 1447, win 502, options [nop,nop,TS val 1147836444 ecr 2546511753], length 0
14:34:30.066974 IP 10.103.3.23.21 > 192.168.222.169.41872: Flags [.], ack 548, win 258, options [nop,nop,TS val 2546511812 ecr 1147836444], length 0
14:34:30.067031 IP 10.103.3.23.21 > 192.168.222.169.41872: Flags [F.], seq 1447, ack 548, win 258, options [nop,nop,TS val 2546511812 ecr 1147836444], length 0
14:34:30.067046 IP 192.168.222.169.41872 > 10.103.3.23.21: Flags [.], ack 1448, win 502, options [nop,nop,TS val 1147836462 ecr 2546511812], length 0
Line 3 of the tcpdump shows that it's in fact the client that sends the FIN flag to the server. So what's happening here?
Thanks for the report! :-)
First off, I have no idea yet why this happens, but I have a few questions which might help understanding the problem better.
When you say "This works fine for a few iterations", does that refer only to the inner loop or also to the outer loop? Can you download any files completely?
Is there anything special about the files? For example, very many, very short, very long files?
Does the transfer always stop at the same place, that is, is it always the same file that fails and is the amount of transferred/saved bytes the same? If not, do you see other patterns in the failures?
Is the server a public one? Is there any way to try out the failure myself?
I see you have a stat call that doesn't seem to be used in your code. Does it change anything if you leave this call out?
Can you please show actual working code (not only pseudo code), ideally a standalone script, that triggers the failure and the full traceback? I see that the traceback starts at FTPHost.chdir, but there's no explicit chdir call in your code. I don't know if it'll help, but I'd be curious which of your commands does the chdir call.
Regarding the script, of course you can and should mask confidential data, especially the password. :-)
Also:
How much time passes from entering the outer loop until you get the traceback?
Is there high load on the server or the client, apart from your FTP operations? Most, if not all, socket-level failures I've seen occurred under high load.
Hi!
- This is the outer loop, meaning for each new file I need to download. So in this case I was able to download 4 files completely until this failed.
- They are all the same size (around 500MB)
- The different files are 500MB each and we're downloading at roughly 27Mbps, so each file will take around 2.5 minutes to download. It seems like this occurs every time after the 4th file (at 10 minutes).
- No, unfortunately, it's a local server. But it's running FileZilla Server
- It's used, but in the pseudo code I didn't include it. This is in fact the whole code (I have rewritten the code to just exploit this bug):
#!/usr/bin/env python3
#
import logging
import os
import sys
import time
import yaml
# extra modules
import ftputil


def calculate_bw(amount, seconds):
    # amount in bytes, seconds in, well, seconds
    bps = float(amount * 8) / seconds
    # this should return it in mbps
    return bps / 1000 / 1000


def main(arg):
    with open(arg[0], 'r') as f:
        cfg = yaml.load(f, Loader=yaml.FullLoader)
    src = cfg["src"]
    files = [
        # the list of files to download - redacted from the output
    ]
    with ftputil.FTPHost(
        src["hostname"], src["username"], src["password"]
    ) as src_host:
        for fname in files:
            size = src_host.stat(fname).st_size
            print("Starting to transfer {} ({} bytes)".format(
                os.path.basename(fname), size,
            ))
            start = time.time()
            with src_host.open(fname, 'rb') as src:
                with open("/dev/null", 'wb') as dst:
                    while True:
                        data = src.read(cfg["blocksize"])
                        if not data:
                            break
                        dst.write(bytes(data))
            seconds = time.time() - start
            print("Transferring {} took {} seconds ({} mbps)".format(
                os.path.basename(fname), seconds, calculate_bw(size, seconds)
            ))


if __name__ == '__main__':
    main(sys.argv[1:])
I just commented out the src_host.stat() and now it runs without breaking.... Ok??
Thanks for the update.
Replying to ftputiluser:
I just commented out the src_host.stat() and now it runs without breaking.... Ok??
Not really ok, I'd say. ;-) Would be nice to solve the riddle.
Does the broken pipe traceback always come from the same line in your code (i. e. the stat call at the top of the loop or from the FTP file operations)?
Seems like it's always coming from the stat call. What is the FTP command for stat? I don't see how the stat happens on the wire...?
I just looked at what happened on the wire when I did this (I introduced a couple of time.sleep(5) around the actual stat call):
First I see an initial connect from source port 44164 towards the FTP server. That just logs in and then issues a PWD. Then my sleep. After that I see the .stat triggering the following FTP commands:
CWD /
CWD /
TYPE A
PASV
LIST
CWD /
CWD /
CWD /folder1
TYPE A
PASV
LIST
CWD /
CWD /
CWD /folder1/subfolder1
TYPE A
PASV
LIST
CWD /
CWD /
CWD /folder1/subfolder1/subfolder2   # this is the folder where the file resides
TYPE A
PASV
LIST
CWD /
This surely is hugely inefficient? Anyway, moving on:
After my sleep I will then start downloading the file, so a *new* control connection (source port 44176) is opened over port 21. That logs in, does PWD, CWDs DIRECTLY into the folder where the file resides (the stat method should learn from this) and then issues TYPE I, PASV and RETR to start downloading the file. The file is being downloaded, but then after 120 seconds, I suddenly see a 421 Connection timed out from the FTP server for the first control connection (with source port 44164). I guess this will then render all future stat() calls futile.
Transfer is done 30 seconds later and this is communicated in the 44176 session with a 226 Transfer ok.
Then my sleep() cycle happens again, wrapped around the stat. And then the interesting stuff happens. I don't see anything on the wire, but still the size of the file has been returned back to the script - so this must surely have been cached in the library somewhere. Incidentally, the following 3 files are all identical in size, except for the 4th (the 5th file in total). The last file is smaller than the other 4 and THEN we have the broken pipe. I guess the library then just stumbled over the fact that 44164 is gone and there's nowhere to place the commands, hence the broken pipe.
What do you think?
Replying to ftputiluser:
Seems like it's always coming from the stat call. What is the FTP command for stat? I don't see how the stat happens on the wire...?
As far as I know, there's no direct correspondence to a stat call in FTP, though MLSD and MLST come close. At the time ftputil was originally written (about 2002!), these commands didn't exist. Meanwhile more servers support these commands, I assume, but probably not enough to always rely on those commands.
So how does ftputil do the "stat" call? By retrieving and parsing the directory which contains the item to stat. These are the LIST commands you see in the FTP debugging output.
Although it's recommended to use MLSD if possible, I'll stick with the parsing because it works and doesn't depend on the server supporting the MLSD command. And I don't want to support both ways because the ftputil code is already complex enough. More on that later.
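To make the parsing approach concrete, here is a minimal, hypothetical sketch. It is not ftputil's actual parser (which lives in ftputil/stat.py and handles several listing dialects); it only illustrates pulling the size field out of one common Unix-style LIST line:

```python
# Illustration only: ftputil's real parsers handle many listing formats;
# this toy version assumes one common Unix-style layout:
# mode, link count, owner, group, size, date (3 fields), name.
def size_from_unix_list_line(line):
    fields = line.split(None, 8)
    return int(fields[4])  # the size column

line = "-rw-r--r--   1 ftp      ftp      524288000 Jan 01 12:00 file1.bin"
print(size_from_unix_list_line(line))  # 524288000
```

Real-world listings vary a lot (Microsoft format, locale-dependent dates, names with spaces), which is exactly why ftputil's parsing code is as involved as it is.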
Replying to ftputiluser:
I just looked at what happened on the wire when I did this (I introduced a couple of time.sleep(5) around the actual stat call):
First I see an initial connect from source port 44164 towards the FTP server. That just logs in and then issues a PWD. Then my sleep. After that I see the .stat triggering the following FTP commands:
This surely is hugely inefficient?
If you look just at the FTP commands, yes. ;-) I changed the ftputil testing approach while working towards version 4.0.0 and was surprised by these identical calls as well. The reason for the repeated CWD / calls is this code in trac/browser/ftputil/host.py. I think the other directory changes are a side effect of other implicit stat calls (maybe isdir). All of these read, parse and cache the whole directory listing.
Anyway, moving on:
After my sleep I will then start downloading the file, so a *new* control connection (source port 44176) is opened over port 21.
Since FTP isn't stateless, it's not possible to do a LIST call while a file transfer is in progress. (There's a comment on this at the beginning of the FTPHost class.) Because of this, ftputil creates (or reuses) an FTP connection for each FTP file-like object you open. To reduce the number of new FTP connections to create, ftputil tries to reuse FTP connections from closed FTP files.
There's a check for timed-out "child sessions" in FTPFile.open, but maybe this fails in your case.
That logs in, does PWD, CWDs DIRECTLY into the folder where the file resides (stat method should learn from this)
See above and other things I won't go into right now. ;-)
and then issues TYPE I, PASV and RETR to start downloading the file. File is being downloaded, but then after 120 seconds, I suddenly see a 421 Connection timed out from the FTP server for the first control connection (with source port 44164).
I was also thinking of a timeout, but didn't suggest it yet because I don't understand why the client would close the connection. Maybe that's a change in ftplib of newer Python versions. (ftputil uses ftplib internally.)
I guess this will then render all future stat() futile.
Yes. Unfortunately I don't know a way to work with the FTPHost instance once it timed out.
Transfer is done 30 seconds later and this is communicated in the 44176 session with a 226 Transfer ok.
Yes, the child sessions for the file-like objects have their own FTP session internally and aren't affected by the timeout in the parent FTPHost.
Then my sleep() cycle happens again, wrapped around the stat. And then the interesting stuff happens. I don't see anything on the wire, but still the size of the file has been returned back to the script - so this must surely have been cached in the library somewhere.
Yes, see above. Since the stat results are cached by ftputil on the client side, only the first stat call for the directory actually results in FTP commands to the server; the remaining stat calls are served from the cache. Unfortunately, this gives your "main connection" enough time to time out. So what you see isn't a timeout after 120 s, but a timeout after 10 minutes, which is a common setting for FTP servers. That it happens just after the fifth file transfer is a coincidence: by then the 10 minutes have passed, which led to the timeout of the "main connection".
Incidentally, the following 3 files are all identical in size, except for the 4th (the 5th file in total). The last file is smaller than the other 4 and THEN we have the broken pipe. I guess the library then just stumbled over the fact that 44164 is gone and there's nowhere to place the commands, hence the broken pipe.
At some point I also suspected that there might be something wrong with the reuse of the child sessions, but I think what I described above is explanation enough.
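The numbers in the report fit this explanation. A quick back-of-the-envelope calculation (using the ~500 MB file size and ~27 Mbps rate mentioned above) shows that four transfers keep the child connections busy for roughly ten minutes, while the idle main control connection hits a typical FTP server timeout:

```python
# Rough timing check using the figures from the report above.
size_bytes = 500 * 1000 * 1000   # ~500 MB per file
rate_mbps = 27                   # observed throughput

seconds_per_file = size_bytes * 8 / (rate_mbps * 1000 * 1000)
print(round(seconds_per_file))            # ~148 s per file, i.e. ~2.5 minutes
print(round(4 * seconds_per_file / 60))   # ~10 minutes for four files
```

So by the time the fifth file's stat call actually has to talk to the server (a cache miss, since that file's size differs), the main connection has been idle for about ten minutes and is already gone.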
ftputil does so many things behind the scenes that even I forget some of this sometimes. On the other hand, you could say the low-level stuff is so well-hidden/abstracted-away, that you tend to think in the high-level operations. ;-) From the user perspective of course that's the intention.
Now, what can you do about the timeout? You can't just call FTPHost.getcwd because the current directory is also cached. I think the best you can do is calling FTPHost.keep_alive(). This sends an actual PWD command to the server to prevent the timeout. (Of course, if the timeout happened before that, calling keep_alive is of no use, but in your case this should work.)
What to do if you have long file transfers, so that the parent session would time out even while transferring a single file? You could call FTPHost.keep_alive during the file transfers, but that's a bit messy. By the way, you can also use FTPHost.download (or FTPHost.download_if_newer), which both support a callback argument which could call keep_alive on the main FTPHost instance. But then, if you call keep_alive after each transfer chunk in download, this will unnecessarily slow down the transfer. In this case, you can use the more low-level copyfileobj in the ftputil.file_transfer module. This function has a chunk_size argument, so you can indirectly control how frequently your callback is called.
The timeout of FTP sessions is really annoying to account for, but I don't see a solution for that, only workarounds. To hide the timeout problem, it might technically be possible to call keep_alive from a thread, but that's rather heavy-handed, and ftputil would need to make sure that the PWD call for the timeout prevention doesn't run in parallel to any other FTP operation on the same connection. This would involve using locks, complicating matters further. I haven't thought about this in detail; there might even be possibilities for deadlocks. All in all, it's probably out of proportion to put such a workaround in ftputil.
Does this help? I'm interested in your feedback, including more ideas to deal with the timeout problem.
I'm closing this ticket for now. It seems to be about a timeout problem and I'd need more information from the reporter to help. However, I haven't gotten any reply for over a year.