Opened 11 months ago

Last modified 11 months ago

#141 new defect

Broken pipe when downloading several files

Reported by: ftputiluser Owned by: schwa
Priority: major Milestone:
Component: Library Version:
Keywords: Cc:

Description

I'm attempting to download several files in this kind of (pseudo code) loop:

for fname in files:
  size = src_host.stat(fname).st_size
  with src_host.open(fname, 'rb') as src:
    while True
      data = src.read(BLOCKSIZE)
      if not data:
        break

This works fine for a few iterations, but then all of a sudden:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 634, in chdir
    self._session.cwd(path)
  File "/usr/lib/python3.6/ftplib.py", line 631, in cwd
    return self.voidcmd(cmd)
  File "/usr/lib/python3.6/ftplib.py", line 277, in voidcmd
    self.putcmd(cmd)
  File "/usr/lib/python3.6/ftplib.py", line 199, in putcmd
    self.putline(line)
  File "/usr/lib/python3.6/ftplib.py", line 194, in putline
    self.sock.sendall(line.encode(self.encoding))
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test.py", line 200, in <module>
    main(sys.argv[1:])
  File "./test.py", line 122, in main
    size = src_host.stat(fname).st_size
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 937, in stat
    return self._stat._stat(path, _exception_for_missing_path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 846, in _stat
    self._real_stat, path, _exception_for_missing_path
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 801, in __call_with_parser_retry
    result = method(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 768, in _real_stat
    lstat_result = self._real_lstat(path, _exception_for_missing_path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 729, in _real_lstat
    for stat_result in self._stat_results_from_dir(dirname):
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 620, in _stat_results_from_dir
    lines = self._host_dir(path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/stat.py", line 613, in _host_dir
    return self._host._dir(path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 889, in _dir
    _FTPHost_dir_command, path, descend_deeply=True
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 594, in _robust_ftp_command
    self._check_inaccessible_login_directory()
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 577, in _check_inaccessible_login_directory
    self.chdir(presumable_login_dir)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/host.py", line 634, in chdir
    self._session.cwd(path)
  File "/usr/local/lib/python3.6/dist-packages/ftputil/error.py", line 172, in __exit__
    raise FTPOSError(*exc_value.args, original_exception=exc_value)
ftputil.error.FTPOSError: [Errno 32] Broken pipe
Debugging info: ftputil 4.0.0, Python 3.6.9 (linux)

I performed a tcpdump while this was happening and noticed the following (10.103.3.23 is the server and 192.168.222.169 is the client):

14:34:29.480574 IP 10.103.3.23.21 > 192.168.222.169.41872: Flags [P.], seq 1430:1447, ack 547, win 258, options [nop,nop,TS val 2546511753 ecr 1147529281], length 17: FTP: 226 Transfer OK
14:34:29.480595 IP 192.168.222.169.41872 > 10.103.3.23.21: Flags [.], ack 1447, win 502, options [nop,nop,TS val 1147835876 ecr 2546511753], length 0
14:34:30.048785 IP 192.168.222.169.41872 > 10.103.3.23.21: Flags [F.], seq 547, ack 1447, win 502, options [nop,nop,TS val 1147836444 ecr 2546511753], length 0
14:34:30.066974 IP 10.103.3.23.21 > 192.168.222.169.41872: Flags [.], ack 548, win 258, options [nop,nop,TS val 2546511812 ecr 1147836444], length 0
14:34:30.067031 IP 10.103.3.23.21 > 192.168.222.169.41872: Flags [F.], seq 1447, ack 548, win 258, options [nop,nop,TS val 2546511812 ecr 1147836444], length 0
14:34:30.067046 IP 192.168.222.169.41872 > 10.103.3.23.21: Flags [.], ack 1448, win 502, options [nop,nop,TS val 1147836462 ecr 2546511812], length 0

Line 3 from the tcpdump shows that it's in fact the client that sends the FIN-flag to the server. So what's happening here?

Change History (10)

comment:1 Changed 11 months ago by ftputiluser

Summary: Broken pipe when uploading several filesBroken pipe when downloading several files

comment:2 Changed 11 months ago by schwa

Thanks for the report! :-)

First off, I have no idea yet why this happens, but I have a few questions which might help understanding the problem better.

  • When you say "This works fine for a few iterations", does that refer only to the inner loop or also to the outer loop? Can you download any files completely?
  • Is there anything special about the files? For example, very many, very short, very long files?
  • Does the transfer always stop at the same place, that is, is it always the same file that fails and is the amount of transferred/saved bytes the same? If not, do you see other patterns in the failures?
  • Is the server a public one? Is there any way to try out the failure myself?
  • I see you have a stat call that doesn't seem to be used in your code. Does it change anything if you leave this call out?
  • Can you please show actual working code (not only pseudo code), ideally a standalone script, that triggers the failure and the full traceback? I see that the traceback starts at FTPHost.chdir, but there's no explicit chdir call in your code. I don't know if it'll help, but I'd be curious which of your commands does the chdir call.

Regarding the script, of course you can and should mask confidential data, especially the password. :-)

comment:3 Changed 11 months ago by schwa

Also:

  • How much time passes from entering the outer loop until you get the traceback?
  • Is there high load on the server or the client, apart from your FTP operations? Most, if not all, socket-level failures I've seen occurred under high load.

comment:4 Changed 11 months ago by ftputiluser

Hi!

  • This is the outer loop, meaning for each new file I need to download. So in this case I was able to download 4 files completely until this failed.
  • They are all the same size (around 500MB)
  • The different files are 500MB each and we're downloading at roughly 27Mbps, so each file will take around 2.5 minutes to download. It seems like this occurs every time after the 4th file (at 10 minutes).
  • No, unfortunately, it's a local server. But it's running FileZilla? server
  • It's used, but in the pseudo code I didn't include it. This is in fact the whole code (I have rewritten the code to just exploit this bug):
#!/usr/bin/env python3

# import logging
import os
import sys
import time
import yaml

# extra modules
import ftputil


def calculate_bw(amount, seconds):
    # amount in bytes, seconds in well, seconds
    bps = float(amount*8)/seconds

    # this should return it in mbps
    return bps/1000/1000


def main(arg):

    with open(arg[0], 'r') as f:
        cfg = yaml.load(f, Loader=yaml.FullLoader)

    src = cfg["src"]

    files = [
# the list of files to download - redacted from the output
        ]

    with ftputil.FTPHost(
            src["hostname"],
            src["username"],
            src["password"]
            ) as src_host:

        for fname in files:

            size = src_host.stat(fname).st_size

            print("Starting to transfer {} ({} bytes)".format(
                os.path.basename(fname),
                size,
                )
            )

            start = time.time()

            with src_host.open(fname, 'rb') as src:
                with open("/dev/null", 'wb') as dst:
                    while True:
                        data = src.read(cfg["blocksize"])
                        if not data:
                            break

                        dst.write(bytes(data))

            seconds = time.time() - start

            print("Transferring {} took {} seconds ({} mbps)".format(
                os.path.basename(fname),
                seconds,
                calculate_bw(size, seconds)
            )
            )


if __name__ == '__main__':
    main(sys.argv[1:])

comment:5 Changed 11 months ago by ftputiluser

I just commented out the src_host.stat() and now it runs without breaking.... Ok??

comment:6 in reply to:  5 Changed 11 months ago by schwa

Thanks for the update.

Replying to ftputiluser:

I just commented out the src_host.stat() and now it runs without breaking.... Ok??

Not really ok, I'd say. ;-) Would be nice to solve the riddle.

Does the broken pipe traceback always come from the same line in your code (i. e. the stat call at the top of the loop or from the FTP file operations)?

Last edited 11 months ago by schwa (previous) (diff)

comment:7 Changed 11 months ago by ftputiluser

Seems like it's always coming from the stat call. What is the FTP command for stat? I don't see how the stat happens on the wire...?

comment:8 Changed 11 months ago by ftputiluser

I just looked at what happened on the wire when I did this (I introduced a couple of time.sleep(5) around the actual stat call):

First I see an initial connect from source port 44164 towards the FTP server. That just logs in and then issues a PWD. Then my sleep. After that I see the .stat triggering the following FTP commands:

CWD /
CWD /
TYPE A
PASV
LIST
CWD /
CWD /
CWD /folder1
TYPE A
PASV
LIST
CWD /
CWD /
CWD /folder1/subfolder1
TYPE A
PASV
LIST
CWD /
CWD /
CWD /folder1/subfolder1/subfolder2 # this is the folder where the file resides
TYPE A
PASV
LIST
CWD /

This surely is hugely ineffective? Anyway, moving on:

After my sleep I will then start downloading the file, so a *new* control connection (source port 44176) is opened over port 21. That logs in, does PWD, CWDs DIRECTLY into the folder where the file resides (stat method should learn from this) and then issues TYPE I, PASV and RETR to start downloading the file. File is being downloaded, but then after 120 seconds, I suddenly see a 421 Connection timed out from the FTP server for the first control connection (with source port 44164). I guess this will then render all future stat() futile.

Transfer is done 30 seconds later and this is communicated in the 44176 session with a 226 Transfer ok.

Then my sleep() cycle happens again, wrapped around the stat. And then the interesting stuff happens. I don't see anything on the wire, but still the size of the file has been returned back to the script - so this have surely been cached in the library somewhere. Accidentally the following 3 files are all identical in size, except from the 4th (the 5th file in total). The last file is smaller than the other 4 and THEN we have the broken pipe. I guess the library then just stumbled over the fact that 44164 is gone and there's nowhere to place the commands, thus the broken pipe.

What do you think?

comment:9 in reply to:  7 Changed 11 months ago by schwa

Replying to ftputiluser:

Seems like it's always coming from the stat call. What is the FTP command for stat? I don't see how the stat happens on the wire...?

As far as I know, there's no direct correspondence to a stat call in FTP, though MLSD and MLST come close. At the time ftputil was originally written (about 2002!), these commands didn't exist. Meanwhile more servers support these commands, I assume, but probably not enough to always rely on those commands.

So how does ftputil do the "stat" call? By retrieving and parsing the directory which contains the item to stat. These are the LIST commands you see in the FTP debugging output.

Although it's recommended to use MLSD if possible, I'll stick with the parsing because it works and doesn't depend on the server supporting the MLSD command. And I don't want to support both ways because the ftputil code is already complex enough. More on that later.

comment:10 in reply to:  8 Changed 11 months ago by schwa

Replying to ftputiluser:

I just looked at what happened on the wire when I did this (I introduced a couple of time.sleep(5) around the actual stat call):

First I see an initial connect from source port 44164 towards the FTP server. That just logs in and then issues a PWD. Then my sleep. After that I see the .stat triggering the following FTP commands:

This surely is hugely ineffective?

If you look just at the FTP commands, yes. ;-) I changed the ftputil testing approach while working towards version 4.0.0 and was surprised by these identical calls as well. The reason for the repeated CWD / calls is this code in trac/browser/ftputil/host.py. I think the other directory changes are a side effect of other implicit stat calls (maybe isdir). All of these read, parse and cache the whole directory listing.

Anyway, moving on:

After my sleep I will then start downloading the file, so a *new* control connection (source port 44176) is opened over port 21.

Since FTP isn't stateless, it's not possible to do a LIST call while a file transfer is in progress. (There's a comment on this at the beginning of the FTPHost class.) Because of this, ftputil creates (or reuses) an FTP connection for each FTP file-like object you open. To reduce the number of new FTP connections to create, ftputil tries to reuse FTP connections from closed FTP files.

There's a check for timed-out "child sessions" in FTPFile.open, but maybe this fails in your case.

That logs in, does PWD, CWDs DIRECTLY into the folder where the file resides (stat method should learn from this)

See above and other things I won't go into right now. ;-)

and then issues TYPE I, PASV and RETR to start downloading the file. File is being downloaded, but then after 120 seconds, I suddenly see a 421 Connection timed out from the FTP server for the first control connection (with source port 44164).

I was also thinking of a timeout, but didn't suggest it yet because I don't understand why the client would close the connection. Maybe that's a change in ftplib of newer Python versions. (ftputil uses ftplib internally.)

I guess this will then render all future stat() futile.

Yes. Unfortunately I don't know a way to work with the FTPHost instance once it timed out.

Transfer is done 30 seconds later and this is communicated in the 44176 session with a 226 Transfer ok.

Yes, the child sessions for the file-like objects have their own FTP session internally and aren't affected by the timeout in the parent FTPHost.

Then my sleep() cycle happens again, wrapped around the stat. And then the interesting stuff happens. I don't see anything on the wire, but still the size of the file has been returned back to the script - so this have surely been cached in the library somewhere.

Yes, see above. Since the stat results are cached by ftputil on the client side, only the first stat call for the directory actually results in FTP commands to the server, the remaining stat calls are served from the cache. Unfortunately, this gives your "main connection" enough time to time out. So what you see isn't a timeout after 120 s, but a timeout after the 10 minutes, which is a common setting for FTP servers. That it happens just after the fifth file transfer is just a coincidence because by then the 10 minutes have passed which led to the timeout of the "main connection".

Accidentally the following 3 files are all identical in size, except from the 4th (the 5th file in total). The last file is smaller than the other 4 and THEN we have the broken pipe. I guess the library then just stumbled over the fact that 44164 is gone and there's nowhere to place the commands, thus the broken pipe.

At some point I also suspected that there might be something wrong with the reuse of the child sessions, but I think what I described above is explanation enough.

ftputil does so many things behind the scenes that even I forget some of this sometimes. On the other hand, you could say the low-level stuff is so well-hidden/abstracted-away, that you tend to think in the high-level operations. ;-) From the user perspective of course that's the intention.

Now, what can you do about the timeout? You can't just call FTPHost.getcwd because the current directory is also cached. I think the best you can do is calling FTPHost.keep_alive(). This sends an actual PWD command to the server to prevent the timeout. (Of course, if the timeout happened before that, calling keep_alive is of no use, but in your case this should work.)

What to do if you have long file transfers, so that the parent session would time out even while transferring a single file? You could call FTPHost.keep_alive during the file transfers, but that's a bit messy. By the way, you can also use FTPHost.download (or FTPHost.download_if_newer), which both support a callback argument which could call keep_alive on the main FTPHost instance. But then, if you call keep_alive after each transfer chunk in download, this will unnecessarily slow down the transfer. In this case, you can use the more low-level copyfileobj in the ftputil.file_transfer module. This function has a chunk_size argument, so you can indirectly control how frequently your callback is called.

The timeout of FTP sessions is really annoying to account for, but I don't see a solution for that, only workarounds. To hide the timeout problem, it might technically be possible to call keep_alive from a thread, but that's rather heavy-handed and ftputil would need to make sure that the PWD call for the timeout prevention doesn't run in parallel to any other FTP operation on the some connection. This would involve using locks, complicating matters further. I haven't thought about this in detail; there might even be possibilities for deadlocks. All in all, it's probably out of proportion to put such a workaround in ftputil.

Does this help? I'm interested in your feedback, including more ideas to deal with the timeout problem.

Note: See TracTickets for help on using tickets.