~sschwarzer/ftputil#112: 
Erratic `OSError` exceptions in `readinto` in some unit tests

When running the ftputil unit tests with Python 3.5.3 on Fedora 25, some of the tests fail from time to time. These are

test/test_real_ftp.py::TestFTPFiles::test_only_closed_children FAILED
test/test_real_ftp.py::TestFTPFiles::test_no_timed_out_children FAILED
test/test_real_ftp.py::TestFTPFiles::test_no_delayed_226_children FAILED
...
test/test_real_ftp.py::TestOther::test_subsequent_reading FAILED

(output from Pytest). Again, occasionally some of the tests pass. However, when returning now to ftputil after some months, I haven't seen all of the four tests pass. I also saw the failures with Python 3.3.6, compiled from source, although readinto raises an IOError then instead of an OSError (​https://docs.python.org/3/whatsnew/3.5.html?highlight=readinto#io may be related here).

I do not see any of the above test methods fail for Python 2.7.13.

When the failures happen, the traceback reads like this:

___________________________________________________ TestOther.test_subsequent_reading ___________________________________________________

self = <test.test_real_ftp.TestOther object at 0x7fc0a7e20be0>

    def test_subsequent_reading(self):
        # Open a file for reading.
        with self.host.open("debian-keyring.tar.gz", "rb") as file1:
            pass
        # Make sure that there are no problems if the connection is reused.
>       with self.host.open("debian-keyring.tar.gz", "rb") as file2:

test/test_real_ftp.py:880: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ftputil/host.py:203: in open
    host = self._available_child()
ftputil/host.py:167: in _available_child
    host._session.pwd()
/usr/lib64/python3.5/ftplib.py:655: in pwd
    resp = self.voidcmd('PWD')
/usr/lib64/python3.5/ftplib.py:277: in voidcmd
    return self.voidresp()
/usr/lib64/python3.5/ftplib.py:250: in voidresp
    resp = self.getresp()
/usr/lib64/python3.5/ftplib.py:235: in getresp
    resp = self.getmultiline()
/usr/lib64/python3.5/ftplib.py:221: in getmultiline
    line = self.getline()
/usr/lib64/python3.5/ftplib.py:203: in getline
    line = self.file.readline(self.maxline + 1)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <socket.SocketIO object at 0x7fc0a93a7da0>, b = <memory at 0x7fc0a96271c8>

    def readinto(self, b):
        """Read up to len(b) bytes into the writable buffer *b* and return
            the number of bytes read.  If the socket is non-blocking and no bytes
            are available, None is returned.
    
            If *b* is non-empty, a 0 return value indicates that the connection
            was shutdown at the other end.
            """
        self._checkClosed()
        self._checkReadable()
        if self._timeout_occurred:
>           raise OSError("cannot read from timed out object")
E           OSError: cannot read from timed out object

/usr/lib64/python3.5/socket.py:573: OSError

A possibly interesting observation is that all of the above tests seem to take a few seconds to run although they open the file debian-keyring.tar.gz only for reading. This file is about 14 MB large.

The OSError exception from Python's I/O layer in my opinion is different from a timeout where the server sends a 4xx (temporary failure) reply to the client.

Status
RESOLVED FIXED
Submitter
schwa (unverified)
Assigned to
No-one
Submitted
6 years ago
Updated
6 years ago
Labels
bug library

schwa (unverified) 6 years ago · edit

Things to investigate:

  • Are there changes when we use a small file instead of debian-keyring.tar.gz?

  • Why do the tests run so long (several seconds)?

  • Why don't the failures occur for Python 2.7.13 ?

schwa (unverified) 6 years ago · edit

Replying to schwa:

  • Are there changes when we use a small file instead of debian-keyring.tar.gz?

When using a small file (CONTENTS, 641 bytes) none of the four above tests fails within three test runs. Also, the pauses while running each test aren't noticable.

schwa (unverified) 6 years ago · edit

When adding print calls to test_subsequent_reading like this:

    def test_subsequent_reading(self):
        # Open a file for reading.
        import sys
        print("=== before first read", file=sys.stderr)
        with self.host.open("debian-keyring.tar.gz", "rb") as file1:
            pass
        # Make sure that there are no problems if the connection is reused.
        print("=== before second read", file=sys.stderr)
        with self.host.open("debian-keyring.tar.gz", "rb") as file2:
            pass
        print("=== after second read", file=sys.stderr)
        assert file1._session is file2._session

the output shows that the long pause happens between "=== before first read" and "=== before second read". The exception happens almost immediately after printing "=== before second read".

I print to sys.stderr to make sure that the output is unbuffered, i. e. appears as soon as the prints are executed.

schwa (unverified) 6 years ago · edit

I added another print call to test_subsequent_reading:

    def test_subsequent_reading(self):
        # Open a file for reading.
        import sys
        print("=== before first read", file=sys.stderr)
        with self.host.open("debian-keyring.tar.gz", "rb") as file1:
            print("=== just opened", file=sys.stderr)
            pass
        # Make sure that there are no problems if the connection is reused.
        print("=== before second read", file=sys.stderr)
        with self.host.open("debian-keyring.tar.gz", "rb") as file2:
            pass
        print("=== after second read", file=sys.stderr)
        assert file1._session is file2._session

If the test fails, there's the long (several seconds) pause after "=== just opened", After printing "=== before second read" everything else happens quickly. Since "=== before second read" is printed, but "=== after second read" isn't, we can assume the exception happens in the open of the second with statement. This can also be seen in the traceback above.

If the test succeeds (which happens rarely), there's a long pause after "=== before sedond read", but curiously the execution up to his point is quick. We do not see the long pause between "=== just opened" and "=== before second read".

schwa (unverified) 6 years ago · edit

I tried to reproduce the bug in isolation from other tests in the attached ticket112.py. At first, the bug didn't seem to occur at all, but finally showed up in iteration 3660:

Traceback (most recent call last):
  File "./private/ticket112.py", line 43, in <module>
    test_subsequent_reading()
  File "./private/ticket112.py", line 31, in test_subsequent_reading
    with host.open("debian-keyring.tar.gz", "rb") as file2:
  File "/home/schwa/sd/python/ftputil/ftputil/host.py", line 203, in open
    host = self._available_child()
  File "/home/schwa/sd/python/ftputil/ftputil/host.py", line 167, in _available_child
    host._session.pwd()
  File "/usr/lib64/python3.5/ftplib.py", line 655, in pwd
    resp = self.voidcmd('PWD')
  File "/usr/lib64/python3.5/ftplib.py", line 277, in voidcmd
    return self.voidresp()
  File "/usr/lib64/python3.5/ftplib.py", line 250, in voidresp
    resp = self.getresp()
  File "/usr/lib64/python3.5/ftplib.py", line 235, in getresp
    resp = self.getmultiline()
  File "/usr/lib64/python3.5/ftplib.py", line 221, in getmultiline
    line = self.getline()
  File "/usr/lib64/python3.5/ftplib.py", line 203, in getline
    line = self.file.readline(self.maxline + 1)
  File "/usr/lib64/python3.5/socket.py", line 573, in readinto
    raise OSError("cannot read from timed out object")
OSError: cannot read from timed out object

The call chain looks the same as the previous one from running all unit tests.

schwa (unverified) 6 years ago · edit

Since in the unit tests a smaller file seemed to make the failure a lot less likely, I tried to reproduce the failure with a bigger file with exactly 512 MB. I got a traceback in iteration 2411:

Traceback (most recent call last):
  File "private/ticket112.py", line 43, in <module>
    test_subsequent_reading()
  File "private/ticket112.py", line 31, in test_subsequent_reading
    with host.open("debian-keyring.tar.gz", "rb") as file2:
  File "/home/schwa/sd/python/ftputil/ftputil/host.py", line 203, in open
    host = self._available_child()
  File "/home/schwa/sd/python/ftputil/ftputil/host.py", line 167, in _available_child
    host._session.pwd()
  File "/usr/lib64/python3.5/ftplib.py", line 655, in pwd
    resp = self.voidcmd('PWD')
  File "/usr/lib64/python3.5/ftplib.py", line 277, in voidcmd
    return self.voidresp()
  File "/usr/lib64/python3.5/ftplib.py", line 250, in voidresp
    resp = self.getresp()
  File "/usr/lib64/python3.5/ftplib.py", line 235, in getresp
    resp = self.getmultiline()
  File "/usr/lib64/python3.5/ftplib.py", line 221, in getmultiline
    line = self.getline()
  File "/usr/lib64/python3.5/ftplib.py", line 203, in getline
    line = self.file.readline(self.maxline + 1)
  File "/usr/lib64/python3.5/socket.py", line 573, in readinto
    raise OSError("cannot read from timed out object")
OSError: cannot read from timed out object

The last log entries were:

2017-07-24 14:09:00,231:DEBUG:Iteration 2407
2017-07-24 14:09:00,246:DEBUG:=== before first read
2017-07-24 14:09:00,256:DEBUG:=== just opened
2017-07-24 14:09:00,257:DEBUG:=== before second read
2017-07-24 14:09:00,260:DEBUG:=== after second read
2017-07-24 14:09:00,260:DEBUG:Iteration 2408
2017-07-24 14:09:00,275:DEBUG:=== before first read
2017-07-24 14:09:00,294:DEBUG:=== just opened
2017-07-24 14:09:00,295:DEBUG:=== before second read
2017-07-24 14:09:05,303:DEBUG:=== after second read
2017-07-24 14:09:05,303:DEBUG:Iteration 2409
2017-07-24 14:09:05,315:DEBUG:=== before first read
2017-07-24 14:09:05,329:DEBUG:=== just opened
2017-07-24 14:09:05,330:DEBUG:=== before second read
2017-07-24 14:09:10,337:DEBUG:=== after second read
2017-07-24 14:09:10,337:DEBUG:Iteration 2410
2017-07-24 14:09:10,358:DEBUG:=== before first read
2017-07-24 14:09:10,371:DEBUG:=== just opened
2017-07-24 14:09:10,372:DEBUG:=== before second read
2017-07-24 14:09:15,376:DEBUG:=== after second read
2017-07-24 14:09:15,376:DEBUG:Iteration 2411
2017-07-24 14:09:15,392:DEBUG:=== before first read
2017-07-24 14:09:15,411:DEBUG:=== just opened
2017-07-24 14:09:20,417:DEBUG:=== before second read

schwa (unverified) 6 years ago · edit

I can now reproduce the problem much more reliably with the attached script ticket112.2.py (renamed from ticket112.py by Trac).

schwa (unverified) 6 years ago · edit

I went through the function test_only_closed_children, most of the time stepping further into the code (with s in ipdb). I didn't notice anything strange.

After that, I used n to go through test_only_closed_children for 9 calls. That worked without visible problems and everything I saw during the execution seemed as expected.

Only after using c to run the problem regularly, I promptly got an OSError: cannot read from timed out object again.

schwa (unverified) 6 years ago · edit

I added some sleep calls to test_only_closed_children:

def test_only_closed_children():
    import time
    with ftputil.FTPHost(*LOGIN_DATA) as host:
        REMOTE_FILE_NAME = "debian-keyring.tar.gz"
        log.debug("=== before first read")
        #time.sleep(0.1)
        with host.open(REMOTE_FILE_NAME, "rb") as file_obj1:
            time.sleep(0.1)
            # Create empty file and close it.
            log.debug("=== before second read")
            with host.open(REMOTE_FILE_NAME, "rb") as file_obj2:
                time.sleep(0.1)
                pass
            # This should re-use the second child because the first isn't
            # closed but the second is.
            log.debug("=== before third read")
            #time.sleep(0.1)
            with host.open(REMOTE_FILE_NAME, "rb") as file_obj:
                time.sleep(0.1)
                assert len(host._children) == 2
                assert file_obj._host is host._children[1]
            log.debug("=== after third read")

Some sleep calls I started with were successively commented out. The above version of the code ran for 1000 iterations; after that I interrupted the program. The time between iterations was about 0.35 seconds. This makes sense because there are 3 sleep calls of 0.1 seconds each.

schwa (unverified) 6 years ago · edit

I also commented out the sleep call after the first host.open call. After removing the commented-out sleep calls, the function reads:

def test_only_closed_children():
    import time
    with ftputil.FTPHost(*LOGIN_DATA) as host:
        REMOTE_FILE_NAME = "debian-keyring.tar.gz"
        log.debug("=== before first read")
        with host.open(REMOTE_FILE_NAME, "rb") as file_obj1:
            # Create empty file and close it.
            log.debug("=== before second read")
            with host.open(REMOTE_FILE_NAME, "rb") as file_obj2:
                time.sleep(0.1)
                pass
            # This should re-use the second child because the first isn't
            # closed but the second is.
            log.debug("=== before third read")
            with host.open(REMOTE_FILE_NAME, "rb") as file_obj:
                time.sleep(0.1)
                assert len(host._children) == 2
                assert file_obj._host is host._children[1]
            log.debug("=== after third read")

The program is still able to run for 1000 iterations (and more). The time between iterations is 0.25 seconds.

schwa (unverified) 6 years ago · edit

Commenting out the first remaining sleep call (after the second open call) makes the program raise an OSError as in the ticket description. Sometimes this happens in the very first iteration. The log manages to show "=== after third read", so we can presume that the exception happens in FTPHost._available_child in the last FTPHost.open call.

The outputs "=== Iteration 1", "=== before first read" and "=== before second read" happen shortly after one another. On the other hand, the time between "=== before second read" and "=== before third read" is about 5 seconds.

schwa (unverified) 6 years ago · edit

Restoring the first sleep call and commenting out the second makes the function run repeatedly again. The time between iterations is about 5 seconds. I stopped the execution after about 50 iterations.

As it happens, 5 seconds is the timeout constant defined in ftputil/file.py as:

class FTPFile(object):
    """
    Represents a file-like object associated with an FTP host. File
    and socket are closed appropriately if the `close` method is
    called.
    """

    # Set timeout in seconds when closing file connections (see ticket <a href="/~sschwarzer/ftputil/51" title="~sschwarzer/ftputil#51: _FTPFile.close() blocks after reading only a part of a file">#51</a>).
    _close_timeout = 5

and used in close in the same class:

    def close(self):
        """Close the `FTPFile`."""
        if self.closed:
            return
        # Timeout value to restore, see below.
        # Statement works only before the try/finally statement,
        # otherwise Python raises an `UnboundLocalError`.
        old_timeout = self._session.sock.gettimeout()
        try:
            self._fobj.close()
            self._fobj = None
            with ftputil.error.ftplib_error_to_ftp_io_error:
                self._conn.close()
            # Set a timeout to prevent waiting until server timeout
            # if we have a server blocking here like in ticket <a href="/~sschwarzer/ftputil/51" title="~sschwarzer/ftputil#51: _FTPFile.close() blocks after reading only a part of a file">#51</a>.
            self._session.sock.settimeout(self._close_timeout)
            try:
                with ftputil.error.ftplib_error_to_ftp_io_error:
                    self._session.voidresp()
            except ftputil.error.FTPIOError as exc:
                # Ignore some errors, see tickets <a href="/~sschwarzer/ftputil/51" title="~sschwarzer/ftputil#51: _FTPFile.close() blocks after reading only a part of a file">#51</a> and <a href="/~sschwarzer/ftputil/17" title="~sschwarzer/ftputil#17: FTPFile.close() may result in FTPIOException">#17</a> at
                # https://todo.sr.ht/~sschwarzer/ftputil/51 and
                # https://todo.sr.ht/~sschwarzer/ftputil/17,
                # respectively.
                exc = str(exc)
                error_code = exc[:3]
                if exc.splitlines()[0] != "timed out" and \
                  error_code not in ("150", "426", "450", "451"):
                    raise
        finally:
            # Restore timeout for socket of `FTPFile`'s `ftplib.FTP`
            # object in case the connection is reused later.
            self._session.sock.settimeout(old_timeout)
            # If something went wrong before, the file is probably
            # defunct and subsequent calls to `close` won't help
            # either, so we consider the file closed for practical
            # purposes.
            self.closed = True

When I increase the timeout to, say, 8 seconds, the time between iterations grows to 8 seconds. So if we have only the first sleep call, FTPFile.close runs into the timeout.

schwa (unverified) 6 years ago · edit

Attached a script ticket112_reduced.py that still reproduces the problem. The only difference is that whereas the original script showed the OSError in the first iteration, the simplified script may take a few iterations to show the OSError. On my system, the exception usually occurs in the second or third iteration.

schwa (unverified) 6 years ago · edit

As more debugging showed, the trigger of the bug is in the method SocketIO.readinto from socket.py in Python 3 (here the Python 3.5.3 implementation):

    def readinto(self, b):
        """Read up to len(b) bytes into the writable buffer *b* and return
        the number of bytes read.  If the socket is non-blocking and no bytes
        are available, None is returned.

        If *b* is non-empty, a 0 return value indicates that the connection
        was shutdown at the other end.
        """
        self._checkClosed()
        self._checkReadable()
        if self._timeout_occurred:
            raise OSError("cannot read from timed out object")
        while True:
            try:
                return self._sock.recv_into(b)
            except timeout:
                self._timeout_occurred = True
                raise
            except error as e:
                if e.args[0] in _blocking_errnos:
                    return None
                raise

When explaining the bug, I refer to the function main in the recently attached ticket112_reduced.py.

When the first remote file is opened and closed in main (lines 24 and 25), FTPFile.close times out in the call self._session.voidresp (in about the middle of close):

    def close(self):
        """Close the `FTPFile`."""
        if self.closed:
            return
        # Timeout value to restore, see below.
        # Statement works only before the try/finally statement,
        # otherwise Python raises an `UnboundLocalError`.
        old_timeout = self._session.sock.gettimeout()
        try:
            self._fobj.close()
            self._fobj = None
            with ftputil.error.ftplib_error_to_ftp_io_error:
                self._conn.close()
            # Set a timeout to prevent waiting until server timeout
            # if we have a server blocking here like in ticket <a href="/~sschwarzer/ftputil/51" title="~sschwarzer/ftputil#51: _FTPFile.close() blocks after reading only a part of a file">#51</a>.
            self._session.sock.settimeout(self._close_timeout)
            try:
                with ftputil.error.ftplib_error_to_ftp_io_error:
                    self._session.voidresp()
            except ftputil.error.FTPIOError as exc:
                # Ignore some errors, see tickets <a href="/~sschwarzer/ftputil/51" title="~sschwarzer/ftputil#51: _FTPFile.close() blocks after reading only a part of a file">#51</a> and <a href="/~sschwarzer/ftputil/17" title="~sschwarzer/ftputil#17: FTPFile.close() may result in FTPIOException">#17</a> at
                # https://todo.sr.ht/~sschwarzer/ftputil/51 and
                # https://todo.sr.ht/~sschwarzer/ftputil/17,
                # respectively.
                exc = str(exc)
                error_code = exc[:3]
                if exc.splitlines()[0] != "timed out" and \
                  error_code not in ("150", "426", "450", "451"):
                    raise
        finally:
            # Restore timeout for socket of `FTPFile`'s `ftplib.FTP`
            # object in case the connection is reused later.
            self._session.sock.settimeout(old_timeout)
            # If something went wrong before, the file is probably
            # defunct and subsequent calls to `close` won't help
            # either, so we consider the file closed for practical
            # purposes.
            self.closed = True

Apart from the FTPIOError that is raised by the timeout, the SocketIO.readinto code

            try:
                return self._sock.recv_into(b)
            except timeout:
                self._timeout_occurred = True
                raise

sets the instance variable self._timeout_occured to True.

When main executes the second remote access (lines 27 and 28), this executes the call host._session.pwd in FTPHost._available_child. As already shown in previous comments, the lines in SocketIO.readinto

        if self._timeout_occurred:
            raise OSError("cannot read from timed out object")

will now "pick up" the timeout that was remembered by the previous remote file access (specifically, the FTPFile.close call).

This caused me a lot of confusion during debugging because the failure in the second FTPHost.open only happens because the timeout was set in the previous FTPFile.close call. (In hindsight, this maybe isn't so surprising, but it took a bit to get there. ;-) )

The problem doesn't happen in Python 2.7 because it uses a quite different implementation for socket file I/O and it seems Python 2.7 doesn't "remember" the timeout it ran into, so FTPHost.open has no problem reusing the previously timed-out FTP session (ftplib.FTP instance).

To fix the problem in ftputil, the obvious way would be to not only catch ftplib.error_temp and ftplib.error_reply in FTPHost._available_child, but also OSError. However, catching the lower-level exception here together with the higher-level ftplib.error_* exceptions looks odd and I wonder if there isn't a cleaner solution.

schwa (unverified) 6 years ago · edit

For the record, the implementation of SocketIO.readinto in Python 3.1 was

    def readinto(self, b):
        """Read up to len(b) bytes into the writable buffer *b* and return
        the number of bytes read.  If the socket is non-blocking and no bytes
        are available, None is returned.
        If *b* is non-empty, a 0 return value indicates that the connection
        was shutdown at the other end.
        """
        self._checkClosed()
        self._checkReadable()
        return self._sock.recv_into(b)

The timeout handling was added in Python 3.2.

schwa (unverified) 6 years ago · edit

Replying to schwa:

This caused me a lot of confusion during debugging because the failure in the second FTPHost.open only happens because the timeout was set in the previous FTPFile.close call. (In hindsight, this maybe isn't so surprising, but it took a bit to get there. ;-) )

Debugging was difficult because

  • The problem occurs only with specific timing. The use of a debugger can make the bug "disappear." Therefore, I had to use logging/print calls a lot more than I would have liked.
  • Some of the execution goes on in C code that is hidden from the Python debugger.
  • The I/O code uses quite a few inheritance relationships.

For a while, I thought self._timeout_occured may be set from C code in a base class of SocketIO.

schwa (unverified) 6 years ago · edit

I'm thinking of starting a discussion on this on python-dev.

The timeout handling in SocketIO is referenced in

schwa (unverified) 6 years ago · edit

The ​documentation for the ftplib module mentions OSError only in

ftplib.all_errors

The set of all exceptions (as a tuple) that methods of FTP instances may raise as a result of problems with the FTP connection (as opposed to programming errors made by the caller). This set includes the four exceptions listed above as well as OSError.

There's no mention of any exceptions for FTP.pwd.

schwa (unverified) 6 years ago · edit

An experiment in the interpreter confirms that a "normal" idle timeout still raises on error_temp, not an OSError. Otherwise I'd been very surprised because raising an OSError initially would probably break a lot of code that uses ftplib.

The socket timeout that we see in this ticket isn't the idle timeout of the FTP level. After all, the exception happens because the FTP operations happen fast, not because they happen slowly. With a time.sleep(0.1) in the first with block in main, the OSError doesn't occur although from the time between "Iteration n" outputs there is a timeout in each iteration. I wonder how the timeout in this case is different from the timeout that leads to the OSError.

schwa (unverified) 6 years ago · edit

Register here or Log in to comment, or comment via email.