Opened 4 years ago

Closed 3 years ago

#96 closed task (fixed)

Explain why we get "latin1-encoded unicode strings" for paths

Reported by: schwa
Owned by: schwa
Priority: minor
Milestone: 3.3
Component: Documentation
Version:
Keywords: ftplib, unicode, bytes, encoding, latin1
Cc:

Description

刘昶 wrote to the ftputil mailing list with the following observation:

Test environment:
Server: FileZilla Server 0.9.50
Client OS: Win7


import ftputil

# List the names in the login directory.
with ftputil.FTPHost("localhost", user="honglei", passwd="111111") as ftp_host:
    names = ftp_host.listdir(ftp_host.curdir)


I find that:
   names[-1] == u'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt',
it is a UTF-8-encoded filename rather than a unicode string.

Change History (2)

comment:1 Changed 4 years ago by schwa

Thanks a lot for bringing this up.

Technically, this is a unicode string, but you're right in that you can see a UTF-8 encoding here.

When you use listdir in ftputil, it uses the standard library's ftplib to retrieve a directory listing. On Python 3, ftplib returns unicode strings. However, since the socket ultimately gets only bytes and ftplib doesn't know the encoding, it arbitrarily assumes latin1 encoding. Since this is an 8-bit encoding, there can't be decoding exceptions.
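To illustrate why latin1 can't fail here, the following is a minimal sketch. It is not code from ftplib or ftputil, and the function name `latin1_round_trip` is made up for this example. It shows that every byte value decodes under latin1, that encoding again restores the original bytes, and that the same bytes would be rejected by a UTF-8 decoder:

```python
def latin1_round_trip(raw: bytes) -> str:
    """Decode raw bytes as latin1 and check that no information is lost."""
    text = raw.decode("latin1")
    # latin1 maps byte value N to code point N, so encoding restores the bytes.
    assert text.encode("latin1") == raw
    return text


# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(256))
decoded = latin1_round_trip(all_bytes)

# Each character's code point equals the original byte value.
assert [ord(ch) for ch in decoded] == list(range(256))

# The same bytes are not valid UTF-8, so a UTF-8 decode raises.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Because the mapping is one-to-one, the string you get back still carries the server's original bytes, just presented as code points below 256.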

ftputil processes the strings returned by ftplib as they come, so ftputil in turn gives you those latin1-encoded unicode strings.

Since ftputil uses a unified API for Python 2 and 3, it applies the same unicode handling when run on Python 2.

If you know that the strings were decoded as latin1 and that the original bytes coming from the FTP server were UTF-8, you can recover the correctly decoded unicode string:

>>> s = u'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt'
>>> s.encode("latin1")
b'\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt'
>>> s.encode("latin1").decode("utf8")
'这是中文.txt'

I guess this is the name you expected.

In the general case, i.e. if you don't know the server's encoding, you can still recover the original byte string by encoding the name with latin1.
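Both cases can be combined in a small helper. This is only a sketch: `recover_name` and its `server_encoding` parameter are hypothetical names for this example and not part of ftputil. It assumes the name came from ftplib's latin1 decoding; a name containing characters above U+00FF would make the `encode("latin1")` step raise `UnicodeEncodeError`.

```python
def recover_name(name, server_encoding="utf-8"):
    """Return (raw_bytes, text) for a name that was decoded as latin1.

    `text` is None if the bytes are not valid in `server_encoding`
    (an assumed encoding; the server may use something else).
    """
    # Undo the latin1 decoding to get the bytes the server sent.
    raw = name.encode("latin1")
    try:
        text = raw.decode(server_encoding)
    except UnicodeDecodeError:
        # The guess about the server's encoding was wrong; keep only the bytes.
        text = None
    return raw, text


raw, text = recover_name("\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87.txt")
```

Returning the bytes alongside the decoded text means the caller still has something usable when the encoding guess turns out to be wrong.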

I plan to extend the ftputil documentation to clarify what's going on here.

comment:2 Changed 3 years ago by schwa

Milestone: 3.3
Resolution: fixed
Status: new → closed

I fixed and expanded the documentation section "Directory and file names" in [f6d7fe5a44bb].
