For a while now, certain Trac operations have been causing an "Internal Server Error" with no helpful information in the logs. Often the operation actually succeeds despite the error message, but sometimes it doesn't. Even when the operation succeeds, the messages are annoying for all affected users, including me.
Other websites on the same virtual server work fine.
Find a way, at least a workaround, to avoid the errors.
Ideas:
(As it turned out, it was a database issue, so changing the deployment method wouldn't have helped. It was just coincidence that the situation seemed to have gotten this bad because of the switch to FastCGI.)
In the Apache error log, I found the messages

```
[Mon Jun 09 11:18:05 2014] [warn] [client 79.253.29.118] mod_fcgid: read data timeout in 40 seconds, referer: https://ftputil.sschwarzer.net/trac/wiki/Download?action=edit
[Mon Jun 09 11:18:05 2014] [error] [client 79.253.29.118] Premature end of script headers: trac_ftputil.fcgi, referer: https://ftputil.sschwarzer.net/trac/wiki/Download?action=edit
```
Although it's possible to increase the timeout, that doesn't really help: the timeout is still used up, and 40 seconds is already too long.
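For reference, the timeout can be raised in the Apache configuration. A sketch, assuming mod_fcgid 2.3+ (older versions spell the directive `IPCCommTimeout`); as said, this only papers over the problem:

```apache
# Hypothetical vhost fragment. The mod_fcgid default of 40 seconds
# matches the "read data timeout in 40 seconds" message above.
<IfModule mod_fcgid.c>
    FcgidIOTimeout 120
</IfModule>
```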
After enabling logging for the Trac instance, I got
```
2014-06-09 11:34:40,322 Trac[env] ERROR: Exception caught while checking for upgrade:
Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 664, in open_environment
    needs_upgrade = env.needs_upgrade()
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 498, in needs_upgrade
    if participant.environment_needs_upgrade(db):
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 570, in environment_needs_upgrade
    dbver = self.env.get_version(db)
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 418, in get_version
    (initial and 'initial_' or ''))
  File "/usr/lib/python2.6/dist-packages/trac/db/util.py", line 66, in execute
    return self.cursor.execute(sql)
  File "/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py", line 78, in execute
    result = PyFormatCursor.execute(self, *args)
  File "/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py", line 56, in execute
    args or [])
  File "/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py", line 48, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked
```
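"Database is locked" is SQLite's standard reaction when one connection wants to write while another still holds the write lock and the busy timeout runs out. A minimal sketch that reproduces it (Python 3 syntax, not Trac's actual code):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "trac.db")

# First connection takes the write lock and keeps its transaction open.
writer = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE session (sid text, authenticated integer)")
writer.execute("BEGIN IMMEDIATE")  # acquire the write lock now
writer.execute("INSERT INTO session VALUES ('schwa', 1)")

# Second connection tries to write; timeout=0 means "fail immediately"
# instead of retrying for the default five seconds.
other = sqlite3.connect(db_path, timeout=0)
locked_error = None
try:
    other.execute("INSERT INTO session VALUES ('other', 0)")
except sqlite3.OperationalError as exc:
    locked_error = exc

print(locked_error)  # → database is locked
```

In Trac's case the writer holding the lock is another (slow) request, which is why a bigger timeout merely delays the error.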
The ftputil site currently runs Trac 0.12.2 (with Debian's security fixes I assume). This version is supposed to have the "database is locked" problem solved (see http://trac.edgewall.org/ticket/3446).
http://www.sqlite.org/lockingv3.html mentions in the section "How To Corrupt Your Database Files" that SQLite uses advisory locking which might be broken on some Posix systems for NFS mounts. Since the ftputil website is hosted on a "virtual root" server, I have no control over how the file system visible from inside the environment is mounted. This virtual machine mount may be a reason for the problems we see.
Next, I'll investigate whether I can do anything to improve robustness on the SQLite side. Another, although non-trivial, option would be to migrate to another database system, say PostgreSQL.
By the way, from my experience the following actions cause "Internal Server Error"s relatively often:
- Entering tickets (adding comments to existing tickets doesn't seem to be a problem though)
- Adding new version numbers in the Trac admin interface
- Requesting ticket reports, especially longer ones
One other thing that might be useful info: even if there's an "Internal Server Error", the database operation usually succeeds despite the error message (possibly after a database recovery?). For example, a user accidentally entered a ticket three times because he got a server error for the first two tries and only the third went through. I've seen the same for changes in the Trac admin interface, for example adding a new version number.
Due to the entries in the traceback (see description), I instrumented `/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py` like this:

```python
...
try:
    import pysqlite2.dbapi2 as sqlite
    have_pysqlite = 2
except ImportError:
    try:
        import sqlite3 as sqlite
        have_pysqlite = 2
    except ImportError:
        have_pysqlite = 0

# NEW
mylog = open("/var/local/trac/ftputil/log/my.log", "w")

if have_pysqlite == 2:
    # Force values to integers because PySQLite 2.2.0 had (2, 2, '0')
    sqlite_version = tuple([int(x) for x in sqlite.sqlite_version_info])
    sqlite_version_string = sqlite.sqlite_version

    class PyFormatCursor(sqlite.Cursor):
        def _rollback_on_error(self, function, *args, **kwargs):
            try:
                # NEW
                print >> mylog, function, args, kwargs
                mylog.flush()
                return function(self, *args, **kwargs)
            except sqlite.DatabaseError:
                self.cnx.rollback()
                raise
...
```
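As an aside, this kind of statement logging can nowadays be had without patching Trac: the `sqlite3` module (Python 3.3+) offers a trace callback that receives the text of every statement SQLite executes. A sketch, not what I used at the time:

```python
import sqlite3

statements = []

conn = sqlite3.connect(":memory:")
# Call statements.append with every SQL statement SQLite runs
# on this connection.
conn.set_trace_callback(statements.append)

conn.execute("CREATE TABLE session_attribute "
             "(sid text, authenticated integer, name text, value text)")
conn.execute("SELECT name,value FROM session_attribute "
             "WHERE sid=? and authenticated=?", ("schwa", 1))

for statement in statements:
    print(statement)
```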
When inspecting the generated log file `my.log` after an Internal Server Error, I noticed that the last entry was

```
<method 'execute' of 'sqlite3.Cursor' objects> ('\n        SELECT name,value FROM session_attribute\n        WHERE sid=? and authenticated=?\n        ', (u'schwa', 1)) {}
```
I checked the database and found that the table `session` had nearly half a million rows and `session_attribute` over a million. A "query" for the indexes gives

```
sqlite> .schema session
CREATE TABLE session (
    sid text,
    authenticated integer,
    last_visit integer,
    UNIQUE (sid,authenticated)
);
CREATE INDEX session_authenticated_idx ON session (authenticated);
CREATE INDEX session_last_visit_idx ON session (last_visit);
sqlite> .schema session_attribute
CREATE TABLE session_attribute (
    sid text,
    authenticated integer,
    name text,
    value text,
    UNIQUE (sid,authenticated,name)
);
```
That is, `session_attribute` doesn't have any explicit index, so the `SELECT` query from the custom log file might take very long.

After checking http://trac.edgewall.org/wiki/TracDev/DatabaseSchema and making a backup of the database, I stopped the webserver and deleted all rows from `session_attribute`, `session` and `auth_cookie`. After restarting the webserver, I could delete bogus wiki page versions easily. (These page versions had been created while changing and saving the Download page, which at the time always gave me an Internal Server Error. Nor had I been able to delete the old page versions before without causing an Internal Server Error.)
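The cleanup itself is just three DELETE statements. A self-contained sketch against a throwaway in-memory database (the table layouts are simplified stand-ins for Trac's schema; on the real trac.db, stop the webserver and make a backup first, as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real trac.db
conn.executescript("""
    CREATE TABLE session (sid text, authenticated integer, last_visit integer,
                          UNIQUE (sid,authenticated));
    CREATE TABLE session_attribute (sid text, authenticated integer, name text,
                                    value text, UNIQUE (sid,authenticated,name));
    CREATE TABLE auth_cookie (cookie text, name text, ipnr text, time integer);
    INSERT INTO session VALUES ('anonymous-1', 0, 1402306445);
    INSERT INTO session_attribute VALUES ('anonymous-1', 0, 'name', 'nobody');
""")

# Wipe the session-related tables.
for table in ("session_attribute", "session", "auth_cookie"):
    conn.execute("DELETE FROM " + table)
conn.commit()

for table in ("session_attribute", "session", "auth_cookie"):
    count = conn.execute("SELECT count(*) FROM " + table).fetchone()[0]
    print(table, count)  # → 0 rows for each table
```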
I'm considering adding indexes for `sid` and `authenticated` to the `session_attribute` table to speed up queries even when there are many rows in the table. However, I'll first keep an eye on the table to see whether, and how fast, it fills up again.
About five hours later the number of sessions was about 350. I just added indexes for the above-mentioned query to `session_attribute`:

```
sqlite> create index session_attribute_sid_idx on session_attribute (sid);
sqlite> create index session_attribute_authenticated_idx on session_attribute (authenticated);
```
For what it's worth, I discovered that the `trac-admin` command line program has a command to delete sessions older than a specific age:

```
trac-admin .. help session purge

session purge <age>

    Purge all anonymous sessions older than the given age

    Age may be specified as a relative time like "90 days ago",
    or in YYYYMMDD format.
```
I might put this into a cronjob.
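A hypothetical crontab entry for that (the environment path is an assumption based on the log path above; adjust to the actual installation):

```
# Purge anonymous Trac sessions older than 90 days, every night at 03:15.
15 3 * * * trac-admin /var/local/trac/ftputil session purge "90 days ago"
```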
I tried running the query

```
SELECT name,value FROM session_attribute WHERE sid='schwa' and authenticated=1;
```

(basically the "blocked" query from comment 3) on a backup database with the half million sessions. Actually, the query takes only about 1 millisecond, both without and with the indexes on `session_attribute`.
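A plausible explanation for the 1 ms, though I haven't verified it on the production database: the `UNIQUE (sid,authenticated,name)` constraint already creates an automatic index whose `sid`/`authenticated` prefix serves this query, so the extra indexes add nothing. `EXPLAIN QUERY PLAN` makes this visible; a sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same definition as in the Trac schema above.
conn.execute("""
    CREATE TABLE session_attribute (
        sid text,
        authenticated integer,
        name text,
        value text,
        UNIQUE (sid,authenticated,name)
    )""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name,value FROM session_attribute "
    "WHERE sid=? and authenticated=?", ("schwa", 1)).fetchall()
for row in plan:
    # e.g. "SEARCH ... USING INDEX sqlite_autoindex_session_attribute_1 ..."
    print(row)
```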
I just tried an experiment:
- Stop webserver
- Make database backup
- Copy the backup of the old database, with the big `session`/`session_attribute` tables and without indexes, over the production database
- Restart webserver
- Try to modify download page -> got an "Internal Server Error" (as expected)
- Stop webserver
- Create indexes on `session_attribute`
- Start webserver
- Try to modify download page -> again got an "Internal Server Error" (not expected)
- (Restored old database and restarted webserver)
So it seems the indexes don't make a difference in the occurrence of "Internal Server Error"s, but the sizes of the session tables do.