For a while now, certain Trac operations have been causing an "Internal Server Error" with no helpful information in the logs. Often the operation actually succeeds despite the error message, but sometimes it doesn't. Even when the operation succeeds, the messages are annoying for all affected users, including me.
Other websites on the same virtual server work fine.
Find a way, at least a workaround, to avoid the errors.
Ideas:
(As it turned out, it was a database issue, so changing the deployment method wouldn't have helped. It was just coincidence that the situation seemed to have gotten this bad because of the switch to FastCGI.)
In the Apache error log, I found the messages

```
[Mon Jun 09 11:18:05 2014] [warn] [client 79.253.29.118] mod_fcgid: read data timeout in 40 seconds, referer: https://ftputil.sschwarzer.net/trac/wiki/Download?action=edit
[Mon Jun 09 11:18:05 2014] [error] [client 79.253.29.118] Premature end of script headers: trac_ftputil.fcgi, referer: https://ftputil.sschwarzer.net/trac/wiki/Download?action=edit
```
Although it's possible to increase the timeout, that doesn't really help: the timeout is still used up, and 40 seconds is already too long.
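For reference, the timeout can be raised in the Apache configuration. A sketch, assuming mod_fcgid 2.3+ (older versions spell the directive `IPCCommTimeout`); as said, this only papers over the problem:

```apache
# Hypothetical vhost fragment. The mod_fcgid default of 40 seconds
# matches the "read data timeout in 40 seconds" message above.
<IfModule mod_fcgid.c>
    FcgidIOTimeout 120
</IfModule>
```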
After enabling logging for the Trac instance, I got
```
2014-06-09 11:34:40,322 Trac[env] ERROR: Exception caught while checking for upgrade:
Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 664, in open_environment
    needs_upgrade = env.needs_upgrade()
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 498, in needs_upgrade
    if participant.environment_needs_upgrade(db):
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 570, in environment_needs_upgrade
    dbver = self.env.get_version(db)
  File "/usr/lib/python2.6/dist-packages/trac/env.py", line 418, in get_version
    (initial and 'initial_' or ''))
  File "/usr/lib/python2.6/dist-packages/trac/db/util.py", line 66, in execute
    return self.cursor.execute(sql)
  File "/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py", line 78, in execute
    result = PyFormatCursor.execute(self, *args)
  File "/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py", line 56, in execute
    args or [])
  File "/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py", line 48, in _rollback_on_error
    return function(self, *args, **kwargs)
OperationalError: database is locked
```
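"Database is locked" is SQLite's standard reaction when one connection wants to write while another still holds the write lock and the busy timeout runs out. A minimal sketch that reproduces it (Python 3 syntax, not Trac's actual code):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "trac.db")

# First connection takes the write lock and keeps its transaction open.
writer = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
writer.execute("CREATE TABLE session (sid text, authenticated integer)")
writer.execute("BEGIN IMMEDIATE")  # acquire the write lock now
writer.execute("INSERT INTO session VALUES ('schwa', 1)")

# Second connection tries to write; timeout=0 means "fail immediately"
# instead of retrying for the default five seconds.
other = sqlite3.connect(db_path, timeout=0)
locked_error = None
try:
    other.execute("INSERT INTO session VALUES ('other', 0)")
except sqlite3.OperationalError as exc:
    locked_error = exc

print(locked_error)  # → database is locked
```

In Trac's case the writer holding the lock is another (slow) request, which is why a bigger timeout merely delays the error.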
The ftputil site currently runs Trac 0.12.2 (with Debian's security fixes I assume). This version is supposed to have the "database is locked" problem solved (see http://trac.edgewall.org/ticket/3446).
http://www.sqlite.org/lockingv3.html mentions in the section "How To Corrupt Your Database Files" that SQLite uses advisory locking which might be broken on some Posix systems for NFS mounts. Since the ftputil website is hosted on a "virtual root" server, I have no control over how the file system visible from inside the environment is mounted. This virtual machine mount may be a reason for the problems we see.
Next, I'll investigate whether I can do anything to improve robustness on the SQLite side. Another, although non-trivial, option would be to migrate to another database system, say PostgreSQL.
By the way, from my experience the following actions cause "Internal Server Error"s relatively often:
- Entering tickets (adding comments to existing tickets doesn't seem to be a problem though)
- Adding new version numbers in the Trac admin interface
- Requesting ticket reports, especially longer ones
One other thing that might be useful info: even if there's an "Internal Server Error", the database operation usually succeeds despite the error message (possibly after a database recovery?). For example, a user accidentally entered a ticket three times because he got a server error for the first two tries and only the third went through. I've seen the same for changes in the Trac admin interface, for example adding a new version number.
Due to the entries in the traceback (see description), I instrumented `/usr/lib/python2.6/dist-packages/trac/db/sqlite_backend.py` like this:

```python
...
try:
    import pysqlite2.dbapi2 as sqlite
    have_pysqlite = 2
except ImportError:
    try:
        import sqlite3 as sqlite
        have_pysqlite = 2
    except ImportError:
        have_pysqlite = 0

# NEW
mylog = open("/var/local/trac/ftputil/log/my.log", "w")

if have_pysqlite == 2:
    # Force values to integers because PySQLite 2.2.0 had (2, 2, '0')
    sqlite_version = tuple([int(x) for x in sqlite.sqlite_version_info])
    sqlite_version_string = sqlite.sqlite_version

    class PyFormatCursor(sqlite.Cursor):
        def _rollback_on_error(self, function, *args, **kwargs):
            try:
                # NEW
                print >> mylog, function, args, kwargs
                mylog.flush()
                return function(self, *args, **kwargs)
            except sqlite.DatabaseError:
                self.cnx.rollback()
                raise
...
```
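As an aside, this kind of statement logging can nowadays be had without patching Trac: the `sqlite3` module (Python 3.3+) offers a trace callback that receives the text of every statement SQLite executes. A sketch, not what I used at the time:

```python
import sqlite3

statements = []

conn = sqlite3.connect(":memory:")
# Call statements.append with every SQL statement SQLite runs
# on this connection.
conn.set_trace_callback(statements.append)

conn.execute("CREATE TABLE session_attribute "
             "(sid text, authenticated integer, name text, value text)")
conn.execute("SELECT name,value FROM session_attribute "
             "WHERE sid=? and authenticated=?", ("schwa", 1))

for statement in statements:
    print(statement)
```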
When inspecting the generated log file `my.log` after an Internal Server Error, I noticed that the last entry was

```
<method 'execute' of 'sqlite3.Cursor' objects> ('\n        SELECT name,value FROM session_attribute\n        WHERE sid=? and authenticated=?\n        ', (u'schwa', 1)) {}
```
I checked the database and found that the table `session` had nearly half a million rows and `session_attribute` over a million. A "query" for the indexes gives

```
sqlite> .schema session
CREATE TABLE session (
    sid text,
    authenticated integer,
    last_visit integer,
    UNIQUE (sid,authenticated)
);
CREATE INDEX session_authenticated_idx ON session (authenticated);
CREATE INDEX session_last_visit_idx ON session (last_visit);
sqlite> .schema session_attribute
CREATE TABLE session_attribute (
    sid text,
    authenticated integer,
    name text,
    value text,
    UNIQUE (sid,authenticated,name)
);
```
That is, `session_attribute` doesn't have any explicit index, so the `SELECT` query from the custom log file might take very long.

After checking http://trac.edgewall.org/wiki/TracDev/DatabaseSchema and making a backup of the database, I stopped the webserver and deleted all rows from `session_attribute`, `session` and `auth_cookie`. After restarting the webserver, I could delete bogus wiki page versions easily. (These page versions had been created while changing and saving the Download page, which at the time always gave me an Internal Server Error. Nor had I been able to delete the old page versions before without causing an Internal Server Error.)
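The cleanup itself is just three DELETE statements. A self-contained sketch against a throwaway in-memory database (the table layouts are simplified stand-ins for Trac's schema; on the real trac.db, stop the webserver and make a backup first, as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real trac.db
conn.executescript("""
    CREATE TABLE session (sid text, authenticated integer, last_visit integer,
                          UNIQUE (sid,authenticated));
    CREATE TABLE session_attribute (sid text, authenticated integer, name text,
                                    value text, UNIQUE (sid,authenticated,name));
    CREATE TABLE auth_cookie (cookie text, name text, ipnr text, time integer);
    INSERT INTO session VALUES ('anonymous-1', 0, 1402306445);
    INSERT INTO session_attribute VALUES ('anonymous-1', 0, 'name', 'nobody');
""")

# Wipe the session-related tables.
for table in ("session_attribute", "session", "auth_cookie"):
    conn.execute("DELETE FROM " + table)
conn.commit()

for table in ("session_attribute", "session", "auth_cookie"):
    count = conn.execute("SELECT count(*) FROM " + table).fetchone()[0]
    print(table, count)  # → 0 rows for each table
```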
I'm considering adding indexes for `sid` and `authenticated` to the `session_attribute` table to speed up queries even when there are many rows in the table. However, I'll first keep an eye on the table to see whether, and how fast, it fills up again.
About five hours later the number of sessions was about 350. I just added indexes for the above-mentioned query to `session_attribute`:

```
sqlite> create index session_attribute_sid_idx on session_attribute (sid);
sqlite> create index session_attribute_authenticated_idx on session_attribute (authenticated);
```
For what it's worth, I discovered that the `trac-admin` command line program has a command to delete sessions older than a specific age:

```
trac-admin .. help session purge

session purge <age>

    Purge all anonymous sessions older than the given age

    Age may be specified as a relative time like "90 days ago",
    or in YYYYMMDD format.
```
I might put this into a cronjob.
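A hypothetical crontab entry for that (the environment path is an assumption based on the log path above; adjust to the actual installation):

```
# Purge anonymous Trac sessions older than 90 days, every night at 03:15.
15 3 * * * trac-admin /var/local/trac/ftputil session purge "90 days ago"
```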
I tried running the query

```
SELECT name,value FROM session_attribute WHERE sid='schwa' and authenticated=1;
```

(basically the "blocked" query from comment 3) on a backup database with the half million sessions. Actually, the query takes only about 1 millisecond, both without and with the indexes on `session_attribute`.
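A plausible explanation for the 1 ms, though I haven't verified it on the production database: the `UNIQUE (sid,authenticated,name)` constraint already creates an automatic index whose `sid`/`authenticated` prefix serves this query, so the extra indexes add nothing. `EXPLAIN QUERY PLAN` makes this visible; a sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same definition as in the Trac schema above.
conn.execute("""
    CREATE TABLE session_attribute (
        sid text,
        authenticated integer,
        name text,
        value text,
        UNIQUE (sid,authenticated,name)
    )""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT name,value FROM session_attribute "
    "WHERE sid=? and authenticated=?", ("schwa", 1)).fetchall()
for row in plan:
    # e.g. "SEARCH ... USING INDEX sqlite_autoindex_session_attribute_1 ..."
    print(row)
```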
I just tried an experiment:
- Stop webserver
- Make database backup
- Copy the backup of the old database, with the big `session`/`session_attribute` tables and without indexes, over the production database
- Restart webserver
- Try to modify download page -> got an "Internal Server Error" (as expected)
- Stop webserver
- Create indexes on `session_attribute`
- Start webserver
- Try to modify download page -> again got an "Internal Server Error" (not expected)
- (Restored old database and restarted webserver)
So it seems the indexes don't make a difference in the occurrence of "Internal Server Error"s, but the sizes of the session tables do.