\documentclass[twocolumn,a4paper,10pt]{article}
\usepackage{a4wide}
\usepackage{float}
\usepackage{ifthen}

\floatstyle{ruled}
\newfloat{Figure}{h}{lop}
\floatstyle{boxed}

\newcommand\debbugs{{\tt debbugs}}
\newcommand\bs{$\backslash$}

\title{\debbugs{} -- Tips, Tricks and Hacks\footnote{Copyright
\copyright{} 2005 Anthony Towns. This document may be redistributed
and/or derived from under the terms of the Creative Commons
Attribution-ShareAlike license, version 2.5.}} \author{Anthony Towns}
\date{15th June, 2005}

\begin{document}
\maketitle

\begin{abstract}

\footnotesize

This paper aims to serve as a useful reference for people attending the
talk of the same title at DebConf~5, to be held in Helsinki, Finland from
the 9th to the 16th of July 2005. It summarises the primary motivations
behind the design philosophy of \debbugs{}, the on-disk data formats
\debbugs{} uses, and the overall structure of the code. It aims to
provide sufficient background on the current status of the \debbugs{}
codebase for the interested reader to use as a basis for beginning
to hack on it. Basic familiarity with \debbugs{}
from a user's perspective is assumed.

\end{abstract}

\section{Introduction}

The Debian bug tracking system (BTS) software, \debbugs{}, is the product
of over ten years of patch accumulation. It was initially coded in 1994
by Ian Jackson, and has been developed since then by a rogues' gallery of
hackers including Darren Benham, Adam Heath, Josip Rodin, Colin Watson,
Robert Larson, and your author. Despite its use by nCipher and for a
while by both Gnome and KDE, \debbugs{} has remained primarily focussed
on being satisfactory as the Debian BTS rather than adding features to
ensure its utility as a general-purpose bug tracking tool. As a result,
\debbugs{} has lagged behind other bug-trackers in a variety of areas,
but by and large remained superior to competitors in the areas most
important to Debian.

The most important requirement for a BTS to function for Debian is an
efficient interface, optimised for the habits of Debian developers. An
email interface is thus expected in order to allow efficient notification
and (especially distant and disconnected) manipulation of bugs. A web
interface that provides easy anonymous browsing of bug information is also
important. Efficient programmatic access to bug reports is also desired,
though is not as crucial a requirement as email-based notification and
control or web access.

Debian has a very clear line of demarcation for categorising bugs:
the package. This is somewhat unusual, in that for many bug trackers
this is either too coarse a distinction (particularly for upstreams who
may only deal with one or two packages in any case, or who may want to
assign particular feature areas to particular developers), or too fine a
distinction (e.g., where developers don't have strict lines of ownership),
or simply entirely irrelevant (e.g., where the interesting issues to track
are feature implementations that cross ownership boundaries). As such,
most BTS software goes to greater lengths than \debbugs{} in offering
hierarchical categorisations for bugs, while \debbugs{} is mostly
focussed on providing an efficient view for maintainers, and dealing
with occasional cross-package issues such as release critical bugs,
installation issues, and policy development.

While Debian doesn't have a lot of expectations from its BTS, it does
expect it to meet those expectations solidly: at the time of writing,
Debian's instance of \debbugs{} is tracking over 55,000 active bugs,
231,000 archived bugs, and has dealt with more than 314,000 bug reports
over its lifetime to date. Further, approximately a thousand new bug
reports are filed each week at present. The \debbugs{} webpages reflect
status changes and updates as soon as they're processed, and downtime
of even a few minutes is rare. While the asynchronous nature of email
serves to mitigate the consequences of downtime or delays in processing
updates and bug status changes (both by managing developer expectations
of a response, and by allowing for delayed commands to complete anyway),
downtime or delays that last more than a few hours are likewise frowned
upon and cause difficulties. This requirement is especially difficult to
work around, since the global nature of Debian doesn't provide extended
slack periods. As a result, most changes must be incremental, so that
they can be applied directly to the installed system as it continues to
run, and any transitional measures must be a part of the main \debbugs{}
software.

Debian also expects its BTS to serve as a permanent public record of
activities and discussions, and additionally expects it to be trivially
easy for anyone to participate in the discussion and analysis of
bugs. This limits the amount of authentication \debbugs{} can require,
and has so far resulted in \debbugs{} doing no user authentication,
instead relying on the good will of the user community and a policy of not
allowing irreversible changes. With the exception of an increasing problem
with spam, this has worked surprisingly well. It has led \debbugs{}
in a different direction to most other BTS software, however, which
usually provides accounts and limits modifications to registered users.

In summary, the core requirements underlying \debbugs{} are:

\begin{itemize}
\item Familiar interface -- it must allow developers to interact with the
system using email, and provide all the data to anyone with a web browser.

\item Package based -- it must quickly and easily handle separating bug
reports by package.

\item Scalability -- it must cope with a large number of bug reports.

\item Immediacy -- it must provide immediate feedback on the state of bugs,
and handle requested changes promptly.

\item Reliability -- it must continue working at all times, even as new
features are trialled.

\item Public -- it must provide a permanent public record of discussions,
and allow the members of the wider Debian community to participate in
those discussions with absolutely minimal effort.
\end{itemize}

\section{Data Format}

Despite a number of attempts to convert \debbugs{} to use an RDBMS
backend, the \debbugs{} data storage format remains stubbornly text based.

\begin{Figure}
\caption{Spool Layout}
{\footnotesize
\begin{verbatim}
/org/bugs.debian.org/spool/
  incoming/
    T.*
    S[BMQFDU RC]*.*
    R[BMQFDU RC]*.*
    I[BMQFDU RC]*.*
    G[BMQFDU RC]*.*
    P[BMQFDU RC]*.*

  db-h/
    00/
      ...
      314200.log
      314200.report
      314200.status
      314200.summary
    ...
    99/
  archive/
    00/ .. 99/

  index.db       -> index.db.realtime
  index.archive  -> index.archive.realtime
  nextnumber
\end{verbatim}
}
\end{Figure}

The interesting action all takes place in the \debbugs{} spool directory
(see Figure~1), in particular three key subdirectories: {\tt db-h},
{\tt archive}, and {\tt incoming}. Also in the spool directory are a
couple of simple index files for the web interface, and a file caching
the next unassigned bug number.

The {\tt db-h} and {\tt archive} directories are hashed based on the
last two digits of the bug number in order to reduce the number of
files in any one directory to a more manageable amount -- the {\tt db-h}
directory is so named so that it could be implemented while retaining the
old, unhashed, {\tt db} directory for a transitional period. Fortunately,
improvements in the Linux kernel and the hardware Debian runs \debbugs{}
on have made this hashing less necessary. However, another level of
hashing may yet be needed as \debbugs{} accumulates
ever more bug reports.
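
As a rough illustration of the hashing scheme, the following Python
sketch maps a bug number to the files that represent it. The function
name {\tt bug\_files} is hypothetical, and both the zero-padding of the
hash directory and the use of the same four file types under {\tt
archive} are assumptions; the spool path is the one shown in Figure~1.

{\footnotesize
\begin{verbatim}
import posixpath

SPOOL = "/org/bugs.debian.org/spool"

def bug_files(bug, archived=False, spool=SPOOL):
    """Return the paths of the four files for a bug."""
    # Hash on the last two digits of the bug number;
    # zero-padding (7 -> "07") is assumed here.
    bucket = "%02d" % (bug % 100)
    top = "archive" if archived else "db-h"
    base = posixpath.join(spool, top, bucket, str(bug))
    exts = ("log", "report", "status", "summary")
    return {ext: base + "." + ext for ext in exts}
\end{verbatim}
}

With these assumptions, {\tt bug\_files(314200)} yields paths under
{\tt db-h/00/}, matching the layout in Figure~1.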

Each individual bug is represented by four files. The summary file and
the log file are the most important, respectively storing metadata about
the bug (its current severity, who filed it, which package it is filed
against, etc) and a log of all the emails sent to the bug. The status
file was the original version of the summary file, and is only being
preserved for script compatibility. The report file is a copy of the
initial email that opened the bug, which will be sent out when the bug
is closed, along with the closing message.

\subsection{Incoming}

Incoming emails are dumped into the {\tt incoming} directory by the mail
server (via the {\tt receive} script) according to a specific naming
scheme. The first letter indicates how far the mail has made it through
the incoming process, as per Figure~2. When the scripts crash, which given
the number of mails they process isn't as infrequent as we might like,
some of these files will be left lying around in the incoming directory.

\begin{Figure}
\caption{States During Incoming Processing}
{\footnotesize
\begin{description}
\item[{\tt T}] being spooled by {\tt receive}
\item[{\tt S}] waiting to be checked for spam
\item[{\tt R}] currently being checked for spam
\item[{\tt I}] passed spam check, awaiting processing
\item[{\tt G}] passed on to {\tt service} or {\tt process} script
\item[{\tt P}] being processed
\end{description}
}
\end{Figure}

The second letter of the filename indicates to \debbugs{} which address
the mail was sent to, as per Figure~3. For email addresses that include a
bug number (such as {\tt 123456@bugs.d.o} or {\tt 123456-done@bugs.d.o}),
the relevant bug number follows the letter. The remainder of the filename
consists of a dot and a unique id calculated when the mail is received,
presently consisting of a timestamp and pid.

\begin{Figure}
\caption{Addresses Handled by {\tt process}}
{\footnotesize
\begin{description}
\item[{\tt B}] Normal bug submissions ({\tt submit@}, {\tt 1234@})
\item[{\tt M}] Don't send to mailing lists ({\tt -maintonly})
\item[{\tt Q}] Only store in the BTS ({\tt -quiet})
\item[{\tt F}] Bug is forwarded upstream ({\tt -forwarded})
\item[{\tt D}] Bug is dealt with ({\tt -done})
\item[{\tt U}] Forward mail to submitter ({\tt -submitter})
\item[{\tt R}] User's request interface ({\tt request@})
\item[{\tt C}] Developer's control interface ({\tt control@})
\end{description}
}
\end{Figure}
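
The naming scheme can be unpicked mechanically. The Python sketch below
splits an incoming filename into its state letter (Figure~2), address
letter (Figure~3), optional bug number and unique id. The regular
expression and the function name {\tt parse\_incoming} are illustrative
assumptions rather than anything taken from the \debbugs{} source, and
no particular layout of the unique id is assumed.

{\footnotesize
\begin{verbatim}
import re

STATES = {
    "S": "waiting to be checked for spam",
    "R": "currently being checked for spam",
    "I": "passed spam check",
    "G": "passed on to service or process",
    "P": "being processed",
}

# State letter, address letter, optional bug number,
# a dot, then the unique id (taken as opaque here).
NAME = re.compile(r"^([SRIGP])([BMQFDURC])(\d*)\.(.+)$")

def parse_incoming(name):
    """Split an incoming spool filename into parts."""
    m = NAME.match(name)
    if m is None:
        return None  # e.g. T.* files still being spooled
    state, addr, bug, uid = m.groups()
    return {
        "state": STATES[state],
        "address": addr,
        "bug": int(bug) if bug else None,
        "id": uid,
    }
\end{verbatim}
}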

\subsection{Status and Summary files}

The status format is line based, with each line given a specific
meaning. If the file does not have all ten lines, the last fields are
taken as being empty. As noted previously, this is a legacy format that
is in the process of being phased out. See Figure~4 for details on the
interpretation of this format.

\begin{Figure}
\caption{{\tt .status} Format Interpretation}
{\footnotesize
\begin{enumerate}
\item Submitter email address
\item Date in seconds since the epoch
\item Subject
\item Message-ID of original report
\item Package(s) bug is assigned to
\item Tags
\item Email address of bug closer (if closed)
\item Email address or url of upstream (if forwarded)
\item Bugs this bug is merged with
\item Severity of bug
\end{enumerate}
}
\end{Figure}
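
A minimal Python sketch of reading this format follows. The field
names are invented for illustration (the format itself carries none),
and the padding step implements the rule that missing trailing lines
are taken as empty.

{\footnotesize
\begin{verbatim}
# Hypothetical names for the ten lines of Figure 4.
STATUS_FIELDS = (
    "submitter", "date", "subject", "message_id",
    "package", "tags", "done", "forwarded",
    "merged_with", "severity",
)

def parse_status(text):
    """Map the lines of a .status file to named fields."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # empty piece after the final newline
    # Missing trailing lines are treated as empty fields.
    lines += [""] * (len(STATUS_FIELDS) - len(lines))
    return dict(zip(STATUS_FIELDS, lines))
\end{verbatim}
}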

The status format is being phased out in favour of the summary file
format, which is formatted in the usual, Debian-favoured, RFC822 format,
and as such is pleasantly extensible. There are currently two versions of
the summary file format, indicated using the {\tt Format-Version:} field:
versions {\tt 2} and {\tt 3}. The difference between the two versions is
that version {\tt 3} stores the submitter, subject, closed-by, forwarded
and owner fields in RFC1522-decoded format (the status format is treated
as version {\tt 1}, so {\tt Format-Version:~1} is self-contradictory). The
fields currently recognised by \debbugs{} are listed in Figure~5.

\begin{Figure}
\caption{Summary Format Interpretation}
{\footnotesize
\begin{description}
\item[{\tt Format-Version:}] Version of file, either {\tt 2} or {\tt 3}
\item[{\tt Submitter:}] Submitter email address
\item[{\tt Date:}] Date in seconds since the epoch
\item[{\tt Subject:}] Subject
\item[{\tt Message-ID:}] Message-ID of original report
\item[{\tt Package:}] Package(s) bug is assigned to
\item[{\tt Tags:}] Tags
\item[{\tt Done:}] Email address of bug closer (if closed)
\item[{\tt Forwarded-To:}] URL or email address of upstream (if forwarded)
\item[{\tt Merged-With:}] Bugs this bug is merged with
\item[{\tt Severity:}] Severity of bug
\item[{\tt Owner:}] Owner of the bug
\end{description}
}
\end{Figure}
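
Because the summary format follows RFC822 conventions, a generic
header parser can read it. The Python sketch below uses the standard
{\tt email} module and, for version {\tt 2} files, decodes the fields
that version {\tt 3} stores already decoded. The function name {\tt
parse\_summary} is hypothetical, and the mapping of ``closed-by'' and
``forwarded'' onto the {\tt Done:} and {\tt Forwarded-To:} fields is
inferred from Figure~5 rather than taken from the \debbugs{} source.

{\footnotesize
\begin{verbatim}
from email import message_from_string
from email.header import decode_header, make_header

# Fields that version 3 stores already decoded
# (mapping to field names inferred from Figure 5).
ENCODED_IN_V2 = ("Submitter", "Subject", "Done",
                 "Forwarded-To", "Owner")

def parse_summary(text):
    """Parse a .summary file into a dict of fields."""
    msg = message_from_string(text)
    version = msg.get("Format-Version", "").strip()
    if version not in ("2", "3"):
        raise ValueError("unsupported Format-Version")
    fields = dict(msg.items())
    if version == "2":
        for name in ENCODED_IN_V2:
            if name in fields:
                raw = decode_header(fields[name])
                fields[name] = str(make_header(raw))
    return fields
\end{verbatim}
}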

\subsection{Log files}

While the crufty old {\tt .status} files have a shiny new replacement,
the same isn't true of the venerable \debbugs{} {\tt .log} files. The
log files are an append-only record of every mail sent to a particular
bug, accompanied by some metadata to indicate where the mail was sent,
or, in the case of mails to {\tt control@} or {\tt nnnn-done@}, what
actions resulted from the mail.

Unfortunately, the ``metadata'' is just the raw HTML notes included in
the web pages, which isn't amenable to translation or parsing. Further,
as the messages have changed over the history of \debbugs{}, the text
in old bugs has not been updated, making it difficult to offer any
flexibility in how bugs are displayed.

Even beyond this, the log format is rather arcane. It consists of
a sequence of blocks of data, separated by control characters on a
line of their own. Transitions between states are limited, and escape
characters in received mails need to be escaped as well (done by
prefixing {\tt \bs{}030} to worrisome lines). The states are summarised in
Figure~6, including the control code that indicates a transition into
that state; but mostly you're better off looking at the POD documentation
in {\tt Debbugs::Log} or at the parsing code directly.

\begin{Figure}
\caption{Log file states}
{\footnotesize
\begin{description}
\item[{\tt kill-init}] No lines have been processed yet

\item[{\tt incoming-recv}] Received: line from incoming emails, followed by {\tt go} state ({\tt \bs{}07})
\item[{\tt autocheck}] Miscellaneous (ignored) lines up to an X-Debian-Bugs..:~autoforward line, followed by {\tt autowait} state ({\tt \bs{}01})
\item[{\tt html}]      Raw \debbugs{} generated HTML to be displayed verbatim ({\tt \bs{}06})
\item[{\tt recips}]    Recipients of mail, separated by {\tt \bs{}04} characters ({\tt \bs{}02})

\item[{\tt go}]        Lines of text from email ({\tt \bs{}05})
\item[{\tt go-nox}]    Lines of text from email, preceded by an ``X'' character

\item[{\tt kill-end}]  End of a complete message, possibly end of log ({\tt \bs{}03})

\item[{\tt autowait}]  Miscellaneous (ignored) lines up to a blank line, followed by {\tt go-nox} state.
\end{description}
}
\end{Figure}
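
To give a feel for the block structure, here is a deliberately
simplified Python sketch that groups the lines of a log into blocks
keyed by the control codes of Figure~6. It is not a substitute for
{\tt Debbugs::Log}: it assumes each control code sits alone on its own
line, it does not model the {\tt autowait} and {\tt go-nox} states
(which are entered without a control code), and the name {\tt
split\_log} is hypothetical.

{\footnotesize
\begin{verbatim}
# Control codes from Figure 6, in octal notation.
STATE_FOR_CODE = {
    "\07": "incoming-recv",
    "\01": "autocheck",
    "\06": "html",
    "\02": "recips",
    "\05": "go",
    "\03": "kill-end",
}
ESCAPE = "\030"  # prefixed to worrisome lines

def split_log(text):
    """Group log lines into (state, lines) blocks."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # empty piece after the final newline
    blocks = []
    current = None  # no control line seen yet
    for line in lines:
        if line in STATE_FOR_CODE:
            current = (STATE_FOR_CODE[line], [])
            blocks.append(current)
        elif current is not None:
            if line.startswith(ESCAPE):
                line = line[1:]  # undo the escaping
            current[1].append(line)
    return blocks
\end{verbatim}
}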

\subsection{Index files}

There are two fairly simple indexes used by the {\tt pkgreport.cgi} script
to determine which bugs are relevant for a given package, submitter,
tag or severity; one for active bugs, and one for archived bugs. These
are very simplistic text files, with one line per bug containing some
of the basic summary information about that bug. For simplistic analyses
of bug information, these indices are often sufficient.

The format of the file is very simple. It consists of the package name,
the bug number, the bug date in seconds since the epoch, the bug state
(open, forwarded or done), the submitter email enclosed in square
brackets, the severity, and any tags. Each field is separated by spaces,
but note that the email address is user supplied, and may contain both
spaces and closing square brackets.
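
The caveat about the submitter field matters when parsing. The Python
sketch below is one way to cope with it, under the assumptions that
the first four fields contain no spaces and that neither the severity
nor the tags contain a closing square bracket followed by a space; the
function name {\tt parse\_index\_line} is hypothetical.

{\footnotesize
\begin{verbatim}
def parse_index_line(line):
    """Parse one line of an index file."""
    head = line.rstrip("\n").split(" ", 4)
    if len(head) < 5 or not head[4].startswith("["):
        raise ValueError("unexpected index line")
    package, bug, date, state, rest = head
    # The submitter may contain spaces and "]", so look
    # for the last "] " rather than the first.
    end = rest.rfind("] ")
    if end < 0:
        raise ValueError("unterminated submitter")
    words = rest[end + 2:].split()
    return {
        "package": package,
        "bug": int(bug),
        "date": int(date),
        "state": state,
        "submitter": rest[1:end],
        "severity": words[0] if words else "",
        "tags": words[1:],
    }
\end{verbatim}
}

Searching from the right keeps a stray bracket inside the email
address from ending the submitter field early, though it remains a
heuristic rather than a guarantee.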

In the past, the \debbugs{} CGIs had a couple of more efficient
indices available for their use, namely {\tt by-package.idx} and {\tt
by-severity.idx}, which were DB files that provided constant time lookup
for all the bug numbers relevant for a particular package or severity.
Unfortunately the generation of these indices was never mainlined into
\debbugs{} proper. Fortunately, however, the current \debbugs{} server
Debian uses is fast enough that it took months before anyone even noticed
that the faster indices were no longer being used.

The index files aren't extensible, and are not particularly efficient
or even particularly easy or safe to use.

\section{Code Structure}

As you would expect of code whose major design strategy is patch
accumulation, the \debbugs{} code is noteworthy not so much for its
structure, as its lack thereof. The simplest dividing line is between
the core \debbugs{} scripts that process bug emails and maintain the
bug database, and the CGI scripts used for viewing the bug data.

All the scripts take their configuration information from files in {\tt
/etc/debbugs}, though in some cases items that should be configurable
(and in particular that should be changed for non-Debian uses) are not.

\subsection{Core Scripts}

The core scripts are listed in Figure~7. The scripts are reasonably
well separated by functionality, and data passed around is limited to
files in the incoming spool, and the bug database itself, leading to
reasonably loose coupling. Each of the scripts does, however, import the
{\tt errorlib} functions, which provides common functionality, such as bug
locking, additions to the bug logs, parsing of summary files, and updating
of the index files. {\tt errorlib} as it is currently used is impressively
poorly named, and fails a number of common expectations of clean code:
for instance, the bug summary query function uses global variables to
communicate with its caller instead of simply returning a hash or array.

\begin{Figure}
\caption{Core Scripts}
{\footnotesize
\begin{description}
\item[{\tt errorlib}]    function library for scripts
\item[{\tt receive}]     receive an email from the MTA
\item[{\tt spamscan}]    spam check a received mail
\item[{\tt processall}]  distribute incoming mails to {\tt process} and
                         {\tt service}
\item[{\tt process}]     process a bug mail
\item[{\tt service}]     process a {\tt control@} or {\tt request@} mail
\item[{\tt expire}]      expire closed bugs after 28 days
\item[{\tt rebuild}]     rebuild index files
\end{description}
}
\end{Figure}

Except for {\tt receive} and {\tt rebuild}, all the scripts are invoked
from {\tt cron}, either directly (such as {\tt spamscan} and {\tt
expire}), or indirectly (such as {\tt process} and {\tt service}). This
does lead to some delays, notably up to a fifteen minute delay in
processing incoming mail due to the frequency of {\tt processall}
invocations, but it also effectively rate limits the BTS in cases such
as when it decides to get caught in a loop of repeatedly replying to
itself or another BTS, logging each message under some poor bug report.

\subsection{CGI scripts}

The CGI scripts are largely distinct from the core scripts, though they
do make use of some of the {\tt errorlib} functions. The CGI scripts are
listed in Figure~8. By and large, the HTML output is hardcoded within the
CGI scripts, as is some of the special handling of particular severities
or tags, and as a result, the CGI scripts are in many cases the worst
example of Debian-specificity within \debbugs{}.

\begin{Figure}
\caption{CGI scripts}
{\footnotesize
\begin{description}
\item[{\tt bugreport.cgi}] Display the contents of a single bug report
\item[{\tt pkgreport.cgi}] Summarise bug reports for a package, submitter, etc
\item[{\tt pkgindex.cgi}]  List packages, severities, etc with a count of bugs
\item[{\tt common.pl}] CGI-relevant helper functions
\end{description}
}
\end{Figure}

Options are passed to the CGIs through parameters, such as {\tt
\&reverse=yes} to obtain the old-style reversed log format for bug
reports (most recent information first). These parameters cover both
user customisation (such as {\tt \&reverse=yes}) and search customisation
(such as {\tt \&exclude=wontfix}).
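
For scripted access, such parameters can be assembled with any URL
library. The following Python sketch shows the idea; the base URL is a
placeholder and the helper name {\tt cgi\_url} is hypothetical, while
the {\tt reverse} and {\tt exclude} parameters are the ones mentioned
above.

{\footnotesize
\begin{verbatim}
from urllib.parse import urlencode

# Placeholder base URL, for illustration only.
BASE = "http://bugs.example.org/cgi-bin/"

def cgi_url(script, **params):
    """Build a URL for one of the CGI scripts."""
    return BASE + script + "?" + urlencode(params)

url = cgi_url("pkgreport.cgi",
              reverse="yes", exclude="wontfix")
\end{verbatim}
}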

{\tt bugreport.cgi} is relatively straightforward, except for the code
that needs to parse the {\tt .log} format. {\tt bugreport.cgi} also
provides a facility for downloading all mails related to a bug in mbox
format, and to download a specific attachment from a mail.

By contrast, {\tt pkgreport.cgi} and {\tt pkgindex.cgi} both have fairly
straightforward tasks. The factor that makes implementation difficult is
that they need to do them well, and that the number of bugs Debian expects
them to deal with causes both speed and convenience to become problems. {\tt
pkgreport.cgi} in particular is the primary interface through which
developers access \debbugs{}, and as such has to be both responsive and
relatively easy to navigate.

\subsection{Hacking}

The \debbugs{} source is available in CVS\footnote{% 
{\tt CVSROOT=:pserver:anonymous@cvs.debian.org:/cvs/debbugs},
module {\tt source}}. Debian developers can also find a full copy
of Debian's \debbugs{} install in {\tt /org/bugs.debian.org} on {\tt
merkel.debian.org}.

This paper is not complete documentation of the \debbugs{} source; please
don't mistake it for such. If you choose to hack on \debbugs{} you do
so at your own risk, and you may wish to consult a licensed attorney to
assist in preparing a will. Your author and the Debian project disclaim
all responsibility for any lost time, productivity or limbs that may
result from any activity related to or inspired by this document.

\end{document}
