Chasing Linux Kernel Archives

archives

Kernel development is truly impossible to keep track of. The main mailing list alone is vast beyond belief. Then there are all the side lists and IRC channels, not to mention all the corporate mailing lists dedicated to kernel development that never see the light of day. In some ways, kernel development has become fundamentally mysterious.

Once in a while, some lunatic decides to try to reach back into the past and study as much of the corpus of kernel discussion as he or she can find. One such person is Joey Pabalinas, who recently wanted to gather everything together in Maildir format, so he could do searches, calculate statistics, generate pseudo-hacker AI bots and whatnot.

He couldn't find any existing giant corpus, so he tried to create his own by piecing together mail archived on various sites. It turned out to be more than a million separate files, which was too much to host on either GitHub or GitLab. He asked the linux kernel mailing list for suggestions on better hosting opportunities. Although he acknowledged, "It's possible I'm the only weirdo who finds this kind of thing useful, but I figured I should share it just in case I'm not."

Joe Perches suggested plumbing the archives at kernel.org/lore.html, which go back decades. But Joey said he'd tried that, and he found it all but impossible to convert those archives to the Mailbox format he wanted. Instead, he'd spent the previous several weeks scraping the lkml.org archive and scripting his own conversion routines.

Konstantin Ryabitsev remarked:

The maildir format is kind of terrible for LKML, because having millions of messages in a single directory is very hard on the underlying FS. If you break it up into multiple folders, then it becomes difficult to search. This is the main reason why we have chosen to go with the public-inbox format, which solves both of these problems and allows for a very efficient archive updating and replication using git.

Meanwhile, Jasper Spaans raised his eyebrows at Joey's statement that he'd gotten more than a million separate files by scraping lkml.org. Jasper said:

First of all, there are more than 3M messages stored in the lkml.org database, so I guess you've missed some messages or something is really broken. Besides, unless you figured out how to get to the raw data, you've just scraped a rendering which discards stuff like pgp signatures etc and has very incomplete headers. Unless you don't care for those of course.