Friday, August 03, 2007

Ideas For Blog System File Storage

How should one best store the data for a simple, open blog system? One approach is to use a relational database and SQL queries. That is easy for those familiar with such technology, and allows one to depend on the database layer for concurrency, etc. Others prefer to stick more-or-less plain files into the local filesystem. This is easy for people who don't have database experience or access.

Before I made a decision on this I thought I'd better get a good idea of what I wanted to accomplish. I sketched out an entity-relationship diagram. Since I didn't have the back of a napkin handy, I used paper.

I realised that right away I'd need several tables. POST will hold the articles themselves, but could also be used for news, want ads, or other categories of postings, as described in CATEGORY. PERSON is the user of the system, the article author, etc. To describe what sort of a person, I added a ROLE table. Each role allowed one or more RIGHTs, allowing people to author articles, edit other people's articles, add comments, perform admin functions, and so on.

What is a post without COMMENTs? These attach themselves to a given article and also a person, as author. They must also thread and so require a link back to a previous comment. In order to enable useful searching and grouping, articles need to be tagged with different descriptive labels. However I don't think this is best accomplished with a new table.

With six tables in the database, this seemed too complex to manage in flat files. And what would the advantages of that be, anyway?
1. Works anywhere.
2. Low impact.
3. Simple to extend and diagnose.
4. Human readable.

But on the downside:
1. Concurrency must be manually handled.
2. Performance?
3. Manual search routines required.
4. Potentially messy organisation.

I don't think concurrency is a big issue for the articles themselves, which would rarely be changing. A single-user lock system similar to that used for wikis would be sufficient. Comments might also be handled similarly, but here there is more chance of contention since in a file-based system it would be difficult to handle each comment as an atomic unit. (Unless each comment was in a new file... no thanks.)

When retrieving one article at a time for reading, performance should not be an issue. However aggregate searches ("Give me all of Jean's articles for May") would require some sort of a globbing search over the directory structure. Still, there are fast ways to do that as well.

The need to write the searches in the development language (Python) could be seen as a bonus, since it does away with the need to comprehend SQL. We wouldn't be getting any automatic optimisations, but we wouldn't be needing them with the relatively simple text searches envisaged.

The organisational issue is down to design as much as anything else. Managing a separate database server, configuration files, etc. can get messy too.

So at this point I'm leaning towards a flat-file system, just to keep things very very simple. But this isn't going to work without rethinking the database and simplifying the data model. Third normal form is not going to cut it... we need to get flatter.

First of all I can get rid of CATEGORY and just store that as a text string in POST. For any given site it'll only be one of three or four values. Likewise RIGHT can be incorporated directly into ROLE. I'll abbreviate each right as a single letter. We need PERSON since we have to have a user table in any system that requires a login. And COMMENT must remain as a logical unit.

How will we store these four logical tables? COMMENTs are always read as affixed to the article they are for. Perhaps we can have one file for POST / COMMENT? In order to facilitate searching and to provide pretty URLs, I'd like to have these in folders like these:

The first element of the path corresponds to the category, the next three to the post date, and the final element is the file name (also article title). The file stat itself can be read for the last modification timestamp. This leaves the author, tags, description and content itself to be formatted somehow within the file.

ROLE is simple and fixed enough (once a system is initiated) to store in code as a data structure. That leaves only PERSON to deal with. I suppose it would be consistent to do this, embedding roles as part of the path:

But I'm not sure if that isn't one giant mess for hundreds of users. Maybe a single file with one line per person makes more sense.

And what format for these files? Some sort of an XML-based RSS-friendly format would save time. Atom 1.0 supports relative URIs, XML namespaces, the definition of an XML schema and so on, whereas RSS 2.0 does not. So let's make these files Atom fragments.

At this point I get the sinking feeling I'm merely reinventing a database. Does this design make any sense for a mini-CMS? Are the advantages of a database-free solution worth the disadvantages? What decisions have been made by others facing the same task? Comments please!



Hegga said...

How about using one yaml file for each "table"?

Tim Parkin said...

Personally I'd use a database where possible. After all most databases are highly optimised file system storage after all.. sqlite or postgres would be my recommendation. I've been down the file route and it inevitably ends up painful at some point (unless you've got trivial amounts of data and you don't want to implement a search/filter)

Anonymous said...

I think a file based system can be simplified by delegating functionality in other tools. For example, you can use a VCS for managing users. With git or mercurial, you can handle authentication and permissions outside your application. Post metadata can be handled by a small header section specially formated in each post,for example:
%title: A brillant post
%author: foobar
%tags: foo, bar
%date: 2010/21/33


What about delegating commentsin a service like DISQUS?

Just quick ideas

Sahil said...

Thanks for sharing such valuable information.

Post a Comment