My Document Database is a Filesystem
We’re in the process of migrating some data. We have a few tens of thousands of records totaling approximately 1GB. This is a normal data migration process, right?
- Download data from the source APIs.
- Save the records in a database. SQLite is great for this.
- Import them into the upgraded system.
- Profit!
But why the database? Perhaps there is a simpler solution.
The records have a nice recursive structure.
- Conversations
- Messages
- Attachments
- Messages
This hierarchy leans me toward a document-oriented storage scheme. But why do we always lean on a database for storing our data? Remember the flat-file database? I felt it was time to bring that back.
In Python, there are already modules like FlatPages for Django and Flask, but we’re going for simple.
"""
Maintains the following structure:

/database-name
    /channels
        /channel-id
            /messages
                /message-id
                    message.json
                    /attachments
                        /attachment-id
                            attachment.json
                            asset
"""
import os


class FlatFileDb:
    def __init__(self, base_path):
        # Root directory of the "database".
        self.db = base_path

    def _create(self, p):
        # Create the directory (and any missing parents) if needed, then return it.
        os.makedirs(p, exist_ok=True)
        return p

    def get_channel_path(self, channel_id):
        return self._create(os.path.join(self.db, "channels", channel_id))

    def get_message_path(self, channel_id, message_id):
        return self._create(os.path.join(self.get_channel_path(channel_id), "messages", message_id))

    def get_attachment_path(self, channel_id, message_id, attachment_id):
        return self._create(os.path.join(self.get_message_path(channel_id, message_id), "attachments", attachment_id))
There are a few simple principles here:
- Directories map to collections of objects.
- The name of each directory is the key or id of that object.
- Each directory can have “sub-objects” which themselves are directories.
- Files are records.
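To make those principles concrete, here is a minimal sketch of writing and reading a record. It assumes records are JSON-serializable dicts; `save_message`, `load_message`, and `list_messages` are hypothetical helper names for illustration, not part of the class above.

```python
import json
import os
import tempfile


def save_message(db_root, channel_id, message_id, record):
    """Write one record to <db>/channels/<channel>/messages/<message>/message.json."""
    message_dir = os.path.join(db_root, "channels", channel_id, "messages", message_id)
    os.makedirs(message_dir, exist_ok=True)  # the directory name is the record's key
    path = os.path.join(message_dir, "message.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f)  # the file is the record
    return path


def load_message(db_root, channel_id, message_id):
    """Read a record back by following the same path convention."""
    path = os.path.join(db_root, "channels", channel_id, "messages", message_id, "message.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def list_messages(db_root, channel_id):
    """List message ids in a channel: the ids are just the directory names."""
    return sorted(os.listdir(os.path.join(db_root, "channels", channel_id, "messages")))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        save_message(root, "general", "m1", {"text": "hello"})
        print(load_message(root, "general", "m1"))
        print(list_messages(root, "general"))
```

Listing a collection is just listing a directory, which is the whole appeal.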
While this system doesn’t require it, you can encode tuple storage into files. Thus a key/value system can be implemented where filenames are the keys and values are the content of the file.
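A rough sketch of that key/value idea, assuming string values and keys that are plain filenames. `FileKV` is a made-up name for illustration only.

```python
import os
import tempfile


class FileKV:
    """Hypothetical key/value store: one file per key inside a single directory."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        # Reject keys that are empty or would point outside the directory.
        if not key or key in (".", "..") or os.path.basename(key) != key:
            raise ValueError(f"invalid key: {key!r}")
        return os.path.join(self.directory, key)

    def put(self, key, value):
        # The filename is the key, the file content is the value.
        with open(self._path(key), "w", encoding="utf-8") as f:
            f.write(value)

    def get(self, key, default=None):
        try:
            with open(self._path(key), encoding="utf-8") as f:
                return f.read()
        except FileNotFoundError:
            return default

    def keys(self):
        return sorted(os.listdir(self.directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        kv = FileKV(os.path.join(tmp, "meta"))
        kv.put("status", "migrated")
        print(kv.get("status"), kv.keys())
```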
Thus, we now have a system where we can store and manipulate our data. But can we query it? Absolutely.
How many messages?
-- SQL
SELECT COUNT(*) FROM prod_db.messages
-- Flat File
find db -name message.json | wc -l
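If you would rather stay in Python, the same count can be sketched with `os.walk`. The function name `count_records` is an assumption for illustration, and unlike `find` this version only counts files, not directories that happen to share the name.

```python
import os
import tempfile


def count_records(db_root, filename="message.json"):
    """Count files named `filename` anywhere under db_root."""
    # Each directory contributes 1 if it holds the file, 0 otherwise.
    return sum(files.count(filename) for _root, _dirs, files in os.walk(db_root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for message_id in ("m1", "m2"):
            d = os.path.join(root, "channels", "c1", "messages", message_id)
            os.makedirs(d)
            open(os.path.join(d, "message.json"), "w").close()
        print(count_records(root))
```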
Full-text search is easy too. No need to install fancy column types or extensions. Just use grep.
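For cases where the search has to happen inside the migration script, here is a small Python sketch of the same idea. `search_messages` is a hypothetical helper; it does a plain substring match on the raw file text, so it matches JSON keys as well as values.

```python
import os
import tempfile


def search_messages(db_root, needle, filename="message.json"):
    """Return sorted paths of record files whose raw text contains `needle`."""
    matches = []
    for root, _dirs, files in os.walk(db_root):
        if filename in files:
            path = os.path.join(root, filename)
            # errors="replace" keeps the scan going if a file is not valid UTF-8.
            with open(path, encoding="utf-8", errors="replace") as f:
                if needle in f.read():
                    matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        d = os.path.join(root, "channels", "c1", "messages", "m1")
        os.makedirs(d)
        with open(os.path.join(d, "message.json"), "w", encoding="utf-8") as f:
            f.write('{"text": "hello world"}')
        print(search_messages(root, "hello"))
```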
What about archiving the data?
-- SQL
-- Custom proprietary commands
-- Flat File
tar cf database.tar db
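The same archive can be produced from Python with the standard `tarfile` module. This is a sketch under the assumption that the archive is written somewhere outside the database directory; `archive_db` is a name made up for the example.

```python
import os
import tarfile
import tempfile


def archive_db(db_root, archive_path):
    """Write db_root into an uncompressed tar archive and return the archive path."""
    with tarfile.open(archive_path, "w") as tar:
        # arcname stores entries under the directory's own name, not its full path.
        tar.add(db_root, arcname=os.path.basename(os.path.normpath(db_root)))
    return archive_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "db")
        os.makedirs(os.path.join(db, "channels"))
        out = archive_db(db, os.path.join(tmp, "database.tar"))
        with tarfile.open(out) as tar:
            print(tar.getnames())
```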
Traditional databases provide a swath of elegant features, including advanced querying, schemas to provide data consistency, and indexes to make queries fast. However, sometimes that’s all overkill. Sometimes, just write some files.