Should I use DB to store file ?

Should I use DB to store file ?

This Story Was Originally Published on my [email protected] here.

Image for postStore Files in DB or File System?

I have recently encountered a large scale system which used Oracle DB for storing images, pdf?s, doc?s etc in the database.

First Question which comes to the mind is : Whether we should store files in database ?

Let us first understand pros and cons of either approaches.

Why store in FileSystem ?

  • read/write to a DB is always slower than a file system.
  • your DB backups become huge, which makes restoration & replication slower.(Its a problem where daily size is in TB?s)
  • access to the files now requires going through your Application and Database layers.(It increases the application memory requirements)
  • allows usage of lightweight web-server/CDN/Amazon S3 for serving.
  • cost effective, file servers are much cheaper compared to database.
  • avoids database overhead with its CRUD operations.
  • easier in cases where files (video, audio etc) are to be shared with third party providers.

Then why store files in DB ?

  • DB provides ACID compliance(Atomicity, Consistency, Isolation, Durability) for each row.
  • DB provides data integrity between the file and its metadata.
  • Database Security is available by default.
  • Backups automatically include files, no extra management of file system necessary.
  • Database indexes perform better than file system trees when more number of items are to be stored.
  • Images smaller than 64kb and the storage engine of your db supports inline Blobs (InnoDB puts only 767 bytes of a TEXT or BLOB inline, MyISAM puts inline), it improves performance further as no indirection is required (Locality of reference is achieved, more about this can be read here).
  • File deletion/updation is always in sync with row operations, no extra maintenance needed.

Above list is not a comprehensive list of pros and cons of both DB and File-system, but covers major points. Hence after reading them, I got

Confused?

So which one should I use/prefer ?

File-system or Database ?

I would say it depends on project to project and company to company. Say for e.g when the project was started the need of having a separate file system and its management adds to extra costs, since the no of files will not be huge. Although if usage of files from day 1 is going to high then it makes sense to use a file-system based approach over DB.

Personally I am inclined towards putting files on separate file-system as I prefer to avoid loading file (database blob) in application memory and is more cost effective solution over database.

Hence I will store a simple relative path of files on disk(SFTP server may be) in the database than storing in a BLOB.

Other Alternatives which I explored:

Mongo GridFS(NoSql Database)

When UIDAI (Aadhar) project started in India to provide an identification no to every Indian, they needed to store volumes of bio-metric data of every Indian and they chose Mongo at the start (of which GridFS is at heart for storing images).

One of reasons for its choice is its effective storage of images, let me explain how. Let us put a good quality large size image say 2 MB at-least, we see the following :

$ mongofiles put some-large-image.jpg $ mongo> show collectionsfs.chunksfs.files> db.fs.files.find().pretty();{ “_id” : ObjectId(“5d0c95d94fcf6e4f54027028”), “chunkSize” : 261120, “uploadDate” : ISODate(“2019-06-21T08:31:21.558Z”), “length” : 280540, “md5” : “44596806f22d6144a239aef49dbd8760”, “filename” : “some-large-image.jpg”}

Now where?s my data ? if you do db.fs.chunks.find().pretty() you will get result some thing like ASCII string chunks as below :

{ “_id” : ObjectId(“5d0c95d94fcf6e4f54027029”), “files_id” : ObjectId(“5d0c95d94fcf6e4f54027028”), “n” : 0, “data” : BinData(0,”/9j/4AAQSkZJRgABAgAAZABkAAD/7AARRHVja3kAAQAEAAAAZAAA/+4AJkFkb2JlAGTAAAAAAQMAFQQDBgoNAADVigACHK4AAxfkAARH2v/bAIQAAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQICAgICAgICAgICAwMDAwMDAwMDAwEBAQEBAQECAQECAgIBAgIDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMD/8IAEQgDhAZAAwERAAIRAQMRAf/EAU4AAQACAgMBAQEAAAAAAAAAAAADBwYIAgQFCQEKAQEAAQUBAQEAAAAAAAAAAAAABgECAwUHBAgJEAABBAIBAwIFAwQCAAUFAAAEAAIDBQEGBxAgMEARYDESEwhQcBQhQRUWIhewwDIjJTM0NTYYEQACAgADAwUKCwQECgcFBAsBAgMEABEFIRITMVFhcRQwQYGRMkJScpIGICJigqIjQ3OTJBUQQKFTM2ODo1BgcLGys2SElCXBwtI0RFQ10cN01Bbw06UHsMDipMRFVXW11TYSAAECAgQJBwkGBgAEBQQCAwECAwARIaESBDFRYXGRsSJiIyAwgcHREyQQ8EEycpKisgVA4VLC0jNQYHBCghTxQ1M04mPTFQawwIOTsyVzo8MTAQABAwMDAwQCAwEBAQEAAAERACExQVFhcYGREKGxIDDwwUDRYOHxcFCwwP/aAAwDAQACEQMRAAAB+MspjwAAAAAAAAAAAAAAAAAAAAAAA………

Lets exclude the data field , the response looks like this

> db.fs.chunks.find({},{“data”:0}).pretty();{ “_id” : ObjectId(“5d0c95d94fcf6e4f5402702a”), “files_id” : ObjectId(“5d0c95d94fcf6e4f54027028”), “n” : 1}{ “_id” : ObjectId(“5d0c95d94fcf6e4f54027029”), “files_id” : ObjectId(“5d0c95d94fcf6e4f54027028”), “n” : 0}

So we see we have two documents here, both pointing to the same fs.filesthrough their files_id. The value of n specifies the ordering of the chunks so the data doesn’t get messed up – that’s the GridFS convention, nothing more.

What is GridFS Chunks ?

MongoDB?s document size has an upper limit of 16MB, the idea of chunking is to allow streaming, i.e. allow users to download (or stream) a file without having the server to store the whole thing in RAM.

This mechanism is heavily used in streaming videos, imagine movie of 2 GB HD quality loaded in server?s RAM, that?d be really wastage of precious resources. The default chunk size of GridFS is 256kB which is supposedly a good compromise of overhead (more queries to the database) and little memory use, but it can be configured.

Streaming hardly makes sense for small images and delivering those images will require atleast two to three round-trips to the database instead of one: One to find the fs.files document, and two to get the chunks. Even if we increased the chunk size, we’d still need two round-trips before we can even start to deliver the file.

So to sum it up storing files in mongo provides all the features of mongo like backup/replication/sharding/HA etc. But adds extra overhead of querying one or more collections to retrieve a single chunk.

Further Reading :

  1. Database vs FileSystem
  2. SQL Server ? FileStream Data
  3. MongoDb ? GridFS
  4. MongoDb ? When to use GridFS
  5. Read about Aadhar below :

Questions ? Suggestions ? Comments ?

What?s next? Follow me on Medium to be the first to read my stories.

18