I think there are three scenarios here (maybe more):
- storing triples for ontologies, graphs, etc. (e.g., RDF, OWL, …)
- storing denormalized data, ideally where the columns compress nicely and the data are typically processed in (primary) key order
- conventional row-oriented data (especially if normalized or almost normalized).
#2 is where column-oriented databases work best, although they can be used in all three situations (e.g., Google's Bigtable is column-oriented). This works with either a plain (primary) key-value model (NoSQL) or with multiple indexes; the latter are essentially tables that map values back to the (primary) keys.
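To make the point about indexes concrete, here's a minimal Python sketch (the names `by_city`, `lookup_by_city`, etc. are made up for illustration): a key-value store plus a secondary index that is itself nothing more than a mapping from a column value back to the primary keys that hold it.

```python
from collections import defaultdict

rows = {}                    # primary key -> row (a dict of column values)
by_city = defaultdict(set)   # secondary index: column value -> set of primary keys

def insert(pk, row):
    rows[pk] = row
    by_city[row["city"]].add(pk)

def lookup_by_city(city):
    # The index yields primary keys; rows are then fetched by key.
    return [rows[pk] for pk in by_city[city]]

insert(1, {"name": "alice", "city": "Oslo"})
insert(2, {"name": "bob", "city": "Oslo"})
insert(3, {"name": "carol", "city": "Lima"})
print(lookup_by_city("Oslo"))
```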
The original rationale for column databases was the slow seek time of disks compared to relatively fast scanning through the data once read into memory. So, taking Bigtable as an example, the data are segmented into "shards"; each shard covers a range of keys and is stored in an sstable as a set of key-value pairs. The data can be compressed, so accessing the value for a specific key may require scanning (and decompressing) the data, but the scan is limited to a single sstable or a chunk within the sstable (e.g., 64K).
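A rough sketch of that access pattern, with zlib standing in for whatever compression the real sstable format uses, and a single sorted chunk of key-value pairs compressed as a unit:

```python
import bisect, json, zlib

# One "chunk": sorted key-value pairs serialized and compressed as a unit.
pairs = [(f"key{i:05d}", f"value{i}") for i in range(1000)]
chunk = zlib.compress(json.dumps(pairs).encode())

def get(chunk, key):
    # Reading any key costs decompressing the whole chunk, but the work is
    # bounded by the chunk size (e.g., ~64K), not by the size of the table.
    pairs = json.loads(zlib.decompress(chunk))
    keys = [k for k, _ in pairs]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return pairs[i][1]
    return None

print(get(chunk, "key00042"))   # -> "value42"
```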
[A detail: sstables are write-once, so there's actually a "stack" of sstables for each shard of the data; each stack is periodically reprocessed (compacted) into a single sstable. This is somewhat reminiscent of the way PostgreSQL stores data so as to minimize write locks.]
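A toy version of that compaction step, assuming each sstable in the stack is just a dict and newer entries shadow older ones (real sstables are sorted, immutable files, so this is only the merge logic):

```python
def compact(stack):
    # stack[0] is the oldest sstable, stack[-1] the newest; later writes win.
    merged = {}
    for sstable in stack:
        merged.update(sstable)
    return dict(sorted(merged.items()))   # a single sstable, sorted by key

stack = [
    {"a": 1, "b": 2},       # oldest
    {"b": 20, "c": 3},      # newer write of "b" shadows the old one
]
print(compact(stack))       # {'a': 1, 'b': 20, 'c': 3}
```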
Each column in the database is indexed by the primary key. The indexes at the Bigtable level point to shards (sstables); the assumption is that the in-memory processing of an sstable (including decompressing the data) is much faster than the I/O. For a purely in-memory database, it is less clear that compressing the columns is worthwhile. If the data are typically processed in primary key order, it's probably worth compressing the column data.
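To illustrate why key-ordered processing makes column compression attractive, here's a small run-length encoding sketch; RLE is just one of several schemes a column store might use, chosen here for illustration:

```python
from itertools import groupby

# A column read in primary-key order tends to have long runs of repeated
# values, which run-length encoding shrinks dramatically.
column = ["US"] * 5000 + ["NO"] * 3000 + ["PE"] * 2000

def rle_encode(values):
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_count(encoded, wanted):
    # Aggregation can work directly on the compressed representation.
    return sum(count for v, count in encoded if v == wanted)

encoded = rle_encode(column)
print(encoded)                  # [('US', 5000), ('NO', 3000), ('PE', 2000)]
print(rle_count(encoded, "NO")) # 3000
```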
For case #1, it's not clear that a column database has any advantage, either in performance or space, especially if the data are atoms, whose representation is already a kind of compression (atoms are represented as integers, not strings); I suspect that a conventional row-oriented data structure would actually work better here. Similarly for case #3.
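A sketch of what "atoms as integers" buys for case #1: triples stored row-wise as machine integers, with one interning table holding the strings (the helper names here are hypothetical):

```python
from array import array

atom_ids = {}          # atom name -> small integer
atom_names = []        # integer -> atom name
triples = array("i")   # flat row-oriented storage: 3 ints per triple

def intern(name):
    if name not in atom_ids:
        atom_ids[name] = len(atom_names)
        atom_names.append(name)
    return atom_ids[name]

def add_triple(s, p, o):
    triples.extend([intern(s), intern(p), intern(o)])

add_triple("socrates", "isa", "human")
add_triple("human", "subclass_of", "mortal")

# Each triple is three machine integers; each string is stored only once.
for i in range(0, len(triples), 3):
    print(tuple(atom_names[t] for t in triples[i:i+3]))
```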
So, if we want a more compact data structure for large tables containing mostly atoms, something similar to a conventional row-oriented database is probably best. If specific needs arise, a column-oriented data format could be added later.
If we’re talking about data that’s too big to fit in RAM, we should probably also consider a way of distributing the data over multiple nodes. For that, something like MonetDB might make more sense.
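For the distributed case, the simplest illustration is partitioning by key; this hash-partitioning sketch is only illustrative (Bigtable actually partitions by key range, and MonetDB has its own distribution machinery), with made-up node names:

```python
import zlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical node names

def node_for_key(key):
    # Hash partitioning: each key maps deterministically to one node.
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

for key in ["user:1", "user:2", "user:3", "user:4"]:
    print(key, "->", node_for_key(key))
```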
For more on the advantages and disadvantages of column-oriented storage, the Wikipedia article "Column-oriented DBMS" is a reasonable overview.