Persistency Buffering and Sync

In the persistency library, db_attach/2 has a sync(Sync) option with these values:

One of close (close journal after write), flush (default, flush journal after write) or none (handle as fully buffered stream).
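For context, here is roughly how I am attaching the database (a minimal sketch; the predicate and file names are placeholders, and the sync value is what I am asking about):

```prolog
:- use_module(library(persistency)).

:- persistent
    fact(key:atom, value:atom).

% sync/1 value is one of close, flush (the default) or none.
:- db_attach('facts.journal', [sync(flush)]).
```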

Two questions:

  1. For flush and close: I am assuming a “write” is a call to one of the generated assert_<Name>, retract_<Name> or retractall_<Name> predicates?
  2. For none: What does “fully buffered stream” mean? I was assuming this meant all changes would be fully buffered and written only on db_detach/0, but my tests show that the file is updated after a query is run, and the docs for db_detach/0 say that “the file is not affected”. When does the stream get flushed? Between queries?

Finally, if my single-threaded, single-process code is the only thing doing the persistency, I’ll never need the db_sync/1 call, right? Or do I need to call it periodically for GC of the journal?


Yes.

Buffering as done by output streams: bytes are collected in a buffer that is written to disk when the buffer is full. The current buffer size is 4096 bytes (4K). Writes that just go into the buffer stay there until a new write fills the buffer; there is no time limit. If Prolog terminates normally, it flushes the buffers. It also tries to flush the buffers on a crash, but success cannot be guaranteed in that scenario.

As a result, the journal may miss some actions, and the last action (which is represented as a normal Prolog term) may be incomplete.
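If you do use sync(none) and want to be sure at some point that everything buffered has reached the disk, closing the journal should do it (if I recall the db_sync/1 options correctly; the journal is reopened on the next write):

```prolog
% Sketch: force buffered journal data onto disk by closing the journal.
% checkpoint/0 is just a made-up helper name.
checkpoint :-
    db_sync(close).
```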

The only thing you may need to do occasionally is to gc the journal. This is not done by default. GC only makes sense if there are retract calls on the persistent predicates.
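A sketch of what that might look like, assuming you use db_sync(gc) for it (compact_store/0 is a made-up name, and when to call it is up to you):

```prolog
% Rewrite the journal without the retracted facts.  For a low-throughput
% store, doing this occasionally, e.g. at shutdown, should be plenty.
compact_store :-
    db_sync(gc).

:- at_halt(compact_store).
```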

Whether or not this library is the best way to deal with persistent data depends a lot on your use case:

  • Do you want/need to represent the data as clauses?
  • How bad is loss of data?
  • How volatile is your data?
  • … I guess a lot more …

Buffering options make sense. Thanks.

The data is very simple: atomic triples like data(rock1, isA, rock). No real need for clauses.
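Concretely, something like this is what I have in mind (a sketch; the argument types and file name are just what I would pick):

```prolog
:- use_module(library(persistency)).

:- persistent
    data(subject:atom, relation:atom, object:atom).

:- db_attach('triples.journal', [sync(flush)]).

% library(persistency) generates assert_data/3, retract_data/3 and
% retractall_data/3 wrappers that also record the change in the journal.
add_triple(S, R, O) :-
    assert_data(S, R, O).

del_triple(S, R, O) :-
    retract_data(S, R, O).
```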

Not very, if I’m comparing to transactional database systems, at least. A given query might assert and retract a total of 10 triples, and queries are generated by humans interacting with the system, so the throughput is human speed. Like 1 query per 5 seconds max.

Data loss is not that bad, assuming it is lost consistently. Meaning: there are clear transaction boundaries where the 10-or-so triples need to be asserted/retracted as a unit. Basically, the data modified during execution of a single query needs to commit or abort as a unit. Losing all the updates from a query is fine; having some saved and some lost would be an issue. That’s one of the reasons I was hoping the none option buffered changes during a query and flushed at the end of the query: I was thinking it might provide a kind of simple transaction boundary to guarantee consistency for data modified during the query.

Any other options for a really simple transaction requirement like this? I suppose I could just use a “corrupt flag” that gets set before and cleared after the query and just treat the store as corrupt if it isn’t clear…
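Something along these lines is what I mean by the corrupt flag (a rough sketch; run_query/1, the helper predicates and the 'store.dirty' file are all made up, not part of library(persistency)):

```prolog
% Rough sketch of the "corrupt flag" idea.

run_query(Goal) :-
    set_dirty_flag,
    (   catch(Goal, _, fail)
    ->  clear_dirty_flag            % all updates made it: store is consistent
    ;   true                        % leave the flag set: updates may be partial
    ).

set_dirty_flag :-
    open('store.dirty', write, Out),
    close(Out).

clear_dirty_flag :-
    delete_file('store.dirty').

% At startup: if the flag survived, the last query's updates may be
% incomplete, so treat the store as suspect.
store_suspect :-
    exists_file('store.dirty').
```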