SQL vs. O-O

Interesting question to ask, at this time and date, is if using SQL DBs make much sense if anything more than very simple object hierarchies are stored into DB. Plain fact is that there is huge mismatch in models used in relational DB and O-O (like java). Mapping Object structures into DB lead very easily into big mess.

For example, with O-O it’s natural that every bit of information is stored into separate tables. Every possible type could get their own table, and all sub-elements in them would get other tables. Now, this is nice and fine with small set of elements, but if amount of elements is increased considerably, then model starts to suffer, since it’s simply not possible to access data efficiently, since querying any element, requires actually making separate queries in multiple tables. Nested object hierarchies easily cause that trying to do any table joins isn’t really feasible, since that will immediately break encapsulation of O-O, and thus will create highly undesired dependencies.

So, is it so that using relational DB is relevant only if stored data is ”relational”, i.e. data model matches strictly DB structure, since that is only case were SQL DB really scales up. With O-O, generally, access pattern doesn’t match at all access pattern for which relational DB (and SQL) is designed for.

As generic fix for this problem is thrown usage of caching in java side. But in reality doing such is really such bad patch for major design problem, which comes from the mismatch of relational vs. object models.

Cannot say, however, there would be either any better solutions right now. Regardless of defects, relational DBs have their own benefits (even if those benefits are more-or-less nuisance from O-O point of view).

In that sense, concepts like Mongo DB are interesting in conceptual level. I wouldn’t start saying this specific DB is holy grail, but few ideas from concept can be borrowed.

For example,
Could it be that structures like used on that specific sample are stored in relational DB. Thus instead of using N -tables to store data for single element, it would be just single blob in one table column in one table (possible few core fields stored as separate columns for quick access). In such data model, retrieval of full object can become blinding fast, single only one DB table is needed to be read. However, since string processing can be rather expensive in java, some benefits can be lost, unless parsing can be delayed as far as possible. For such delayed parsing, those additional quick access fields would also help considerably.

Storage format of such blob column wouldn’t of course be TEXT, but binary, ex. some suitable structural binary format, like Binary JSON or such format, which allows persisting semi-arbitrary data structures in highly efficiently stored format. Such binary format also helps on parsing side considerably, easily giving 10x … 100x fold performance improvement.

Still it would be possible to have different tables (striping data into smaller tables to improve indexing performance), divided by some class type hierarchy or such criterion.

All references between elements would be, of course, implemented as UUID values, thus there wouldn’t be neither much need for DB sequences either. Such data model could be also easily load balanced between multiple servers, since updates are concerning only single rows in DB (like mongoDB and others are demonstrating).

This leads into another step, which is easy versioning support. Basically every update would always just add new row in DB, rows would be never deleted. Even element deletion would be handled as writing new ”deleted” revision of element. This could lead that this server load balancing could actually expand into bi-directional replication, since updates are very atomic.

Regarding such bi-directional replication, however, has one dilemma to resolve, namely record ordering. If two updates are written concurrently at two different servers, then how system can determine order between them. Trusting OS time is out of question, and if assuming that sync between servers is asynchronous (meaning that write in one server cannot depend from availability of another) then using DB sequence doesn’t either do the trick.

Perhaps such ordering issue could be addressed by some kind of token, which determines what elements any server is allowed to update. Servers could negoatiate owner ship of such token with some controller manner. Or possibly it could be configured by administrator, defining explicitly striping of database. Well, I guess so in Mongo DB replication this problem was resolved by not allowing concurrent bi-directional updates. As generic solution, that is likely solution which is used most DB replication solutions, since making something bi-directional, likely requires making assumptions about data model. Which means that generic solution cannot handle such easily.

Of course, if scaling into large amount of data, writing must be possible via all servers, otherwise single server will become bottleneck (and also fragile point of whole system).

References:
Based into several presentations hold at JFokus 2012 at Stockholm
Mongo DB
UUID
Binary JSON
The ”Truth” Behind Unstructured Data

22.2.2012 / java

ma	ti	ke	to	pe	la	su
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Vastaa Peruuta vastaus