[ZODB-Dev] Optimizing RelStorage Packing for large DBs
Jens W. Klein
jens at bluedynamics.com
Sat Nov 16 02:01:00 CET 2013
I started a new packing script for RelStorage (history-free,
PostgreSQL). It is based on incoming reference counting. Why do we need
this? First a bit of storytelling (fast-forward down to the section "New
Script" if you're bored).
The Problem
-----------
So we had a database with ~298,000,000 objects stored, grown over 4-5
months. The beast was never packed. We import/sync data on a weekly
basis: a complex Plone-based product catalog optimized for the web, in
27 languages with ~18,000 products each, each product with several
variants (stored in annotated BTrees). Lots of stuff is deleted and
recreated on import. So we had a lot of garbage in the DB.
We're using history-free RelStorage. RelStorage performs much better on
concurrent writes and speeds up our import a lot. First on MySQL, now on
PostgreSQL. Even if PostgreSQL does not solve all problems, it seems to
be the stable choice. At least once we had a frozen server. We blame
MySQL, but well, it's CentOS and there are several other problems
injecting side effects (old unpatched kernel, etc. pp. ... we're dev,
not op; it's enterprise level and decisions made in the past are our
constraints, carved in stone, more to say?)
Anyway: our Plone with ZODB/RelStorage performs very, very well - with
this many objects stored, including all the garbage!
BUT... we want to get rid of the garbage left behind on every import.
RelStorage ships with a packing script. Running it on the ~300
million objects resulted in RAM consumption of >20GB with an estimated
300 days to completion (we did not wait to verify this). Phew.
So something had to be done. Storage usage grew by about 15-20GB per
week. As a quick fix (it took some days) we decided to try classical
Data.fs packing (zeopack).
We blocked content managers from editing, stopped the imports, and
cloned a snapshot of the virtual machine with the DB.
At this point we had roughly 160GB in Postgres.
On the clone we converted the RelStorage to a FileStorage. This
took 40 hours.
The storage shrank to a 55GB Data.fs.
Then we started the classical packing, and after some hours the DB was
shrunk to a 7.8GB Data.fs, still containing 44 million objects.
Next we converted the Data.fs back to RelStorage (~6 hours) and switched
live over to the packed DB.
Packing the 44-million-object RelStorage with the default script would
take several days (3-4) for the prepacking phase alone. It consumes
less memory (~5GB) but is still insane.
New Script
----------
After analyzing the current (generic - MySQL, PostgreSQL, Oracle -
history and history-free) way of packing the DB I found no good way to
optimize the code there. Others were there before me.
Trial 1:
So I rethought the way packing is done and tried to follow the
copy-and-switch method FileStorage uses: copy all in-use zoids from
object_state to a new table (traversing the graph starting at zoid 0),
drop the old object_state, and rename the new table to object_state.
The copy process can be distributed: several workers can run in
parallel until the database is at its maximum transactions/second.
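A stripped-down, single-process sketch of that idea, assuming a
hypothetical pack_object_state table and leaving out the worker/queue
machinery (the real attempt kept the queue in the DB):

    import psycopg2
    from ZODB.serialize import referencesf  # outgoing oids of a pickle
    from ZODB.utils import u64              # 8-byte oid -> integer zoid

    conn = psycopg2.connect("dbname=zodb")  # assumed DSN
    cur = conn.cursor()

    # empty copy with the same structure as object_state
    cur.execute("CREATE TABLE pack_object_state "
                "(LIKE object_state INCLUDING ALL)")

    todo = [0]        # traversal starts at the root object, zoid 0
    seen = set(todo)  # the real script kept this queue in the DB
    while todo:
        zoid = todo.pop()
        # copy the reachable row over and read its pickle in one pass
        cur.execute("INSERT INTO pack_object_state "
                    "SELECT * FROM object_state WHERE zoid = %s "
                    "RETURNING state", (zoid,))
        row = cur.fetchone()
        if row is None:
            continue  # dangling reference
        for oid in referencesf(bytes(row[0])):
            child = u64(oid)
            if child not in seen:
                seen.add(child)
                todo.append(child)

    # the switch: replace the old table with the packed copy
    cur.execute("DROP TABLE object_state")
    cur.execute("ALTER TABLE pack_object_state RENAME TO object_state")
    conn.commit()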
I had this working: 4-6 workers running, a queue with zoids to process
in the DB, a master controlling the process, and so on. But even with
this approach, the total time for a pack of the whole DB is, after SQL
optimizations(!), roughly 8 days for a 44-million-object graph.
OK - back to the drawing board.
Trial 2:
On Thursday evening I had a good conversation (and some beer) with a
buddy from the local Linux user group and we discussed the problem. The
outcome was to try it with reference counting.
The idea is simple (a sketch follows below):
- iterate over all transactions, starting with the lowest
transaction id (tid)
- for each transaction, load the object states connected with that tid
- for each state, fetch its outgoing references and fill a table where
all incoming references of an object are stored as an array;
if a state has no outgoing references, write it to the table anyway,
so it gets a row (with, initially, no incoming references)
After reaching the highest tid it's easy: all entries with no incoming
references are garbage. Delete them and remove their zoids from the
incoming references of all other entries.
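A minimal sketch of the table and the build loop - object_inrefs is a
hypothetical name of mine, not the actual relstorage_packer schema; the
pickle scanning uses ZODB's referencesf, the rest is plain psycopg2:

    import psycopg2
    from ZODB.serialize import referencesf
    from ZODB.utils import u64

    conn = psycopg2.connect("dbname=zodb")
    cur = conn.cursor()

    # bookkeeping table: one row per object, the zoids of all objects
    # pointing at it kept as an array
    cur.execute("CREATE TABLE IF NOT EXISTS object_inrefs ("
                "    zoid   BIGINT PRIMARY KEY,"
                "    inrefs BIGINT[] NOT NULL DEFAULT '{}')")

    def add_incoming(target, source):
        # append source to target's incoming-reference array, creating
        # the row first if target has not been seen yet
        cur.execute("UPDATE object_inrefs "
                    "SET inrefs = inrefs || %s::bigint "
                    "WHERE zoid = %s", (source, target))
        if cur.rowcount == 0:
            cur.execute("INSERT INTO object_inrefs (zoid, inrefs) "
                        "VALUES (%s, %s)", (target, [source]))

    # server-side cursor: millions of rows must not be fetched at once
    scan = conn.cursor(name="refscan")
    scan.execute("SELECT zoid, state FROM object_state "
                 "ORDER BY tid, zoid")
    for zoid, state in scan:
        # ensure the object itself gets a row, even if it references
        # nothing
        cur.execute("SELECT 1 FROM object_inrefs WHERE zoid = %s",
                    (zoid,))
        if cur.fetchone() is None:
            cur.execute("INSERT INTO object_inrefs (zoid) VALUES (%s)",
                        (zoid,))
        for oid in referencesf(bytes(state)):
            add_incoming(u64(oid), zoid)
    conn.commit()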
When new transactions happen it's enough to only process the
delta (transactions newer than the last one processed into the
reference counting table). This is pretty fast if the delta is small,
i.e. one day only. On the deltas another check is needed to detect
references gone in the meantime, but since the delta isn't that big
this is probably not a problem.
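An incremental run could then look roughly like this (reusing the
imports and the add_incoming helper from the sketch above; again just
an illustration, not the actual code):

    def process_delta(conn, last_tid):
        # incremental run: only look at states written after the last
        # processed transaction id
        cur = conn.cursor()
        scan = conn.cursor(name="deltascan")
        scan.execute("SELECT zoid, tid, state FROM object_state "
                     "WHERE tid > %s ORDER BY tid, zoid", (last_tid,))
        for zoid, tid, state in scan:
            new_out = [u64(oid) for oid in referencesf(bytes(state))]
            # record zoid as an incoming reference of everything it
            # now points to (duplicates are ignored in this sketch)
            for target in new_out:
                add_incoming(target, zoid)
            # references gone meanwhile: drop zoid from the arrays of
            # objects the new state no longer points to
            cur.execute("UPDATE object_inrefs "
                        "SET inrefs = array_remove(inrefs, %s) "
                        "WHERE %s = ANY(inrefs) "
                        "AND NOT (zoid = ANY(%s::bigint[]))",
                        (zoid, zoid, new_out))
            last_tid = max(last_tid, tid)
        return last_tid  # persist this for the next incremental run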
I have implemented this roughly by now. Building the incoming reference
counting table from scratch takes less than 7 hours. The deletion itself
is not implemented yet, but that is not that difficult :-)
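For what it's worth, a rough sketch of how I imagine the deletion sweep
(object_inrefs is my hypothetical table from above; repeated passes,
because removing one object can orphan others):

    def sweep(cur):
        # repeat until a pass deletes nothing: removing an object may
        # leave its children without incoming references, so they are
        # caught in the next round
        while True:
            cur.execute("SELECT zoid FROM object_inrefs "
                        "WHERE inrefs = '{}' AND zoid <> 0")
            garbage = [row[0] for row in cur.fetchall()]
            if not garbage:
                break
            # drop the garbage objects and their bookkeeping rows
            cur.execute("DELETE FROM object_state "
                        "WHERE zoid = ANY(%s)", (garbage,))
            cur.execute("DELETE FROM object_inrefs "
                        "WHERE zoid = ANY(%s)", (garbage,))
            # strip the deleted zoids from the remaining arrays
            cur.execute("UPDATE object_inrefs "
                        "SET inrefs = ("
                        "    SELECT coalesce(array_agg(r), '{}') "
                        "    FROM unnest(inrefs) AS r "
                        "    WHERE r <> ALL(%s::bigint[])) "
                        "WHERE inrefs && %s::bigint[]",
                        (garbage, garbage))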
Code
----
The current code is at:
https://github.com/bluedynamics/relstorage_packer
Did I miss something? Any opinions much appreciated!
Expect updates in this thread :)
Jens Klein
--
Klein & Partner KG, member of BlueDynamics Alliance