Hi,

For some time now we have had trouble with write performance on two ZOPE sites we have in operation. The first one has been in production since early April, while the second is being opened to the public in a few days. They are both based on our internally developed TopicMap engine (built on top of CMF). (To be OpenSourced when we have some available time to finish it up.)

The problem has been scarily easy to reproduce, and manifests itself whenever multiple people are working (writing articles etc.). As long as only one person is working on the site, it has the expected performance: not very fast, a few seconds per write operation, but acceptable. However, as soon as more people log in and start publishing, we often experience hangs which may last for 10-15 minutes on write operations that usually take only a few seconds to perform. In this situation Zope doesn't respond to any requests. Fortunately, visitors to the first site are served by the cache (squid on the same machine), and reads from ZOPE have ranged from very fast to mostly quick.

Although we do not have any other ZOPE sites with the same write load as these sites, the TopicMap engine has naturally been our main suspect. (Load on the first site is about the same as 'zope.org' but with much higher peaks.) It seemed obvious that something was/is wrong with the design of our TopicMap engine that triggers this. We have been reviewing our own code and made many optimizations that we expected to yield significant speedups. While they did make the site faster on average, write operations still often resulted in long hangs when multiple users were working on either of the sites.

We followed the procedure in http://www.zope.org/Members/4am/debugspinningzope.
The hang is indeed a spinning process (thread), as it always uses 99.9% CPU when Zope stops responding. Attaching to the main process in a hang situation and looking at the responsible thread invariably shows it to be in chunk_free() in libc, somewhere downstream from pickleCache. Zope will usually "unhang" itself after about 10-15 minutes of spinning.

Mr. Kromer's comments on ZOPE and SMP in the 'system requirements' thread a couple of weeks ago gave us a few clues, which we followed. We tried binding the ZOPE instance to one CPU using the affinity patch for Linux 2.4, but that did not help either. We then tried disabling one CPU, suspecting SMP trouble, but still no go.

So slowly ZOPE became the suspect. We tried different numbers of threads: 20, 10, 4, 2... Then we tried running with only one thread, and the "hang" problems vanished! What's more, we were finally getting the performance we expected. It works very well now, even under relatively high load and a lot of write activity.

The hardware and software of the two sites:

  Site 1:                      Site 2:
  A dual Xeon 1.2 GHz          A dual Xeon 1.0 GHz
  2 GB RAM                     512 MB RAM
  RAID 1+5                     RAID 1+5
  CVS ca. ZOPE 2.5.0           ZOPE 2.5.1
  CVS ca. CMF 1.2              CMF 1.2
  Python 2.1.2 (PThreads)      Python 2.1.3
  Linux 2.4.14 (XFS)           Linux 2.4.18 (ext3)

Python is compiled with thread support and large file support on both sites.

The database for site 1 is approx. 450 MB newly packed (ca. 220 000 objects); site 2 isn't live yet and is much smaller (~100 MB packed). We're experiencing the same problems on both sites.

The second site will be going live in a few days. Due to its design and requirements we are not able to cache as much content. This forces us to look for a different solution; the one-thread option won't cut it, since too many requests will have to go outside the cache...
Currently we are running this on a single thread, but we expect that to "kill us" once it is opened to the general public :(

A possible workaround is of course to run two different ZOPE instances with a ZEO backend: one with multiple threads for reads and visits, the other with a single thread for writing and publishing. This is perhaps the ideal solution, but we are loath to make untested changes to the production environment just before going live. Secondly, this feels like a bug in ZOPE or Python, and if it is we would like to track it down.

What we are looking for is information on threading in Python/ZOPE/ZODB, other people's experiences, workarounds, etc.

Regards,
Arnar Lundesgaard
----------------------------------
phone: (+47) 982 38 036
mailto:arnar.lundesgaard(a)creuna.no

Creuna as
Bryggegata 3
NO-0250 Oslo
phone office: (+47) 23 23 88 00
fax: (+47) 23 23 88 50
http://www.creuna.no/