Hi,

[modified slightly from a similar proposal to zope3-dev to match Zope 2's publisher]

I'm writing up a proposal for the ZODB to make even more efficient Blob handling possible. This includes not copying the data from an uploaded file, but using a `link` operation when possible.

However, the HTTPRequest class currently uses the default implementation of the cgi module's FieldStorage. I propose to create a small subclass that overrides the `make_file` method to use `NamedTemporaryFile` instead of `TemporaryFile`, so that the uploaded file is reachable by filename and I can apply a `link` operation to it.

Note: FieldStorage explicitly provides the `make_file` method to allow this kind of overriding.

Does anybody feel like this would be a bad idea?

Christian

-- 
gocept gmbh & co. kg - forsterstraße 29 - 06112 halle/saale - germany
www.gocept.com - ct@gocept.com - phone +49 345 122 9889 7 - fax +49 345 122 9889 1 - zope and plone consulting and development
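For readers who want to see the shape of the proposed override, here is a minimal sketch. The class name `NamedFileFieldStorage` is hypothetical and not taken from the Zope code base; the guarded import is only there because the cgi module was deprecated in Python 3.11 and removed in 3.13, so on newer interpreters the base class is a plain stand-in.

```python
import os
import tempfile

try:
    # Deprecated in Python 3.11, removed from the standard library in 3.13.
    from cgi import FieldStorage
except ImportError:
    FieldStorage = object  # stand-in so the sketch still imports


class NamedFileFieldStorage(FieldStorage):
    """Hypothetical subclass: keep uploaded data in a file that has a name."""

    def make_file(self, binary=None):
        # NamedTemporaryFile exposes the path as .name, so the caller can
        # hard-link the file elsewhere before the temporary name is removed
        # on close().
        return tempfile.NamedTemporaryFile("w+b")
```

The `binary=None` default is kept so the override can be called both with and without an argument. Note that with the default `delete=True` the temporary name disappears on `close()`, which matches the idea that only the hard link in the blob directory should survive.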
What exactly do you mean by 'link'? As in 'soft links'? The uploaded file usually is a temporary file, so you are saying you would create a soft link on the 'blobs' directory to a file in the $TMP directory? Or maybe the other way around?

-- 
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214
On Wednesday, 7 March 2007, at 14:01 -0300, Sidnei da Silva wrote:
> What exactly do you mean by 'link'? As in 'soft links'? The uploaded file usually is a temporary file, so you are saying you would create a soft link on the 'blobs' directory to a file in the $TMP directory? Or maybe the other way around?

No, I'd create a new hard link into the blob directory so the link to the temporary file can go away without making the inode go away. For the purposes of storing blobs the TMP directory should be on the same partition as the blobs directory anyway.

Christian
On Wed, Mar 07, 2007 at 06:31:25PM +0100, Christian Theune wrote:
> No, I'd create a new hard link into the blob directory so the link to the temporary file can go away without making the inode go away. For the purposes of storing blobs the TMP directory should be on the same partition as the blobs directory anyway.

I like the idea, but what will you do if this fails (e.g. the admin has put TMP on a different mount, or we're running on Windows)?

-- 
Paul Winkler
http://www.slinkp.com
Christian Theune wrote:
> No, I'd create a new hard link into the blob directory so the link to the temporary file can go away without making the inode go away. For the purposes of storing blobs the TMP directory should be on the same partition as the blobs directory anyway.

Does this work on Windows?

Martin
On Wednesday, 7 March 2007, at 09:34 -0800, Martin Aspeli wrote:
> Does this work on Windows?

Linking does not work on Windows using the link() function from the os module. I don't know whether Windows has any API for doing this kind of operation.

In any case we can fall back to copying the data (e.g. if the link() call fails), as this is just an optimization.

Please consider looking at my upcoming proposal for discussion. In this thread I'd like to keep the focus on changing the publisher to use NamedTemporaryFile.

Christian

PS: Thanks for the input though.
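The fallback described above (try link(), copy if it fails) could look roughly like the sketch below. The helper name `link_or_copy`, the file name `blob-0001` and the demo directory are assumptions made for illustration; nothing here is the actual ZODB blob code.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Hard-link src to dst; fall back to copying if linking fails."""
    try:
        os.link(src, dst)
        return "linked"
    except (AttributeError, NotImplementedError, OSError):
        # AttributeError/NotImplementedError: no usable os.link() here.
        # OSError: e.g. src and dst live on different filesystems.
        shutil.copyfile(src, dst)
        return "copied"


# Demo: the upload sits in a named temporary file; the "blob directory"
# is just a scratch directory created for this example.
workdir = tempfile.mkdtemp()
blob_path = os.path.join(workdir, "blob-0001")
with tempfile.NamedTemporaryFile("w+b", dir=workdir) as upload:
    upload.write(b"uploaded bytes")
    upload.flush()
    how = link_or_copy(upload.name, blob_path)

# The temporary name is gone now, but the blob still holds the data.
with open(blob_path, "rb") as blob:
    stored = blob.read()
```

Because the result is the same either way, callers never need to know which branch ran; the hard link only saves the time and disk traffic of the copy.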
On 3/7/07, Christian Theune <ct@gocept.com> wrote:
> > Does this work on Windows?
> Link does not work on Windows using the link() function from the os module.
> I don't know whether Windows has any API for doing this kind of operation.

Yes [1] it [2] does [3]. That os.link() is missing on Windows apparently just comes down to the lack of a good soul to contribute it [4]. A good task for a Friday evening.

[1] http://msdn2.microsoft.com/en-us/library/aa363860.aspx
[2] http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-...
[3] http://www.flexhex.com/docs/articles/hard-links.phtml
[4] http://www.thescripts.com/forum/thread537011.html
I was looking through some publisher code and found that the `process_request` method, which takes the request body as a file-like object and processes it as a FieldStorage, runs within the application thread. It would be better if this happened beforehand: it can take up time while a transaction is running and a thread is in use, although it doesn't require any application-specific code.

AFAICT a modified version of FieldStorage would be required to allow line-wise consumption and parsing of the data while it is being uploaded, and to then hand the result to the request instead of stdin. However, the FieldStorage implementation is recursive, and it wasn't obvious to me at first glance how much work it would be to replace it.

Do others also feel it would be a good idea to do this kind of early line-wise processing of request bodies?

Christian

_______________________________________________
Zope-Dev maillist - Zope-Dev@zope.org
http://mail.zope.org/mailman/listinfo/zope-dev
** No cross posts or HTML encoding! **
(Related lists - http://mail.zope.org/mailman/listinfo/zope-announce http://mail.zope.org/mailman/listinfo/zope )
Christian Theune wrote at 2007-3-7 20:09 +0100:
> I was looking through some publisher code and found that the `process_request` method which takes the request body as a file-like object and processes it as a FieldStorage happens within the application thread.
> This would be better if it happened beforehand because it can take up time while a transaction is running and a thread is used although it doesn't require any application-specific code.

In my view, it already happens far too early, because it may raise exceptions, and those exceptions are not handled by the "standard_error_message" that is usually used for error processing, depending on the URL. Therefore, if you move things out, take care to move out only the parts that cannot raise exceptions.

Furthermore, you seem to propose moving work from a worker thread to the IO (i.e. "ZServer") thread. I do not think it is a good idea to put significant work on the IO thread. Note that the IO thread is responsible for handling all IO. When you keep it busy with other tasks, it will not handle IO...

-- 
Dieter
Hi,

On Wednesday, 7 March 2007, at 21:48 +0100, Dieter Maurer wrote:
> In my view, it already now happens far too early, because it may raise exceptions and those exceptions are not handled by the "standard_error_message" usually used for error processing depending on the url.
> Therefore, if you move out things, you should take care that you move out only parts that cannot raise exceptions.

Ah. Interesting point!

> Furthermore, you seem to propose to move work from a worker thread to the IO (i.e. "ZServer") thread. I do not think that it is a good idea to put significant work on the IO thread.
> Note, that the IO thread is responsible to handle all IO. When you keep it busy with other tasks, it will not handle IO...

Right. This optimization is about leveraging the fact that in many situations the upstream bandwidth is *much* lower than the IO bandwidth to the disk.

Another condition that I have (and I think this is the general pattern) is that application threads should be given back to the pool as quickly as possible.

If 5 seconds are spent in the application thread to untangle MIME data which has nothing application-specific about it, and then only 100ms or so in the application itself, I'd say there is a major overhead problem.

Christian
On 3/7/07, Christian Theune <ct@gocept.com> wrote:
> Right. This optimization is about leveraging the fact that in many situations the upstream bandwidth is *much* lower than the IO bandwidth to the disk.
> Another condition that I have (and I think this is the general pattern) is that application threads should be given back to the pool as quickly as possible.
> If 5 seconds are spent in the application thread to untangle mime data which has nothing application-specific about it and then only 100ms or so in the application itself, I'd say there is a major overhead problem.

I had come to the same conclusion a couple of weeks ago, somehow *wink*. Maybe we've been influenced by the same person. :)

So if the uploaded file should be handled neither by the application thread nor by the IO layer, then I guess we need an 'upload handling thread pool' of some sort, whose sole purpose is to handle large incoming requests before they get to the application thread, while still staying outside the async IO layer.

Hopefully something similar could be done for files being sent *out* of the application when they don't need any application processing anymore (i.e., Blobs!).
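To make the 'upload handling thread pool' idea a little more concrete, here is a rough sketch using two pools from the standard library. Everything in it is an assumption for illustration: the pool sizes, the names `upload_pool`, `worker_pool`, `parse_body`, `application` and `dispatch`, and the trivial form-encoded parser standing in for real MIME parsing. It is not how ZServer is structured.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools; the sizes are arbitrary.
upload_pool = ThreadPoolExecutor(max_workers=2)   # parses request bodies
worker_pool = ThreadPoolExecutor(max_workers=4)   # runs application code


def parse_body(raw_body):
    """Stand-in for MIME parsing: split a form-encoded body into pairs."""
    pairs = (item.split(b"=", 1) for item in raw_body.split(b"&") if item)
    return {key: value for key, value in pairs}


def application(form):
    """Stand-in for the application: it only sees already-parsed input."""
    return sorted(form)


def dispatch(raw_body):
    """Called from the IO thread; returns at once with a future."""
    def stage():
        form = parse_body(raw_body)  # slow part runs in the upload pool
        return worker_pool.submit(application, form)
    return upload_pool.submit(stage)


# The outer future resolves to the worker pool's future.
result = dispatch(b"title=report&size=42").result().result()
```

The point of the two stages is that neither the IO thread nor an application worker is occupied while the body is being untangled; whether the extra pool is worth its complexity is exactly what the thread goes on to question.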
This use-case is already covered by implementing ZPublisher.Iterators.IStreamIterator. You can return a stream iterator to Medusa and free the worker thread immediately.

Stefan

On 8 March 2007, at 05:40, Sidnei da Silva wrote:
> Hopefully something similar could be done for files being sent *out* of the application when they don't need any application processing anymore (ie, Blobs!).
-- Anything that, in happening, causes something else to happen, causes something else to happen. --Douglas Adams
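As an illustration of what such a stream iterator amounts to, here is a Zope-free sketch. It assumes the contract boils down to "iterate over chunks and report the total length"; the class name `ChunkedFileIterator` and the chunk size are made up for this example, and the real interface to implement is the `ZPublisher.Iterators.IStreamIterator` named above.

```python
import os
import tempfile


class ChunkedFileIterator:
    """Hypothetical iterator: stream a file in fixed-size chunks."""

    def __init__(self, path, chunk_size=1 << 16):
        self._file = open(path, "rb")
        self._size = os.path.getsize(path)
        self._chunk_size = chunk_size

    def __iter__(self):
        return self

    def __next__(self):
        data = self._file.read(self._chunk_size)
        if not data:
            self._file.close()
            raise StopIteration
        return data

    def __len__(self):
        # Total byte count, so a server could set Content-Length up front.
        return self._size


# Demo with a throwaway file of 200000 bytes.
with tempfile.NamedTemporaryFile("w+b", delete=False) as tmp:
    tmp.write(b"x" * 200000)
iterator = ChunkedFileIterator(tmp.name)
total = len(iterator)
chunks = list(iterator)
os.remove(tmp.name)
```

Since the iterator needs nothing but an open file, it can be consumed outside the application thread once the application has returned it.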
Hi,

On Thursday, 8 March 2007, at 01:40 -0300, Sidnei da Silva wrote:
> I had came to the same conclusion a couple weeks ago, somehow *wink*. Maybe we've been influenced by the same person. :)
> So if the uploaded file shouldn't be handled by the application thread, neither by the IO layer, then I guess we need a 'upload handling thread pool' of some sorts, whose sole purpose is to handle incoming requests that are large before it gets to the application thread while still outside the async IO layer.

I really wonder whether that's necessary. Actually, I'll take a look around the other web frameworks and check how they do their REQUEST processing. Maybe I can learn something from that.

> Hopefully something similar could be done for files being sent *out* of the application when they don't need any application processing anymore (ie, Blobs!).

We already have that. The FileStreamIterator allows you to hand out an iterable that will be used outside the application thread to stream data from.

Christian
On 3/8/07, Christian Theune <ct@gocept.com> wrote:
> I really wonder whether that's necessary.

Yeah, after re-reading Dieter's reply, I sort of wonder how big of a deal that is.

> Actually. I'll take a look around the other web frameworks and check how they do their REQUEST processing. Maybe I can learn something from that.

I suspect you will find that they don't do anything special. You could look at how Apache or Squid does it though.

> > Hopefully something similar could be done for files being sent *out* of the application when they don't need any application processing anymore (ie, Blobs!).
> We already have that. The FileStreamIterator allows you to hand out an iterable that will be used outside the application thread to stream data from.

Right, I had something else in mind, which is serving large data from the ZODB when you're done with the application logic. But that's another story.
Christian Theune wrote at 2007-3-7 22:05 +0100:
> ... If 5 seconds are spent in the application thread to untangle mime data which has nothing application-specific about it and then only 100ms or so in the application itself, I'd say there is a major overhead problem.

But if the IO thread spends 5 seconds, then Zope will be unresponsive for 5 seconds -- for me (and hopefully others, too) a far more critical situation than a (single) worker spending 5 seconds...

The IO (ZServer) thread should only perform minimal work in its "asyncore" callbacks -- each callback should return within a few milliseconds.

I am not arguing against additional threads between the IO thread and the worker threads, just against giving the IO thread significant work (whether or not you consider it application-specific).

-- 
Dieter
Hi,

On Saturday, 10 March 2007, at 07:36 +0100, Dieter Maurer wrote:
> But if the IO thread spends 5 seconds, then Zope will be unresponsive for 5 seconds -- for me (and hopefully others, too) a far more critical situation than a (single) worker spending 5 seconds...
> The IO (ZServer) thread should only perform minimal work in its "asyncore" callbacks -- each callback should return within a few milliseconds.
> My argument does not argue against different threads between the IO thread and the worker threads, just against giving the IO thread significant work (whether or not you consider it application specific).

Ah. Maybe I didn't point this out enough. I was thinking about switching to a "chunk-based" approach to processing the request, so that large file data doesn't have to be processed twice. IMHO a variation of FieldStorage could be implemented that processes the data line by line as it comes in. That would avoid the 5-second delay you get when processing everything at once, and it should not be a problem in the IO thread.

Christian
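The "line by line as it comes in" part can be sketched independently of FieldStorage. The class below only does the buffering: it accepts chunks of arbitrary size and hands out complete lines, each call doing a small, bounded amount of work. The name `LineFeeder` and its methods are hypothetical, and this is deliberately not a multipart parser.

```python
class LineFeeder:
    """Buffer incoming chunks and hand out only complete lines."""

    def __init__(self):
        self._pending = b""

    def feed(self, chunk):
        """Return the complete lines made available by this chunk."""
        data = self._pending + chunk
        lines = data.split(b"\n")
        self._pending = lines.pop()  # trailing partial line, may be empty
        return [line + b"\n" for line in lines]

    def close(self):
        """Return whatever is left once the upload has finished."""
        rest, self._pending = self._pending, b""
        return [rest] if rest else []


# Demo: chunk boundaries fall in the middle of lines.
feeder = LineFeeder()
seen = []
for chunk in (b"--bound", b"ary\r\nContent-Dis",
              b"position: form-data\r\n\r\nva", b"lue"):
    seen.extend(feeder.feed(chunk))
seen.extend(feeder.close())
```

A parser built on top of this would consume the returned lines as they appear, so the cost is spread over the duration of the upload instead of being paid in one block after the last byte arrives.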
Christian Theune wrote:
> Hi,
> [modified slightly from a similar proposal to zope3-dev to match Zope 2's publisher]
> I'm writing up a proposal for the ZODB to make even more efficient Blob handling possible.
> This includes not copying the data from an uploaded file, but using a `link` operation when possible.

I think this is a great idea.

Am I the only person here who immediately associated "link" with the POSIX? Also, am I the only one who read "when possible" as "when on a POSIX system where link is available", in other words, "when not on Windows"? One starts to wonder...

> However, the HTTPRequest class currently uses the default implementation of the cgi module's FieldStorage.
> I propose to create a small subclass to override the `make_file` method to use `NamedTemporaryFile` instead of `TemporaryFile` to allow the file being accessible from a filename so I can apply a `link` operation.

+1

-- 
http://worldcookery.com -- Professional Zope documentation and training
Note that one micro-optimization for PUT requests is to not use a FieldStorage at all, because the body is never mime-encoded anyway in practice.

I have a monkey patch to do this now, which I turned into a patch for the core, but took out because Philipp whined at a sprint once. ;-)

Here's the monkey patch...

    def patch_httprequest_processinputs():
        """
        Patch HTTPRequest.processInputs to not do any processing on a PUT
        request (it's pointless, and foils our on-the-fly encryption, as it
        creates a new tempfile via FieldStorage).
        """
        # note that OTF encryption support only works for PUT requests
        import re
        from ZPublisher.HTTPRequest import HTTPRequest
        oldProcessInputs = HTTPRequest.processInputs

        def newProcessInputs(
            self,
            # "static" variables that we want to be local for speed
            SEQUENCE=1,
            DEFAULT=2,
            RECORD=4,
            RECORDS=8,
            REC=12,  # RECORD|RECORDS
            EMPTY=16,
            CONVERTED=32,
            hasattr=hasattr,
            getattr=getattr,
            setattr=setattr,
            search_type=re.compile('(:[a-zA-Z][-a-zA-Z0-9_]+|\\.[xy])$').search,
            ):
            """Process request inputs

            We need to delay input parsing so that it is done under
            publisher control for error handling purposes.
            """
            method = self.environ.get('REQUEST_METHOD', 'GET')

            if method == 'PUT':
                # we don't need to do any real input processing if we are
                # handling a PUT request. This is an optimization especially
                # because FieldStorage creates an additional tempfile if we
                # allow it to parse the body, and PUT uploads can tend to be
                # large.
                self._file = self.stdin
                return

            return oldProcessInputs(self)

        HTTPRequest.processInputs = newProcessInputs

- C
Good point. I'll re-read the spec and will try to integrate that.

On Wednesday, 7 March 2007, at 22:46 -0500, Chris McDonough wrote:
> Note that one micro-optimization for PUT requests is to not use a FieldStorage at all because the body is never mime-encoded anyway in practice.
On 3/7/07, Philipp von Weitershausen <philipp@weitershausen.de> wrote:
> Am I the only person here who immediately associated "link" with the POSIX? Also, am I the only one who read "when possible" as "when on a POSIX system where link is available", in other words, "when not on Windows"? One starts to wonder...

NTFS does support hard links since version 5, which means Windows 2000+. It does not support hard links to directories though, only soft links (which are called junctions or junction points). The version of NTFS shipped with Vista supports hard links for directories, I believe.
Sidnei da Silva wrote:
> NTFS does support hard links since version 5, which means Windows 2000+. It does not support hard links to directories though, only soft links (which are called junctions or junction points). The version of NTFS shipped with Vista supports hard links for directories I believe.

POSIX systems don't allow hard links to directories, either, in practice::

    $ man ln
    ...
    -d, -F, --directory
        allow the superuser to attempt to hard link directories
        (note: will probably fail due to system restrictions,
        even for the superuser)

Tres.

-- 
===================================================================
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Participants (9): Chris McDonough, Christian Theune, Dieter Maurer, Martin Aspeli, Paul Winkler, Philipp von Weitershausen, Sidnei da Silva, Stefan H. Holek, Tres Seaver