Profiling has shown that the slowest part of the module is audio/video broadcasting. That was pretty obvious even before profiling, but now I've got the numbers. To remove the bottleneck, the output buffering model had to be slightly modified. Until now, each output chain buffer (chunk) had its own reference counter. That approach required re-zipping the chain (re-creating the chain links) for each client, which meant additional per-client chain link allocations.
The new approach is much simpler. Each client has a fixed-size output queue of chain references. Neither the buffers nor the chain links are modified after creation. The reference counter is now associated with the whole chain and is stored in the byte at offset -1 from the chain link, i.e. immediately before the link in memory.
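To make the layout concrete, here is a minimal sketch of the idea, not the module's actual code. All names (`chain_t`, `client_t`, `chain_ref`, `OUT_QUEUE` and the helper functions) are made up for illustration. The sketch also assumes a full `size_t` slot for the counter instead of a single byte, simply to keep the link properly aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical chain link: never modified after creation. */
typedef struct chain_s {
    const unsigned char *data;
    size_t               len;
    struct chain_s      *next;
} chain_t;

/* The counter for the whole chain lives in the slot just before the
 * head link (assumption: a size_t slot, chosen here for alignment). */
#define chain_ref(c) (((size_t *) (c))[-1])

/* Allocate the head link together with its hidden counter slot. */
static chain_t *chain_alloc_head(const unsigned char *data, size_t len)
{
    size_t *raw = malloc(sizeof(size_t) + sizeof(chain_t));
    if (raw == NULL) {
        return NULL;
    }
    raw[0] = 1;                          /* creator holds one reference */
    chain_t *c = (chain_t *) (raw + 1);
    c->data = data;
    c->len = len;
    c->next = NULL;
    return c;
}

static chain_t *chain_acquire(chain_t *c)
{
    chain_ref(c)++;
    return c;
}

/* Drop one reference; returns 1 if the whole chain was freed. */
static int chain_release(chain_t *c)
{
    assert(chain_ref(c) > 0);
    if (--chain_ref(c) > 0) {
        return 0;
    }
    /* Trailing links are assumed to be plain malloc'd chain_t objects. */
    for (chain_t *n = c->next; n != NULL; ) {
        chain_t *next = n->next;
        free(n);
        n = next;
    }
    free((size_t *) c - 1);              /* head link plus counter slot */
    return 1;
}

#define OUT_QUEUE 4                      /* hypothetical queue size */

/* Per-client fixed-size ring of chain references: no per-client links. */
typedef struct {
    chain_t *out[OUT_QUEUE];
    size_t   head;                       /* next chain to send */
    size_t   tail;                       /* next free slot */
    size_t   count;
} client_t;

/* Queue a shared chain for this client; -1 means the queue is full. */
static int client_push(client_t *cl, chain_t *c)
{
    if (cl->count == OUT_QUEUE) {
        return -1;
    }
    cl->out[cl->tail] = chain_acquire(c);
    cl->tail = (cl->tail + 1) % OUT_QUEUE;
    cl->count++;
    return 0;
}

/* Called once the front chain has been sent; -1 if nothing is queued. */
static int client_pop(client_t *cl)
{
    if (cl->count == 0) {
        return -1;
    }
    chain_release(cl->out[cl->head]);
    cl->head = (cl->head + 1) % OUT_QUEUE;
    cl->count--;
    return 0;
}
```

In this sketch, broadcasting a chain to N clients costs N counter increments and N pointer stores, with no allocation on the per-client path. What happens when a client's queue is full (drop the frame, disconnect the client, etc.) is left to the caller.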
The work has been committed to the optsend branch. It will be merged into the master branch shortly.