Well, frankly I can’t know that. But I know what I did.
Due to some defects in the standard Java RMI protocol, I decided to write my own custom protocol. It must be noted that I didn’t do this completely from scratch; the concept is based on earlier custom wrappers around normal RMI that allowed passing of session identity. However, those hacks added extra complexity on top of RMI, and actually made the whole use of RMI almost irrelevant.
Defects in RMI:
- Too much black-box behavior in socket handling
- No support for transparent session identity passing
- No real support for data stream compression
- ObjectStream overhead
- Infamous RMI ping overhead (e.g. the RMI newConnection() method performs an unnecessary Ping/PingAck round trip)
- Limited customizability of the protocol
- Not possible to use a ”proxy” server for handling requests
In other words, I tried to address all of those issues, and the end result is this:
For testing, start TestServer and after that launch TestClient.
Notes:
- Compression alone doesn’t improve much
- A ”no desc” ObjectStream makes a significant difference. For the uninitiated: a ”no desc” object stream is a special variant of object stream which doesn’t write/read class descriptors into the stream. Skipping them is possible when version compatibility of streams is not required, and in RPC calls that is generally the case, so the optimization makes sense (see the sketch after this list).
- The implementation is only around 1500 lines. Scratch that: after a few further enhancements, the size bloated to 2000+ lines. However, the jar for the whole implementation is only 34 KB. In the ever-increasing bloat, the current figures are 4500 lines and 45 KB. Compression relies on the GZIPOutputStream ”syncFlush” flag (so JDK 7 is required). Nope, compression now relies on direct use of Deflater/Inflater, so no such dependency.
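To illustrate the ”no desc” idea mentioned above, here is a minimal sketch of such a stream pair (class names are mine for illustration, not the ones used in the actual implementation); it assumes both ends have exactly the same classes available:
[code]
import java.io.*;

// Writes only the class name instead of the full class descriptor.
class NoDescObjectOutputStream extends ObjectOutputStream {
    NoDescObjectOutputStream(OutputStream out) throws IOException {
        super(out);
    }

    @Override
    protected void writeClassDescriptor(ObjectStreamClass desc) throws IOException {
        writeUTF(desc.getName());
    }
}

// Rebuilds the descriptor locally from the class name.
class NoDescObjectInputStream extends ObjectInputStream {
    NoDescObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected ObjectStreamClass readClassDescriptor()
            throws IOException, ClassNotFoundException {
        return ObjectStreamClass.lookup(Class.forName(readUTF()));
    }
}
[/code]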
Future Improvements:
Now, since I have full control of the whole stack on both client and server, I shall next investigate a few issues:
a) Proxy server
A proxy server allows opening multiple different kinds of connections via a single server socket. Since I have control of the server side, it should be trivial to add support for this. Thus there is no need to have a dedicated socket for the call server; a shared server socket can be used instead.
==> This doesn’t require anything special. The only thing needed is that the proxy server detects the protocol and then launches new ServerHandler(server, socket).start(); for the received socket. The referred ”server” is naturally the appropriate CallServer instance, which doesn’t need to be started, since the handler only needs the registry from the server.
On the client side, the logic is naturally based on a custom CallClientSocketFactory, which writes the relevant ”protocol” identity into the stream when creating a new socket. The protocol identity is needed only when creating a new socket, since after that the socket is handled by the ServerHandler on the server side.
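A rough sketch of what the proxy accept loop could look like, assuming the ServerHandler(server, socket) constructor and CallServer class mentioned above, plus a hypothetical one-byte protocol id written by the client socket factory:
[code]
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

class ProxyServer implements Runnable {
    static final int PROTOCOL_CALL = 0x42;   // hypothetical protocol id byte

    private final ServerSocket serverSocket;
    private final CallServer callServer;     // not started; only its registry is needed

    ProxyServer(ServerSocket serverSocket, CallServer callServer) {
        this.serverSocket = serverSocket;
        this.callServer = callServer;
    }

    @Override
    public void run() {
        while (!serverSocket.isClosed()) {
            try {
                Socket socket = serverSocket.accept();
                // First byte is written by the client socket factory when the socket is created.
                int protocol = socket.getInputStream().read();
                if (protocol == PROTOCOL_CALL) {
                    new ServerHandler(callServer, socket).start();
                } else {
                    // ... hand the socket to whatever other protocol shares this server socket ...
                    socket.close();
                }
            } catch (IOException e) {
                // log and keep accepting
            }
        }
    }
}
[/code]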
b) Optimized session identity passing
Since the session identity doesn’t change on every call (or the protocol can at least assume so), it’s possible to optimize the protocol to not pass the session id on every call. The identity needs to be passed from client to server only when it changes. For the rest of the time, the server can cache the identity and reuse it for the same socket. This idea rests on the assumption that a single socket is bound to a single client, which has a single identity. If a client JVM has multiple sessions, it just needs multiple CallClient instances to manage them.
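As a sketch of the wire-level idea (class names are illustrative, not the framework’s actual classes), the client could send a one-boolean flag telling whether a session id follows or the cached one should be reused:
[code]
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Client side: send the session id only when it differs from the last one sent.
class SessionIdWriter {
    private String lastSentSessionId;

    void writeSessionId(DataOutputStream out, String sessionId) throws IOException {
        if (sessionId.equals(lastSentSessionId)) {
            out.writeBoolean(false);       // server should reuse the id cached for this socket
        } else {
            out.writeBoolean(true);        // id changed: send the new value
            out.writeUTF(sessionId);
            lastSentSessionId = sessionId;
        }
    }
}

// Server side: one cached id per socket / handler.
class SessionIdReader {
    private String cachedSessionId;

    String readSessionId(DataInputStream in) throws IOException {
        if (in.readBoolean()) {
            cachedSessionId = in.readUTF();
        }
        return cachedSessionId;
    }
}
[/code]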
==> After applying this optimization, it is clear that compression causes problems with small calls. For calls with small parameters, a compressed call costs more. The problem for the framework is of course that with the ”stream” approach it’s not possible to deduce beforehand whether compression would make sense or not. I.e. if the call size is over 1000 bytes then compression may make sense, but not below that. This could be handled by having an intermediate ByteArrayOutputStream (*) to collect the data first and transfer only that over the network (with compression if needed). Notice this is getting close to improvement (c).
(*) Please note that I have experimented with such an approach earlier with normal RMI. However, due to the lack of overall control I had to give up the concept there, since that compression logic caused too much extra overhead for small calls which wouldn’t be compressed. Managing the compression buffers was also problematic. Because of that, in the end I applied special RMI-compatible compressed IO streams, but such a ”compress all” solution caused its own issues, so I was never very satisfied with it.
==> After applying the BufferCall/BufferResult logic, the balance between compression vs. no compression seems to work nicely. If the call size goes over a certain threshold it’s compressed, and the same applies on the receiver side. The overhead of the process is nicely balanced so that an extra memory buffer is used only on the sender side for encoding the intermediate data block. On the recipient side the overhead is minimal, since a GZip stream can simply be attached to the input coming from the socket. After this, the difference in overhead between ”buffer” and ”stream” was very minimal, except that ”buffer” manages compression on demand without constant compression overhead. Also, since the intermediate encoding buffer is reused, the overhead should be rather minimal. The only thing that might be necessary is to reset the buffer to null in case its size becomes too big (e.g. 100k or so).
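A minimal sketch of the threshold idea (class name and threshold are illustrative; the real BufferCall/BufferResult handling differs in details such as keeping the receive side streaming):
[code]
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.zip.Deflater;

class BufferCallWriter {
    private static final int COMPRESS_THRESHOLD = 1000;   // bytes
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final Deflater deflater = new Deflater(Deflater.BEST_SPEED);

    void writeCall(DataOutputStream out, Object call) throws IOException {
        // Encode the call into the reused intermediate buffer first.
        buffer.reset();
        ObjectOutputStream oos = new ObjectOutputStream(buffer);
        oos.writeObject(call);
        oos.flush();
        byte[] raw = buffer.toByteArray();

        if (raw.length > COMPRESS_THRESHOLD) {
            // Big enough to be worth compressing.
            deflater.reset();
            deflater.setInput(raw);
            deflater.finish();
            ByteArrayOutputStream packed = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            while (!deflater.finished()) {
                packed.write(chunk, 0, deflater.deflate(chunk));
            }
            out.writeBoolean(true);              // compressed flag
            out.writeInt(packed.size());
            packed.writeTo(out);
        } else {
            out.writeBoolean(false);
            out.writeInt(raw.length);
            out.write(raw);
        }
        out.flush();
    }
}
[/code]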
c) Support ”message” style calls
Currently the API utilizes only stream-style calls. However, adding a basic ”message” call variant should be doable. Due to some assumptions, the logic cannot be purely message based; rather it means sending messages encoded as byte arrays over the socket instead of writing an ObjectOutputStream directly into the socket.
Supporting this would allow utilizing the most compact possible encoding for special calls. Implicit support for ”message” objects could also be handled in mSessionId and mParams (in StreamCall): if an object implements a certain encoding API, it wouldn’t be encoded as an ”Object” but as a ”message”. This could allow even more compact encoding than the ”no desc” object stream in special cases.
==> After some consideration I’m ignoring this for a while. The reason is that mixing different serialization types won’t bring that much benefit compared to the generic nature of Serializable/Externalizable.
d) Improved error handling
The initial logic tries to do reasonable error control, but there might be a need to improve the ACK logic in the communication. In particular, the client should be able to differentiate between an initial call handshake failure and a failure later on in the call. The issue is that if the call fails during initial handshaking, the error is retriable: it can be retried by killing the TCP connection and starting a new one. However, if the failure occurs later, after writing the actual call into the socket, then the call might not be retriable, since the server may have already executed the actual call on the server side.
==> Added an ACK on the server side, sent after the call has been successfully received but not yet invoked. This should allow the client side to know whether a communication error is retriable or not.
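In code, the distinction could look roughly like this (all names here are hypothetical, not the framework’s API); the key point is that an IOException seen before the ACK is known to be retriable:
[code]
import java.io.*;

class AckingCaller {
    static final byte ACK = 0x06;

    static class RetriableCallException extends IOException {
        RetriableCallException(Throwable cause) { super(cause); }
    }

    static Object invoke(DataOutputStream out, DataInputStream in, byte[] encodedCall)
            throws IOException, ClassNotFoundException {
        boolean ackReceived = false;
        try {
            out.writeInt(encodedCall.length);
            out.write(encodedCall);
            out.flush();
            // Server sends ACK once the call is fully received, before invoking it.
            if (in.readByte() != ACK) {
                throw new IOException("protocol error: missing ACK");
            }
            ackReceived = true;
            return new ObjectInputStream(in).readObject();   // read the call result
        } catch (IOException e) {
            if (!ackReceived) {
                // Server never got the whole call: safe to kill the TCP connection and retry.
                throw new RetriableCallException(e);
            }
            // Server may already have invoked the call: not safe to retry blindly.
            throw e;
        }
    }
}
[/code]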
e) Asynchronous Batch Calls
An important benefit of the ”message” style call logic is that it would easily allow call batching, i.e. passing multiple calls to the server side in a single request. Doing so is a standard trick in high-throughput systems, since there is always a certain overhead for every separate TCP packet sent over the network. If multiple calls can be squeezed into a single TCP/IP packet, the throughput of simple ”asynchronous” calls can increase significantly. Of course this pattern doesn’t help much if the client side needs to wait for results anyway.
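As a sketch of the batching idea (purely illustrative; nothing like this exists in the implementation yet): queue asynchronous calls and flush them to the server as one request:
[code]
import java.util.ArrayList;
import java.util.List;

class AsyncCallBatcher {
    private final List<Object> queued = new ArrayList<>();
    private final int maxBatchSize;

    AsyncCallBatcher(int maxBatchSize) {
        this.maxBatchSize = maxBatchSize;
    }

    synchronized void submit(Object call) {
        queued.add(call);
        if (queued.size() >= maxBatchSize) {
            flush();
        }
    }

    synchronized void flush() {
        if (queued.isEmpty()) {
            return;
        }
        List<Object> batch = new ArrayList<>(queued);
        queued.clear();
        send(batch);
    }

    private void send(List<Object> batch) {
        // ... encode the whole batch and write it to the server as a single request ...
    }
}
[/code]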
f) Socket restart
Depending on the application and how server addresses are used, it needs to be possible to restart client/server sockets when the server address or the actual server IP binding changes.
==> Implemented logic to allow restarting client handlers and the server socket. Left out logic for restarting server-side handlers, since restarting those generally isn’t needed (socket closing on the client side should eventually discard them gracefully).
g) Improved defaults
For a basic application, it would be better if the application didn’t need to implement various factories.
==> Added default factories for both the call client and the server. Thus a basic application working with plain sockets doesn’t need anything special.
h) Improve memory usage
Profiling showed that while IO is basically efficient, there is some overhead from creating the IO streams separately for every call. I should investigate how to reduce this, since it eventually causes extra gc() overhead in the JVM. Deflater/Inflater streams might be the first target here, since object IO streams are somewhat trickier to manage in this respect.
For Deflater/Inflater, it could be possible to use explicit instances instead of GZip streams, or to implement custom deflater/inflater streams, which would avoid the issue with GZip streams.
For object streams, the issue is trickier, since there is always a small overhead in creating these streams. However, reusing the streams is not so easy, since they have no API for that. Basically there is only ObjectOutputStream.reset(), which writes a special ”reset” code into the stream, which is then parsed by the receiver side. Clearly there are a few issues with that in remote calls.
When utilizing BufferCall, it might be possible to see if a trick with ObjectOutputStream.reset() and DirectByteArrayObjectOutputStream.reset() would allow reusing the object streams.
Thus the logic would be:
- Write into an ObjectOutput backed by a byte buffer
- Flush the ObjectOutput
- Note down the byte-buffer size
- Reset the ObjectOutput and flush it
- Write the byte buffer (up to the size recorded earlier) into the socket, possibly using a Deflater
- Reset the byte buffer
In theory, the reused ObjectOutput would then ”believe” it starts from a clean state after the reset, and the subsequent write would begin again from the beginning of the byte buffer. And of course a similar trick should be done on the server side.
==> The theory wasn’t quite up to the actual implementation. Adding reused Deflater/Inflater instances was completely feasible. However, attempting to reuse ObjectOutputStream and ObjectInputStream proved to be tricky. Basically I tried the hack described above, but eventually the streams got out of sync, triggering an EOFException when reading data in the ObjectInputStream. It’s possible that there was a bug in my logic, but it’s also possible that object streams simply don’t like this kind of trickery.
This leaves open the option of discarding the BufferCall concept and relying only on StreamCall, which might allow wrapping the IO with permanent object streams (combined with GZip) between client and server. The caveat of that is that compression becomes inefficient, since for small calls compression actually increases network traffic. However, the solution would be (if it worked) more beautiful than this deflater/inflater reuse trick in the buffer call (just look at the code to see how much more complex the logic became after applying direct compression usage). Regardless, the downside of this approach is likely too big; basically it binds the whole logic to be strictly stream based, which is exactly what I’ve wanted to avoid.
Okay… figured out what was wrong with the object stream reuse. The problem is that after ObjectOutputStream.reset() it’s necessary to write an extra dummy byte into the stream, and read it back on the input stream side. If that is not done, the streams go out of sync, since TC_RESET isn’t read from the stream unless there is some data after it.
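As a simplified sketch of the working write/read pattern (the actual implementation goes through the intermediate byte buffer as well, which is omitted here):
[code]
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

class ReusedStreamExample {
    // oos is a long-lived ObjectOutputStream reused for every call/event.
    static void writeEvent(ObjectOutputStream oos, Object event) throws IOException {
        oos.writeObject(event);
        oos.reset();       // writes TC_RESET so the handle table doesn't grow forever
        oos.writeByte(0);  // dummy byte: forces the reader side to actually consume TC_RESET
        oos.flush();
    }

    // ois is the matching long-lived ObjectInputStream on the receiving side.
    static Object readEvent(ObjectInputStream ois) throws IOException, ClassNotFoundException {
        Object event = ois.readObject();
        ois.readByte();    // consume the dummy byte (and with it the reset marker)
        return event;
    }
}
[/code]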
So reusing streams works now, and it makes a significant difference in gc() behavior (*). Allocations are significantly decreased. However, there is a clear caveat: the stream now always ”wastes” ”4 + 1 + 1 + some padding = 8” bytes on every event. That is clearly some overhead for small events. As an overall cost, though, the savings on the memory allocation side outweigh what is lost in data size, so the net result is still on the plus side.
(*) Before stream reuse, when executing TestClient in an infinite loop, the server side very quickly resorted to growing the JVM Eden space to a ridiculous size (400 MB), which seems like rather nasty behavior. After enabling reuse, that server JVM ”optimization” didn’t occur so quickly.
Optimizing…: Due to the extra overhead in call size, I decided to see if some encoding changes would get back to the original sizes. I applied two tricks: (a) enforce the service and methodId to be ”short” values, and (b) use a compact short/int encoding in the stream. These two tricks brought the call size in the stream very close to the ”original”, i.e. the case before reusing object streams. The trick is that in most cases the size values are small, so encoding them with 1 or 2 bytes in the stream is sufficient. Actually, the ”call size” in TestClient is now a few bytes smaller than in the original non-reused case. However, the ”result size” is a few bytes longer, since the enhanced short/int encoding doesn’t bring much benefit on that side.
Naturally this didn’t come without a price, since yet another layer of complexity was added to the code. The call encoding logic is now pretty far from the original ”clean and simple”.
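For reference, a compact int encoding along these lines can be as simple as a varint: small non-negative values take one or two bytes, larger values more (this illustrates the general technique, not necessarily the exact encoding used here):
[code]
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class CompactInt {
    // 7 data bits per byte; the high bit signals that another byte follows.
    static void write(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int read(DataInputStream in) throws IOException {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = in.readByte() & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
[/code]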
i) Improved connection pooling logic
Client handler pooling is bound to the CallClient instance. I should consider whether a separate pool allowing actual connections to be shared between multiple clients in the same JVM would be reasonable. In an application which creates various call clients (due to sub-system boundaries, etc.) against the same server, such a shared pool would be beneficial. However, such logic would complicate some behavioral logistics, like possibly different factories between call clients.
==> It might be best to leave this as an application problem. I.e. the application can implement its own singleton to pool CallClient instances based on server address/port/etc. characteristics.
j) Client side call stack
Standard Java RMI provides both server and client side call stacks in exceptions that occur on the server side. From a debugging point of view this is extremely important, since lacking the client side stack may make analysis of a stack trace impossible. Thus it’s mandatory to provide logic similar to plain Java RMI to gather the client side call stack and somehow append it to the exception.
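The core of the trick is the same as in plain Java RMI: capture the client side stack at the call site and splice it onto the deserialized server exception. A minimal sketch (method and class names are mine):
[code]
final class StackMerger {
    static <T extends Throwable> T appendClientStack(T serverException) {
        StackTraceElement[] serverStack = serverException.getStackTrace();
        StackTraceElement[] clientStack = new Throwable().getStackTrace();
        StackTraceElement[] merged =
                new StackTraceElement[serverStack.length + clientStack.length];
        System.arraycopy(serverStack, 0, merged, 0, serverStack.length);
        System.arraycopy(clientStack, 0, merged, serverStack.length, clientStack.length);
        serverException.setStackTrace(merged);   // server frames first, then the client call site
        return serverException;
    }
}
[/code]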
Adding server side stacks was actually very easy. While at it, it also allowed skipping a few entries from the trace on the server side: three levels can be dropped, since stack elements for the server side framework internals aren’t really interesting for an application error. The same applies on the client side: skip framework internals for server side exceptions. Overall, the stack traces should thus be slightly improved compared to Java RMI.
Rollback… it was a bad idea to start doing ”stack filtering” on the server side. The reason is that it causes extra CPU load for the server, since Throwable.getStackTrace() creates a clone of the stack trace. With large stack traces this would cause relevant gc() overhead in the exception logic. Because of that I discarded the server side stack filtering logic; its benefit was minor anyway, just skipping 3 levels in the stack trace.
k) Separate TCP and processing
The current implementation binds send/receive/process together. This design has a nasty caveat: it means the design isn’t as robust against connection problems as desired.
To clarify the problem:
Since send/receive/process are bound around a single TCP connection, a failure in the TCP socket immediately means that the results from a sent call cannot be returned to the client. This may not sound like a huge issue, but when working over the internet (i.e. not a local network), TCP connection failure and re-establishment is not such a rare occurrence. So this design can fail miserably if there are any connectivity issues that just happen to be badly timed.
The clear fix for the issue is to separate call creation, transmission and reception from each other. In practice this means implementing point (c) (i.e. message style calls). Luckily, since I practically deprecated the original StreamCall design in favor of BufferCall, part of this is already done. What is missing is the separation of the transmission and processing parts.
Doing this will clearly add yet another layer of complexity to the design, since it requires juggling state between threads. However, it will also have benefits. One clear benefit, IMO, is that the design will allow more conservative resource usage, since multiple calls get multiplexed over a socket quite naturally. It will also reduce issues due to the TCP SO timeout, which the current design requires to be as long as the processing time taken on the server side (you may guess that this poses problems if processing takes hours). With this transfer/process separation, it’s trivial to do some pinging over the sockets to ensure that they really stay alive, without a long TCP SO timeout. An interesting extra benefit is also immediate support for 100% asynchronous calls, where the client doesn’t need to wait for an answer from the server.
On the downside, the design causes extra overhead in the protocol, since every call must now include a ”callId” to allow dispatching results back to the client via separate threads. It also makes managing multiple sockets trickier; in the most trivial approach, a single TCP socket is used to multiplex all calls. BUT that means that if there happens to be a single huge call/result transfer, then all calls/results are delayed by it, which is clearly undesired. On the call side, the client can manage that by automatically opening a new (pooled) connection if there isn’t a free one. But for result transfers on the server side, nothing can really be done, since the server cannot open a socket toward the client, and the client cannot know that it should open another socket. The server cannot notify the client to do so either, since the socket is reserved for the large transfer.
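The callId bookkeeping itself is simple enough; a sketch of the dispatching side (names illustrative, not the framework’s API) could be a map of pending calls completed by the socket reader thread:
[code]
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class CallDispatcher {
    private final AtomicLong nextCallId = new AtomicLong();
    private final Map<Long, CompletableFuture<Object>> pending = new ConcurrentHashMap<>();

    // Sender side: register the call before writing it (with its callId) to the socket.
    long register(CompletableFuture<Object> resultFuture) {
        long callId = nextCallId.incrementAndGet();
        pending.put(callId, resultFuture);
        return callId;
    }

    // Reader thread: a result frame (callId + payload) arrived from the server.
    void dispatch(long callId, Object result) {
        CompletableFuture<Object> resultFuture = pending.remove(callId);
        if (resultFuture != null) {
            resultFuture.complete(result);
        }
    }
}
[/code]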
Very likely the case of ”large transfers” must be left in the scope of ”wishful load balancing”. I.e. the client may open multiple sockets, and hopefully it will open a few of them due to concurrency. Then, even if an occasional large transfer occurs, it won’t block smaller transfers. However, if multiple large transfers occur simultaneously, the logic is clearly out of luck. This could be improved by ensuring on the client side that there is a certain ratio of open connections to active calls, so there is a better chance of avoiding blockage due to a large transfer.
==> Closer investigation of this approach shows that it clearly complicates the design and adds extra overhead. After an initial experiment it’s clear that asynchronous calls must be kept separate from normal calls so that the overhead they require doesn’t affect normal calls. It’s also necessary to manage connections separately, both for sending async calls to the server and for getting results back to the client. Even a simplistic attempt at reusing the same sockets proved to complicate the design needlessly (e.g. race conditions between threads). Having separate logic for async calls shouldn’t be a problem, since by definition ”async” means the caller doesn’t strictly require an immediate call, so batching of calls (and also results) is allowed.
l) Idling and SO timeout
SO timeout handling needs improvement. The current API is limited to having the ”max call timeout” specified via the SO timeout. Ultimately this timeout is defined by the socket factory, which can cause the timeout to be unexpectedly short (also depending on the OS). Clearly the max time for a call shouldn’t depend on the SO timeout.
Another issue is cleanup of resources. In a perfect world the client application would manage resources properly, i.e. clean them up after use. However, in the real world there are problems with that thought: (a) client applications are surely sloppy and won’t always clean up resources properly, and (b) the client (or server) can crash, leaving dangling connections. Clearly the better approach is to have some automatic cleanup for unused connections.
==> Implemented both improved SO timeout handling and idle cleanup of connections on both the client and server side.
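As a sketch of what idle cleanup can look like (illustrative only; the actual bookkeeping in the framework may differ), a background task closes connections not used within the timeout:
[code]
import java.io.IOException;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class IdleConnectionReaper {
    private final Map<Socket, Long> lastUsed = new ConcurrentHashMap<>();
    private final long idleTimeoutMillis;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    IdleConnectionReaper(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
        scheduler.scheduleWithFixedDelay(this::closeIdle,
                idleTimeoutMillis, idleTimeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Call on every send/receive on the connection.
    void touch(Socket socket) {
        lastUsed.put(socket, System.currentTimeMillis());
    }

    private void closeIdle() {
        long cutoff = System.currentTimeMillis() - idleTimeoutMillis;
        lastUsed.entrySet().removeIf(entry -> {
            if (entry.getValue() < cutoff) {
                try { entry.getKey().close(); } catch (IOException ignored) { }
                return true;
            }
            return false;
        });
    }
}
[/code]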
m) Custom serialization formats
The framework has basic support for utilizing arbitrary custom serializers. However, actually doing so requires some enhancements in the logic. To fully support arbitrary object serialization, the core concept is based on standard Java object serialization, but replacing this with another implementation is possible.
Currently, using a custom serialization protocol requires writing new XCall/XResult classes which utilize the custom serialization logic. Doing so, however, isn’t yet possible from application logic, since the call logic isn’t currently extensible.
Considering custom serialization formats (in addition to the custom ”compact” encoding of the standard Java object stream), my primary interest right now is Hessian serialization, and here is some example of its usage. However, the actual benefits can be seen only after profiling with a real application. Also, based on the information available, it isn’t at all certain that it can really serialize all the arbitrary object graphs that standard serialization supports.
There are two alternative paths to try this out: (a) implement support for custom Call classes, or (b) try a custom ObjectStream as a wrapper for this encoding.
==> For comparison, I re-evaluated the benefits of CompactObjectOutputStream / CompactObjectInputStream vs. the standard Java versions, both with deflate compression. With huge, purely binary random data the difference is naturally minimal, but in all other test results the ”compact” streams beat the Java versions hands down. The compact stream is around 2.3x more space efficient. Comparing this to the results stated earlier for Hessian, it sounds like there wouldn’t be much point in even trying such an encoding, since the compact object stream approach already does a pretty compact job.
Why such a result? It’s due to the fact that the compact object stream skips serializing object metadata, since there is no need to do so in network communication (when both the client and server side have the same classes). Also, class names are encoded extremely compactly. Thus the compact stream reduces the overhead of boilerplate data, which is significant with small object graphs. Since Hessian also encodes its own variant of object metadata, there is a good chance that the compact object stream beats it too.
More info: it’s pointless to research Hessian encoding due to this issue with Externalizable support. Since Hessian cannot support Externalizable, further research into it is pointless, as it would restrict what kinds of elements can be serialized. So it goes into the same pool as various other alternative serialization formats in Java: they are usable in cases where the serialized object types are restricted very strongly, or they require their own custom serializers, which is of course impossible to do for classes coming from libraries.
Packaged version of ”Call”
Okay, I changed the repository structure so it’s easy to just generate a jar for the ”Call” API and test it with a suitable system.
For usage examples, please see TestServer and TestClient.
To ease testing, I added Gradle tasks:
[code]
gradle :call:runServer
gradle :call:runClient -Dhost=localhost
[/code]
10.3.2013
Phew! This little project proves, once again, the 80/20 rule. I got an 80% complete implementation over one weekend. But the remaining 20% is raising the performance, robustness, etc. aspects to production quality. And clearly that 20%, which is what’s really needed for ”production” quality, is what takes the time.
It’s also clear that various missing parts only come up through experimenting. When trying things with a small test application, some major issues never surface. But when actually experimenting with a real application, various interesting ”missing” parts come up.
Also, after getting the base implementation ready, the desire to make it better and more polished increases.
I, however, know when to stop. I have to stop when I run out of letters to list new features with. Luckily, of course, I’m Finnish, so I still have spare letters available at the point where English speakers have run into a dead end. Let’s all praise Å, Ä and Ö.