GSoC 2010 Projects

Hi,

I read your review email and it has certainly given me a lot to think about… But first I would like to explain what my proposal actually intends to do.

The basic aim is to modularize the server without any loss in functionality… Why is this necessary? Because, as I understand it, the Achilles’ Heel problem arises because the server uses the same set of resources to handle all of its tasks. As a result, if some task blocks those resources, ALL tasks are affected. If the server is modularized, then under heavy load it can still provide a limited service instead of going down completely.

I understood from your review that you are not sure whether the problem I am trying to solve is really the problem that needs to be solved. I studied the way the Openfire server works, and this is my interpretation of the problem.

You mentioned that:

“it may be useful to take a step back, and merely create conditions that reproduce the error in Openfire, then plot out how the server is responding by looking at process utilization, profiling, and memory growth.”

Yes, that is an extremely valid point; however, I do not have the resources to do the above, and that is why I want to work with you guys… Please understand that my proposal is merely an attempt to solve the problem as I see it, and my analysis is based on my study of the code and my interpretation of how Openfire works. If I am given the chance to work with you guys, my first priority will be to identify and define the problem, and if the real problem turns out to be something else, I will draw up a plan to solve that problem. My proposal is merely a small demonstration of the whole cycle, and I admit that I too have no way to identify the real problem yet. Identifying the real problem is a major step in solving the Achilles’ Heel problem.

At this point I want to ask a small favour of you… Assuming the problem is what I anticipated, does my proposal solve it?

Thanks for the heads-up on the testing and implementation-related issues. I forgot to think about them… Rest assured that I am currently working on a proposal for that too. “In the proposal, there is no mention of how you are planning on setting up a test bench”: working on it…

I appreciate you taking the time to review my proposal… Thanks a lot!

Another note: I feel a little stupid asking this, but should I draft another, more general proposal for this type of problem?

I do not think this requires another proposal. I only mention it so that we have the advantage of having thought about it. The only other thing I can think of is to make sure that your proposal meets all of the GSoC standards, to give you the best shot at getting the project approved.

Hello,

Let me first applaud you on the amount of effort you are putting into this proposal. It clearly shows that you are giving this a lot of thought. I feel that you are generally on the right track with the proposal that you made available at http://gsocopenfire.wordpress.com/2010/04/05/gsoc-proposal/ but I do have a number of reservations.

You need to get this right, given the nature of the problem and the severe impact to the project that the solution is likely to have. There’s most likely no going back after we release code based on your modifications in the source of the project. Apart from this, you’re also petting the belly of my pet project. I feel strongly committed to solving the problems that have been outlined. Given all of these reasons, I will be brutally honest in my comments. None of my comments are intended to scare you away though. Please view them as challenges.

The subject that you are tackling is a very complicated one. I believe that this particular problem calls for *expertise* on a number of subjects, including:

  • Concurrent programming in Java
  • XMPP (specification) theory
  • Intimate knowledge of Openfire’s internals

The proposal that you have drafted shows that you have looked into the issue considerably. It does not (yet) give me the “this guy is an expert that we need” feeling, though. Perhaps this is not fair, as the proposal is (a) a first draft, (b) not a detailed design (it’s not supposed to be) and (c) you’re no Openfire veteran, but as I said: brutally honest. To give you an example: you get into design details in various parts of your proposal draft, describing sensible techniques, but you do not base your descriptions on documented Java patterns or techniques. I’m not sure if you did this intentionally, if I’m missing obvious references (hey, I’m not perfect) or if you are not familiar with these patterns. If the latter is true, then there’s a lot of documentation available to get you going. I can warmly recommend reading “Java Concurrency in Practice” by Brian Goetz. The JCiP book will provide you with a lot (if not most) of the building blocks that you will need during the implementation of whatever solution you come up with. The same author has written a number of interesting articles on IBM’s developerWorks website, in a series titled “Java Theory and Practice.”

I particularly like the first part of your proposal. What I like less is that you seem to be committed to a solution already. Given the inherent complexity of the issue as described above, no one can expect you to have a solution ready. It is good that you have one (or more) suggested ways to tackle this problem, but I feel that your proposal should focus a lot more on finding the solution to the problem than on implementing the solution. Let’s not cut corners! I suggest that you identify alternatives and/or describe how you will identify these alternatives, their plusses and minuses, their risks, impact, and side effects.

This project is going to be a big one. Apart from that, you will run into unforeseen challenges. I would go as far as to say that it is quite unlikely that you (or anyone else) will deliver a finished, revised version of Openfire that completely fixes the Achilles’ Heel problem by the end of the GSoC period. The time is simply too short for that. This is perfectly acceptable, though. What you do need to provide in your proposal is a solid project plan. How will you divide this project into phases or sub-projects? What will you do in each phase? In what order will you execute the phases? Do phases depend on each other? Which phases will you complete during the GSoC period, which ones would you like to complete if there’s time left (yeah, right), and which ones won’t be completed as part of your GSoC project? From that last category, which ones are important to have done afterwards, and which ones are of less importance? And so on, and so on.

As for Slicer’s comment regarding testing, I’m torn. On the one hand, you will need to be able to verify your solution. This is evidently important. On the other hand, I have never seen a full-fledged testing setup that accurately simulates the oddities of a production environment. Developing such setups is almost always a waste of time. I suggest that you focus on verification of your solution (does this or that technique give me the safety/performance/whatever that I require), not on simulation (let’s try to find out what happens if 20.000 people send random stanzas at the same time). Not having resources available to do testing is not an acceptable argument here, though: you have an entire community at your disposal! There will be a number of people happy to help you run tests. You *will* need to prepare those tests and tell us what to do, of course.

On a completely different note: I think you should submit a proposal to the GSoC project page soon. Your potential mentors will be able to read it there (not all of them read IgniteRealtime) and can provide you with valuable feedback before this week’s deadline passes. Feel free to link to this discussion.

In conclusion, I believe that this proposal has a lot of potential. There’s a lot to be done, but I think you will be able to make this work! Please continue!

Thanks for the advice!

I am currently studying concurrent programming as you recommended. I will redraft the proposal as soon as I am confident on the subject.

I will need some additional information on the workings of the Openfire server. I have read the source code and the documentation, but I am not able to fit all of it together. Can you point me to some references?

Do you think my proposal is competent as a GSoC proposal? I have tried to make it compliant with the GSoC requirements as far as possible (BTW, I have applied officially). I will be creating a detailed project plan very soon and updating the proposal accordingly.

“I suggest that you identify alternatives and/or describe how you will identify these alternatives, their plusses and minuses, their risks, impact, and side effects.” … Already working on it!!

It seems that my comment on WordPress got lost. OF-115 already describes a concurrency problem: some messages (login, privacy, MUC, chat, …) must be processed in the right order.

In DOC-1925 you write “” and the solution may be very simple: Increase the number of threads and make sure that there are always idle threads, it’s only a matter of memory and processors.
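The "more threads, always some idle" idea can be expressed with the JDK's `ThreadPoolExecutor`. The sketch below is an illustration only: the class `HeadroomPool`, its pool size and the `hasHeadroom` check are hypothetical, not Openfire code. It pre-starts every thread so capacity exists before load arrives, and it exposes an approximate idle-thread count that a monitor could watch.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: a stanza-processing pool that keeps spare idle threads. */
class HeadroomPool {
    private final ThreadPoolExecutor executor;

    HeadroomPool(int threads) {
        // Fixed-size pool (core == max) with an unbounded queue; sizes are assumptions.
        executor = new ThreadPoolExecutor(threads, threads, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        // Start all core threads up front so idle threads exist before the first stanza.
        executor.prestartAllCoreThreads();
    }

    void submit(Runnable task) {
        executor.execute(task);
    }

    /** Approximate idle threads; getActiveCount() is documented as an estimate. */
    int idleThreads() {
        return executor.getPoolSize() - executor.getActiveCount();
    }

    /** Hypothetical health check: are at least minIdle threads free right now? */
    boolean hasHeadroom(int minIdle) {
        return idleThreads() >= minIdle;
    }

    int poolSize() {
        return executor.getPoolSize();
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

As the post says, this trades memory and CPU for responsiveness; it does not remove a blocking cause, it only makes it less likely that every thread is stuck at once.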

I agree that one will need more than one summer to solve this problem.

I think that it’s hard enough to strip down Openfire to a core XMPP router and then add authentication and privacy list support. That is where the fun begins, as one must process a privacy list change before processing the next incoming message for this connection.
http://gsocopenfire.files.wordpress.com/2010/04/interceptor.gif shows a scheduler for every connection / XML stream with three interceptors. I really wonder whether we have a performance problem processing incoming messages from a single connection. Maybe this makes things worse, as it needs more resources.
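The ordering constraint (a privacy list change must be applied before the next message on the same connection) does not necessarily require a thread per connection. One common technique is a per-key serial executor on top of a shared pool: work for one connection is chained, work for different connections runs in parallel. The class `PerConnectionSerializer` and the use of a string connection ID are hypothetical; this is a sketch, not how Openfire currently does it.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

/**
 * Hypothetical sketch: stanzas from the same connection run strictly in arrival order,
 * while different connections share one thread pool and run in parallel.
 */
class PerConnectionSerializer {
    private final ExecutorService pool;
    // Last scheduled task per connection; the next task is chained after it.
    private final ConcurrentHashMap<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();

    PerConnectionSerializer(ExecutorService pool) {
        this.pool = pool;
    }

    /** Schedules a task; it starts only after the previous task of that connection finished. */
    CompletableFuture<Void> submit(String connectionId, Runnable task) {
        return tails.compute(connectionId, (id, tail) ->
                tail == null
                        ? CompletableFuture.runAsync(task, pool)
                        // handle() swallows a failure of the previous task so the chain continues
                        : tail.handle((ok, err) -> null).thenRunAsync(task, pool));
    }

    /** Forget the chain when the connection closes, so the map does not grow forever. */
    void connectionClosed(String connectionId) {
        tails.remove(connectionId);
    }
}
```

No thread is ever dedicated to a connection here; a connection only occupies a pool thread while one of its stanzas is actually being processed.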

Monitoring the throughput of every connection and the processing time of every message and message type may be more interesting and may give one a clue where performance is lost. Given that processing time is limited, using a lot of thread pools will consume much more of it. So one way to keep Openfire running is to drop client connections or to return errors for certain messages instead of processing them.
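Measuring processing time per message type is cheap to add and is the kind of data that would show where time goes. A minimal sketch, assuming a string key per stanza type (for example "message" or "iq:pep"); the class `ProcessingStats` is hypothetical, and keying by connection ID instead would give per-connection throughput in the same way.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-type processing-time statistics. */
class ProcessingStats {
    private static final class Stat {
        final LongAdder count = new LongAdder();
        final LongAdder totalNanos = new LongAdder();
    }

    private final ConcurrentHashMap<String, Stat> stats = new ConcurrentHashMap<>();

    /** Runs the handler and records how long it took under the given type key. */
    void timed(String type, Runnable handler) {
        long start = System.nanoTime();
        try {
            handler.run();
        } finally {
            record(type, System.nanoTime() - start);
        }
    }

    void record(String type, long nanos) {
        Stat s = stats.computeIfAbsent(type, k -> new Stat());
        s.count.increment();
        s.totalNanos.add(nanos);
    }

    long count(String type) {
        Stat s = stats.get(type);
        return s == null ? 0 : s.count.sum();
    }

    /** Average processing time in nanoseconds, or 0 if nothing was recorded. */
    long averageNanos(String type) {
        Stat s = stats.get(type);
        if (s == null) return 0;
        long n = s.count.sum();
        return n == 0 ? 0 : s.totalNanos.sum() / n;
    }

    /** Averages sorted by type name, e.g. for logging or an admin page. */
    Map<String, Long> snapshot() {
        Map<String, Long> out = new TreeMap<>();
        for (String type : stats.keySet()) out.put(type, averageNanos(type));
        return out;
    }
}
```

`LongAdder` keeps the recording path cheap under contention, which matters if every stanza passes through it.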

LG

Hi LG,

Thanks for reading my proposal. The problem you described is indeed a major one. You suggested that we may need to put every client session in a different thread. However, I’m having some trouble wrapping my head around this. I believe we could implement a serializer (albeit a messy one) in the new setup, as I described in my proposal. Another approach may be to give every session its own thread and at the same time ensure that different types of requests are handled differently. E.g., if the main thread needs a MUC task to be processed, it will simply forward the request to the RequestHandler for MUC (or even spawn child threads). I’m currently working on that and will post a detailed analysis shortly. What do you think would be better: a serializer, request forwarding, or maybe something else? (Suggestions are welcome!)
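The "request forwarding" alternative can be sketched as a router that maps each request type to a handler plus the executor that handler runs on, so a slow MUC handler only occupies MUC threads. Everything here is hypothetical: `RequestRouter`, the string request/response types and the "unsupported type" error text stand in for real stanzas and XMPP error replies.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;

/** Hypothetical sketch: each request type is handled on its own executor. */
class RequestRouter {
    private static final class Route {
        final Function<String, String> handler;
        final ExecutorService executor;

        Route(Function<String, String> handler, ExecutorService executor) {
            this.handler = handler;
            this.executor = executor;
        }
    }

    private final Map<String, Route> routes = new ConcurrentHashMap<>();

    /** Registers a handler for a request type (e.g. "muc") and the executor it runs on. */
    void register(String type, Function<String, String> handler, ExecutorService executor) {
        routes.put(type, new Route(handler, executor));
    }

    /** Forwards the request; the returned future completes with the handler's response. */
    CompletableFuture<String> forward(String type, String request) {
        Route route = routes.get(type);
        if (route == null) {
            return CompletableFuture.completedFuture("error: unsupported request type " + type);
        }
        return CompletableFuture.supplyAsync(() -> route.handler.apply(request), route.executor);
    }
}
```

The forwarding thread does not wait for the result; ordering between requests of the same session is a separate concern that this sketch does not address.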

Another point:

“one way to keep Openfire running is to drop client connections or to return errors for certain messages instead of processing them.” Indeed! And I believe that in a modularized approach, implementing this is a lot easier. E.g., if we have a ‘difficult’ and an ‘easy’ request as part of the same session, then in a modularized approach the server could simply reject the difficult request and execute the easy one.

“the solution may be very simple: Increase the number of threads and make sure that there are always idle threads, it’s only a matter of memory and processors”

No fun in that! Besides, this makes for a great summer project.

Regards,

Anshul

http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/ana28192/t127052208159

My official GSoC proposal… Most of it is copied from my blog, but I’m planning on updating it ASAP with a more pragmatic approach in view of the points discussed above…

Hi Anshul,

Where did I write this: “You suggested that we may need to put every client session in a different thread.”? JiveMessenger and Wildfire used a thread for every client connection. With 2000 clients they got 2000 threads, each with a (default) stack size of 256k, resulting in 0.5 GB of memory used for thread stacks alone. This was one reason to switch to NIO / MINA.
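The 0.5 GB figure follows directly from the numbers above, assuming “256k” means 256 KiB of stack per thread:

```latex
2000 \times 256\,\mathrm{KiB} = 512\,000\,\mathrm{KiB} = 500\,\mathrm{MiB} \approx 0.49\,\mathrm{GiB}
```

The cost grows linearly with the number of connected clients, which is why a thread per connection does not scale the way a shared pool does.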

Every connected client has a connection and thus an open XML stream. Maybe your picture needs to be updated a little bit, or “XML from Socket” should be named “mix of XML messages from all clients”. If this is the case, then one may ask why we first mix the messages and then use a scheduler to feed interceptors.

Conference (MUC), Broadcast and Search are components, so Openfire can simply route packets to them as they are not really in the core. It should be quite easy to create a component that takes a long time to process messages and run it as a plugin. As a component is something “external”, one should place the logic to monitor and possibly skip the component in Openfire, not in the component.
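Keeping the monitoring logic on the server side, as suggested above, might look like the following guard: the router measures each delivery and stops routing to a component after a number of consecutive slow responses, returning an error instead. `ComponentGuard`, the threshold values and the string "packets" are hypothetical; a real implementation would hook into Openfire's routing and produce proper XMPP error stanzas.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Hypothetical router-side guard: the core measures how long each component takes and
 * stops routing to a component after too many consecutive slow responses.
 */
class ComponentGuard {
    private final long slowThresholdNanos;
    private final int maxConsecutiveSlow;
    private final ConcurrentHashMap<String, AtomicInteger> slowStreak = new ConcurrentHashMap<>();

    ComponentGuard(long slowThresholdNanos, int maxConsecutiveSlow) {
        this.slowThresholdNanos = slowThresholdNanos;
        this.maxConsecutiveSlow = maxConsecutiveSlow;
    }

    boolean isDisabled(String component) {
        AtomicInteger streak = slowStreak.get(component);
        return streak != null && streak.get() >= maxConsecutiveSlow;
    }

    /** Delivers via `delivery`, or returns `errorReply` without delivering if disabled. */
    String route(String component, Supplier<String> delivery, String errorReply) {
        if (isDisabled(component)) return errorReply;
        long start = System.nanoTime();
        String reply = delivery.get();
        recordLatency(component, System.nanoTime() - start);
        return reply;
    }

    /** A slow call extends the streak; a fast call resets it. */
    void recordLatency(String component, long nanos) {
        AtomicInteger streak = slowStreak.computeIfAbsent(component, k -> new AtomicInteger());
        if (nanos > slowThresholdNanos) streak.incrementAndGet();
        else streak.set(0);
    }

    /** Re-enables a component, e.g. from an admin console action. */
    void reset(String component) {
        slowStreak.remove(component);
    }
}
```

This only decides whether to route at all; it does not cut a call that is already running short, and the component itself needs no changes.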

“E.g., if we have a ‘difficult’ and an ‘easy’ request as part of the same session, then … the server could simply reject the difficult request and execute the easy one.” Right. As you may have noticed, I removed “in a modularized approach” because it does not really matter whether the server is modularized or not, as long as one knows what is easy and what is difficult. But how does the server know that a difficult message must not be processed? Maybe one can define “maxPepTime=10ms” and when PEP messages take longer than 10 ms then Openfire will return PEP errors.
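A "maxPepTime" limit could be approximated with the standard `Future.get(timeout, unit)`: run the PEP handler on an executor, wait at most the configured time, and cancel it and return an error reply otherwise. `TimeBoundedHandler` and the string error reply are hypothetical, and no such setting is known to exist in Openfire; the post only floats the idea.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * Hypothetical "maxPepTime" guard: a PEP request gets at most maxMillis to finish;
 * otherwise the task is cancelled and an error reply is returned instead.
 */
class TimeBoundedHandler {
    private final ExecutorService executor;
    private final long maxMillis;

    TimeBoundedHandler(ExecutorService executor, long maxMillis) {
        this.executor = executor;
        this.maxMillis = maxMillis;
    }

    String handle(Callable<String> pepHandler, String errorReply) {
        Future<String> future = executor.submit(pepHandler);
        try {
            return future.get(maxMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the handler; it only stops if it reacts to interruption.
            future.cancel(true);
            return errorReply;
        } catch (ExecutionException e) {
            return errorReply;              // the handler itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return errorReply;
        }
    }
}
```

Note that the calling thread still waits up to `maxMillis`; the limit bounds the damage per request rather than eliminating it, which is why choosing the value matters.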

LG

Hi LG,

First, sorry about my comment about running a thread for every user. In http://www.igniterealtime.org/issues/browse/OF-115 you said that:

“So one should make sure that ‘each client has a dedicated thread server side for processing incoming stanzas’, no matter whether it is connected directly (where this is already the case) or via a CM.” I guess what you meant to say was that we have an executor service that assigns threads to the various users out of a thread pool of predetermined size… Is that correct?

Secondly, you mentioned that “Conference (MUC), Broadcast and Search are components, so Openfire can simply route packets to them as they are not really in the core”. This is what I intended my interceptor to be! It simply routes packets based on their type. I proposed to extend this component model to other services such as PEP. I agree it may not be necessary, but don’t you think I should at least try it? Maybe it will improve performance…

“Maybe one can define “maxPepTime=10ms” and when PEP messages take longer than 10 ms then Openfire will return PEP errors.” – I guess such an approach might cause too many errors, and if we let PEP messages run for long, they might eat up too many resources and cripple the server. So maybe we could externalize PEP, give it its own set of resources, and then tune “maxPepTime” until we reach a stage where most PEP requests are met without many blockages. The advantage of externalizing is merely that it gives us more flexibility with “maxPepTime”.
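Giving PEP "its own set of resources" is often called a bulkhead: a dedicated pool with a bounded queue, so an overloaded PEP rejects new work immediately instead of consuming threads shared with the rest of the server. A sketch using the standard `ThreadPoolExecutor` with `AbortPolicy`; the class `PepBulkhead` and the sizes are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical bulkhead for PEP: a dedicated pool with a bounded queue. When both are
 * full, new PEP work is rejected at once so that an error reply can be sent.
 */
class PepBulkhead {
    private final ThreadPoolExecutor pool;

    PepBulkhead(int threads, int queueCapacity) {
        pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());   // throw when pool and queue are full
    }

    /** Returns false if PEP is saturated; the caller would then send an error stanza. */
    boolean trySubmit(Runnable pepTask) {
        try {
            pool.execute(pepTask);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    void shutdownNow() {
        pool.shutdownNow();
    }
}
```

With a bulkhead the thread count and queue capacity become the main tuning knobs, and a per-request time limit can be layered on top.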

“Maybe your picture needs to be updated a little bit”… Too right! I know very little about how the XML from users is handled initially… Could you shed some light on this, please (if it’s not too much trouble)?

One request:

As you may have guessed, I’m new to Openfire… If I’m going to make this work, I will need more knowledge of the internal workings of the server. I have studied the source code and the documentation and have gleaned as much information from them as I could. However, I NEED MORE INFORMATION. Please help in any way you can…

Regards,

Anshul

Based on the points you suggested I propose an alternative approach.

  1. We get the XML from the connection with the user and give it to an executor service that assigns a thread to handle this user’s transactions. (I guess this already exists.)

  2. The various services provided by Openfire are implemented as separate components. The main thread forwards requests to these components and gathers their responses. The components have their own ExecutorServices and database connection pools and ideally operate separately from each other.

  3. A serializer is implemented on the main thread. It checks whether the request is part of a set in which sequence is important (how? … am working on it). If it is, it assigns a serial number to each request and forwards it to the specific component. We have a boolean array with one entry per serial number. Before sending its results to the client, each component checks through the boolean array whether the request with the preceding serial number has already sent its data, and sends its result accordingly. To free the thread after it has processed the request, we may be able to store the result in a queue and have a generic connection sender send the result as soon as the sequence issues are resolved (may be difficult to implement).
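The last step of the list (store results, let a sender release them once earlier serial numbers are done) is essentially a reorder buffer. The sketch below swaps the fixed-size boolean array for a map of finished-but-unsent results, which avoids sizing the array up front; `ResultSequencer` and the `Consumer<String>` sender are hypothetical stand-ins for a real connection writer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/**
 * Hypothetical reorder buffer for one connection: requests get increasing serial numbers,
 * components may finish in any order, and results reach the client in serial order.
 * Worker threads never wait; they hand over their result and return.
 */
class ResultSequencer {
    private final AtomicLong nextSerial = new AtomicLong();
    private final Map<Long, String> pending = new HashMap<>();  // finished, not yet sendable
    private final Consumer<String> sender;                       // e.g. writes to the connection
    private long nextToSend = 0;

    ResultSequencer(Consumer<String> sender) {
        this.sender = sender;
    }

    /** Called when a request arrives; the serial number travels with the request. */
    long assignSerial() {
        return nextSerial.getAndIncrement();
    }

    /** Called by any component when its result for `serial` is ready. */
    synchronized void complete(long serial, String result) {
        pending.put(serial, result);
        // Release every result whose predecessors have all been sent.
        String ready;
        while ((ready = pending.remove(nextToSend)) != null) {
            sender.accept(ready);
            nextToSend++;
        }
    }
}
```

For simplicity the sender is called while holding the lock; a production version would probably hand released results to a separate writer. The open question from the list, deciding which requests need ordering at all, is not addressed here.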

Your thoughts on this? Please reply soon if possible, as the deadline is approaching…

Regards,

Anshul

Hi Anshul,

Maybe you should take a step back. I don’t mean that you should discard this project, but get a little distance from it. Reflect on what you know right now and on whether “Constructive performance enhancements” means that one needs to rewrite Openfire.
We don’t know in detail where the bottleneck(s) are, and thus it may help a lot to identify them first. Then it’s much easier to fix them.

It’s like an application with a memory leak. You can either rewrite the application and hope that the memory leak is then gone or you can try to identify and fix it.

I think of an Openfire “kernel” which does “thread scheduling” and monitors how long things take and which can disable parts of it. That sounds good and may help a lot but it’s like building a new Openfire server. But this is already a “solution” or “revolution” without knowing what was wrong in the old code. And it may take longer than one summer to be done.

Analyzing or recreating the problem should give you very detailed insight into Openfire, and then it should be very easy to think about a possible solution. So I hope that you plan 1-2 weeks to analyze things (you can also write some code in that phase to reproduce the problem) before you modify or write new Openfire code.

From my point of view the technical details how to solve the problem are still unknown so you may want to avoid mentioning them.

LG

Hi LG,

Thanks for the tip. I have modified my proposal and tried to make it as general as possible.

http://socghop.appspot.com/gsoc/student_proposal/show/google/gsoc2010/ana28192/t127052208159

I am currently trying to think of a reliable testing scheme… Any pointers?

Hi,

there are some reports about performance which may or may not be related to the Achilles’ heel problem:

Jive Software and others used Tsung for load testing; however, it seems that the configuration files are not public.

LG

Beware! I am under the impression that you are missing an important point.

anshul.singhle wrote:

Secondly, you mentioned that “Conference (MUC), Broadcast and Search are components, so Openfire can simply route packets to them as they are not really in the core”. This is what I intended my interceptor to be! It simply routes packets based on their type. I proposed to extend this component model to other services such as PEP.

This quote leads me to believe that you are unfamiliar with the concept of an XMPP Component. You seem to speak of a component (or “module” or “part-of-a-system”) in a generic context. LG was referring to XMPP Components. These are XMPP entities that are addressable. As they are addressable, the Openfire routing routines can handle stanzas that are sent to them in a very similar way as it handles stanzas that are sent to users (which are simply another type of XMPP entity).

For parts of Openfire that are XMPP Components, preventing the Achilles’ Heel problem is relatively simple. I have outlined this in the Achilles’ Heel document and I have provided the AbstractComponent class, which is an implementation of that solution.

In some cases, it is not appropriate and/or possible to convert other parts of Openfire to an XMPP Component. Some PacketInterceptor implementations are a notable example. One of the tasks that you could pick up as part of your proposal is figuring out if an “AbstractPacketInterceptor” kind of implementation is of benefit (and if it is, you could provide such an implementation).
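One shape an "AbstractPacketInterceptor" could take is a base class that accepts the callback on the routing thread but runs the subclass's actual work on the interceptor's own pool. Openfire's real `PacketInterceptor` interface has a richer signature (packet, session, direction flags) and can reject packets; to keep the sketch self-contained it uses a hypothetical stand-in interface, `SimpleInterceptor`, with the packet as a string. This pattern only fits interceptors that observe packets (logging, archiving); interceptors that must modify or reject a packet before routing cannot be offloaded this way.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical, simplified interceptor callback (not Openfire's real interface). */
interface SimpleInterceptor {
    void intercept(String packet, boolean incoming);
}

/**
 * Hypothetical base class: subclasses implement process(), which runs on the
 * interceptor's own thread pool, so a slow interceptor cannot stall packet routing.
 */
abstract class AbstractAsyncInterceptor implements SimpleInterceptor {
    private final ExecutorService executor;

    protected AbstractAsyncInterceptor(int threads) {
        this.executor = Executors.newFixedThreadPool(threads);
    }

    @Override
    public final void intercept(String packet, boolean incoming) {
        try {
            executor.execute(() -> process(packet, incoming));
        } catch (RejectedExecutionException e) {
            // Interceptor was shut down; dropping the observation keeps routing unaffected.
        }
    }

    /** The actual (possibly slow) work, executed off the routing thread. */
    protected abstract void process(String packet, boolean incoming);

    public void shutdown() {
        executor.shutdown();
    }
}
```

Because a real packet object is mutable, a real implementation would have to hand a copy to the pool rather than the live packet.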

There’s one issue with a solution that is based on the idea that each type of “functionality provider” will make sure that it runs in a sandbox: we need a guarantee that all of the “functionality providers” that are in use by Openfire are in fact of an implementation-type that provides this kind of sandboxing. You will need to find a solution for this problem.
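One possible way to get that guarantee is to stop relying on providers' implementation type altogether and let the registry enforce the sandbox: every provider is wrapped when it is registered. In this sketch each wrapped provider gets its own single-thread executor (which also keeps per-provider ordering) and its exceptions are contained. `Provider` and `SandboxingRegistry` are hypothetical names for illustration, not Openfire types.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical minimal "functionality provider" contract. */
interface Provider {
    void handle(String packet);
}

/**
 * Hypothetical registry that enforces sandboxing itself: every provider is wrapped at
 * registration time, so the guarantee does not depend on how the provider was written.
 */
class SandboxingRegistry {
    private final List<Provider> providers = new CopyOnWriteArrayList<>();
    private final List<ExecutorService> executors = new CopyOnWriteArrayList<>();

    void register(Provider raw) {
        ExecutorService own = Executors.newSingleThreadExecutor();
        executors.add(own);
        providers.add(packet -> own.execute(() -> {
            try {
                raw.handle(packet);
            } catch (RuntimeException e) {
                // Contained: a failing provider affects neither the caller nor other providers.
            }
        }));
    }

    /** Dispatches to all providers and returns without waiting for any of them. */
    void dispatch(String packet) {
        for (Provider p : providers) {
            p.handle(packet);
        }
    }

    void shutdown() {
        executors.forEach(ExecutorService::shutdown);
    }
}
```

The cost is one thread per provider; a production design would more likely size pools per provider category, but the point is that the wrapping happens in one place the core controls.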

You cannot expect load tests to expose *every* problem related to your Google Summer of Code proposal. I would actually be surprised if you would be able to find more than two or three new problems this way.

You can implement lots of tests, each of them of some value, but none of them will be able to accurately or reliably simulate every possible “real world” scenario: there are simply too many factors to take into account. It will take you too long to analyze the required factors and to implement them. Even if you do succeed, you will have simulated just one particular domain. Other domains, also running Openfire, will have different traffic patterns and will therefore experience other problems.

I strongly advise against creating a “generic” test framework that you use to simulate generic “XMPP traffic” to see how Openfire performs. The results will be inconclusive and of little value (this is how Jive came to the “the Openfire cluster can handle 200,000 concurrent connections” claim). Besides, there are frameworks that do this for you (Tsung, for example).

Please note that I used bold, italic and underline to stress my point. Really. Don’t. It’s a waste of valuable time.

I have stated before: focus on verification, not simulation. Load tests (using Tsung) can and will be of value to you, but use them to see if a specific part of the system is behaving the way you expect. Use this kind of test only to verify that a very specific problem has been fixed by the solution that you have found. Combine the use of Tsung with a profiler, for instance, to verify that threads are running concurrently, and things like that.
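In the spirit of verification rather than simulation, a very specific claim such as "these handlers really run concurrently" can also be checked in code, next to the profiler. The hypothetical `ConcurrencyProbe` below wraps the work under test and records the highest number of simultaneously active executions observed; a load tool would supply the traffic, and the probe answers one narrow question.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical verification helper: wrap the work expected to run in parallel,
 * generate load, then inspect the peak concurrency that was actually observed.
 */
class ConcurrencyProbe {
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicInteger maxActive = new AtomicInteger();

    Runnable wrap(Runnable task) {
        return () -> {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max);   // remember the peak
            try {
                task.run();
            } finally {
                active.decrementAndGet();
            }
        };
    }

    int maxObserved() {
        return maxActive.get();
    }
}
```

If the observed peak stays at 1 under load, the "concurrent" path is in fact serialized somewhere, which is exactly the kind of specific finding Guus asks for.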

One last comment: I feel that you are intending to rewrite big parts of Openfire, including parts that in themselves are of little relation to the problems that you are trying to fix. Please prevent rewriting parts that do not need rewriting. For instance, the Apache MINA Framework already presents you with a number of ExecutorServices - you should re-use those if possible (and I believe this is possible).

Remember that other developers are working on the same code while you are doing GSoC. We should avoid having code merging problems when you are done!

Provided that your proposal is accepted, I would like you to structure your project in such a way that we can have (and integrate) working bits and pieces of your work along the way. I prefer to integrate a couple of things every few weeks over having to integrate everything after the end of the summer. This will help keep us on track, as it calls for regular integration with the existing code. As others are working on that code too, you’ll get “free” reviewers this way as well.

Hi,

First of all, thanks for the good discussion, which hopefully helps to solve the thread pool issue. We have been using OF in a production environment for some years and are experiencing the issues that are discussed in http://www.igniterealtime.org/community/docs/DOC-1925. One of the reasons that we chose OF was that a company like Jive Software was driving the development and had an interest in providing a stable XMPP service. You are discussing major design modifications, and there is a risk that things may go wrong. On the other hand, if you are successful, your changes will probably form the core of a new major OF 4.x release and will not just be a GSoC 2010 project.

Thanks!

Michael

Hi Michael,

Could you contact me off list? I’m very interested in the way you are using Openfire, how you’ve learned that you’re experiencing problems related to the Achilles’ Heel problem and what you have done so far in an effort to reduce the problem.

Regards,

Guus

Hello all,

As I’ve just commented on the original “Join XSF’s Google Summer of Code!” blog post, none of the submitted Openfire-related proposals made it. I’d like to thank all of you for all of the effort that you have been putting in.

I found some of the proposals to be of excellent quality. Sadly, the number of slots available to the XSF was a lot lower than last year. There simply wasn’t nearly enough room to accommodate all of the proposals.

As I wrote in the blog post comment: I’d love to see the students join in on the development anyway! If you’re interested, drop me a note and we’ll discuss in what form we can go forward.