Migration Testimony: Julia Kreger's Experience

Julia Kreger
https://www.linkedin.com/in/juliaashleykreger/
Senior Principal Software Engineer at Red Hat, Chair of the OpenStack Governing Board
Working on OpenStack Ironic
Technical Context
Which project or component did you migrate away from Eventlet?
Ironic
How deeply was Eventlet integrated into your codebase?
Ironic's codebase made heavy use of green threads and the monkey patching provided by Eventlet. Coupled with the WSGI server Eventlet provided, this model made sense for the Ironic project because it felt like it was keeping things simpler. In reality, we have learned that Eventlet brought us more complexity, but being on the journey and learning as we go is the critical aspect.
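For readers unfamiliar with the pattern, this kind of integration typically starts with a single call that rewires much of the standard library. The following is a general sketch of how Eventlet usage tends to look, not Ironic's exact code:

```python
# A minimal illustration of Eventlet's integration pattern, not Ironic's
# actual code: monkey patching must happen before nearly everything else.
import eventlet

# Replaces socket, threading, time, etc. with cooperative green versions,
# silently changing the behavior of every module imported afterwards.
eventlet.monkey_patch()

import time  # now the green version under the hood

def worker(name):
    # Green threads only yield control at I/O or sleep points.
    time.sleep(0.1)
    print('%s finished' % name)

# Spawn lightweight green threads instead of OS threads.
threads = [eventlet.spawn(worker, 'task-%d' % i) for i in range(3)]
for thread in threads:
    thread.wait()
```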
Which framework or alternative did you choose to replace Eventlet, and why?
Ironic is a service which can be deployed in numerous use cases. As a result, we have a number of processes, some of which also offer an operating model that enables Ironic's use without the RPC message bus that many other OpenStack services rely on. This drove us towards using cheroot, which was an item of community consensus. Cheroot is by no means perfect, but it allowed us to move our JSON-RPC and RESTful API endpoints without much heartache.
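As a rough illustration of the shape of that change, here is a minimal cheroot-based WSGI server; the application and bind address are placeholders, not Ironic's actual wiring:

```python
# A minimal sketch of serving a WSGI application with cheroot's
# thread-pool-based server; the app below is a placeholder, not
# Ironic's actual REST API.
from cheroot import wsgi

def app(environ, start_response):
    # Stand-in for a real JSON-RPC or REST endpoint.
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [b'{"status": "ok"}']

# Native worker threads replace Eventlet's green-threaded WSGI server.
server = wsgi.Server(('127.0.0.1', 6385), app, numthreads=10)
try:
    server.start()
except KeyboardInterrupt:
    server.stop()
```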
Motivation and Decision
What motivated you to start this migration?
Ironic chose to perform this migration largely because of the mounting evidence of issues creeping in from Eventlet. Ironic has always been very unit test heavy, and we were seeing more and more "weird" test failures which we could tell were rooted in Eventlet's operation and its monkey-patching of the test code, and ultimately the service code, as it related to the Python version in use.
G-Research Open Source Software was able to lend some additional Python expertise to Ironic to help us reach an understanding of what was going on, and that actually led to the initial call which started the groundswell to migrate off Eventlet in the OpenStack context.
Did you have any concerns or doubts before starting?
I don't think any of us had any real concerns or doubts. Obviously, before starting any major migration in a community which strives for perfection, we had to recognize and find the balance which would work for us as a group of contributors. The biggest concerns we had were rooted largely in the core of Ironic, our conductor service, because it performs the heaviest workload. It was extremely reliant upon green threads, and we had to find the right balance. Part of that balance was gaining an understanding, then building consensus around the possible problems or risk areas which would require work. One contributor did that, and then others reviewed, commented, and tried to understand their perception. In this process we were able to gain a better mutual understanding, but you can only do so much of that before you really need to experiment and begin to eliminate possible issues. Once we started experimenting, the speed of our progress grew dramatically.
Migration Process
How did the migration process go? Where did you start?
So, I'll first stress this was a team effort. CID started by working through the list of smaller, isolated areas raised by Dmitry. I and others started chipping away at aspects like tests which helped model the Eventlet behavior.
What tools or strategies helped you the most?
Etherpad was most likely the most critical tool that we used to help move this effort along.
That, combined with starting small, iterating, and getting to a point where we could load test with some fake data, really helped us gain velocity as we went.
Were there any particularly tricky or painful parts?
There were two painful and tricky parts. Perhaps three, as this morning we had reports of a breakage, but it was easy to fix.
Two of the issues were largely rooted in the final step: removing Eventlet and launching Ironic without Eventlet being loaded. It turned out oslo.service introduced some changes in the process launch model: new sub-processes are spawned via Python multiprocessing by default, instead of everything running inside a single process. This cascaded into our custom shutdown code, because this change in oslo.service introduced a change of behavior from Eventlet which forced us to retool some of our shutdown handling. Like any project with a variety of use cases and customer demands, we carry some additional features and logic for shutting down our ironic-conductor service. Once we understood what we needed to do, it was fairly straightforward.
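To make the behavior change concrete, here is a hedged sketch of the per-process signal handling that a multiprocessing launch model calls for; the names are illustrative, and this is not Ironic's actual shutdown code:

```python
# A hedged sketch of shutdown handling once services run as separate
# processes via Python multiprocessing; illustrative only, not Ironic's
# actual ironic-conductor shutdown logic.
import multiprocessing
import signal

def run_service(stop_event):
    # Each child process must install its own signal handler; handlers
    # are no longer shared the way they effectively were in a single
    # green-threaded process.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop_event.set())
    while not stop_event.is_set():
        stop_event.wait(timeout=1.0)  # stand-in for the service work loop
    # Custom cleanup (draining in-flight work, releasing locks, etc.)
    # would run here before the process exits.

if __name__ == '__main__':
    stop = multiprocessing.Event()
    worker = multiprocessing.Process(target=run_service, args=(stop,))
    worker.start()
    worker.join()
```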
The third issue was our TaskManager and its worker threads... and the resulting memory impacts.
We knew, due to our internal structural model of interaction, that we had to maintain the model of executing any task or user-requested work item on its own thread, while also multiplexing our existing maintenance work across threads as well. We knew the basic pattern to expect: these threads were going to be IO-bound. Secondly, we knew that as we increased the number of threads, our memory footprint would swell. This quickly led us to a pattern of setting up several simple tests with mock data and beginning to measure the memory impact. We quickly realized we were going to have a huge issue.

Many of Ironic's users are systems operators who are improving their quality of life through Ironic. We are also highly scalable and tunable, and we generally recognize that at a certain scale operators may need to further tune aspects. The end result is that we have many configuration settings which ship with reasonable defaults but can be tuned to support operators. And often, those operators don't actually understand how the knobs can impact their performance. For example, we've had operators come in and say "I set it to a thousand threads and each timeout to like 10 minutes" and then complain of issues. It becomes a teaching opportunity for the maintainers of Ironic.

So when we launched Ironic with four hundred worker threads in a simulated load test and could trigger the host's OOM killer, we knew we had to address the thread worker model. Luckily, Dmitry Tantsur was up for that challenge. This resulted in a few weeks of collaboration and testing. Ironic now has what might be a better "rejector" function than what the futurist library carries, but time will tell.
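For context, futurist (the library used for these thread pools) supports rejecting new submissions once a backlog threshold is reached. The custom rejector below is a hypothetical illustration of the idea, not the implementation Ironic landed:

```python
# A sketch of bounding worker threads and queue backlog with the futurist
# library. futurist ships a stock reject_when_reached policy; the custom
# rejector below is a hypothetical illustration, not Ironic's actual one.
import futurist

def bounded_rejector(max_backlog):
    # Called by the executor on every submission with the executor and
    # the current count of queued-but-not-yet-running work items.
    def check_and_reject(executor, backlog):
        if backlog >= max_backlog:
            # Raising here pushes back on callers instead of letting the
            # queue (and memory) grow without bound.
            raise futurist.RejectedSubmission(
                'backlog %d reached limit %d' % (backlog, max_backlog))
    return check_and_reject

executor = futurist.ThreadPoolExecutor(
    max_workers=100,  # a bounded pool instead of unbounded green threads
    check_and_reject=bounded_rejector(max_backlog=200),
)
```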
Roughly how long did the migration take?
From the discussion really getting started to the work being executed to completion, it felt like about a four or five month process. Ironic did have to take a little more time to address some of its various use cases and ensure they were going to work as we expected, which led to additional work and some undoing of earlier work as we were able to measure performance in some different cases. Ultimately, that led to a more performant and capable service in the end.
Were you able to migrate incrementally? If so, how?
Ironic started with the low-hanging-fruit approach: things like unit tests, explicit invocations of Eventlet, and specific threading invocations. This then shifted to some testing, updating our code around process launch and shutdown, and then ultimately addressing the more specific use cases around Ironic, like those leveraged by Metal3, in which Ironic is embedded.
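As a small, hypothetical example of what that low-hanging fruit can look like, replacing an explicit green-thread spawn with a native thread is often mechanical; the function name here is a placeholder, not a real Ironic entry point:

```python
# An illustrative before/after for replacing an explicit Eventlet
# invocation with native threading; sync_power_states is a placeholder
# name, not real Ironic code.
import threading

def sync_power_states():
    # Stand-in for a periodic maintenance task.
    pass

# Before: a cooperative green thread, dependent on monkey patching.
#   import eventlet
#   eventlet.spawn(sync_power_states)

# After: a real OS thread, no monkey patching required.
worker = threading.Thread(target=sync_power_states, daemon=True)
worker.start()
worker.join()
```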
Outcomes and Benefits
What concrete benefits have you seen after migrating?
First off, our unit tests now seem to be less "flaky". Spurious failures were uncommon before, but I don't think I've seen one since we've migrated.
Secondly, we've seen some dramatic performance improvements. Not only that, we were able to identify some additional areas where performance was degraded and we just didn't really "see" it before. For example, Metal3 had a test which took about a minute and twenty seconds with Eventlet. That runtime spiked beyond two minutes with some changes, and we were able to see some of the root causes and get it down to somewhere in the neighborhood of six seconds at last report.
How did your team react to the change?
This was really a collaborative effort amongst CID, Jay Faulkner, Dmitry Tantsur, and myself for the core of the Ironic service. Ironic also has Neutron plugins, some other components, and other code, and others jumped in to help there. Once everyone understood the reasons and the impacts, it just made sense, and we worked through it one step at a time. We did also decide not to migrate one service in our project scope, but that service was previously deprecated.
Lessons Learned
What advice would you give to a team that's hesitant to migrate?
Start small. Iterate. Try to move with purpose. Embrace the change, and don't let fear or uncertainty slow you down. Just take one step at a time with a basic plan. Expect the plan to need to change. It will be okay.
Is there anything you would do differently next time?
I think it is too early to speak to this for the Ironic project. We could have moved slower, but that would have just elongated the process and made the effort harder.
Have you faced blockers? If so, which?
Our biggest blocker was in the futurist library when we realized that we needed to balance our worker threads and memory usage. Through collaboration, we were able to make progress.
Would you like to share a link to a patch, repo, or documentation?
We ended up drafting a blog post for Ironic's blog as a result of this work, in part to bring awareness to the impact, the tunables, and the areas operators might want to take a look at. This was critical for us, as Ironic's user base is spread from "embedded" use cases to highly scaled environments. This seemed like the right balance to take in the context of the effort. https://ironicbaremetal.org/blog/coming-soon-threading/
Final Thoughts
Is there anything else you'd like to share with the community about your experience?
Embrace change. Move forward. The grass is definitely "greener" on the side of real threads.