Day 1 at the OSPM Summit, Pisa, Italy


The first summit on power management and scheduling disciplines in the Linux kernel was held at Scuola Superiore Sant'Anna in Pisa, Italy on Monday 3 April and Tuesday 4 April 2017. The event was organised by ARM and members of the ReTiS Lab. It attracted a wide audience spanning both industry and academia. Linaro attended the conference and offers the following summary from day 1 (to view the summary from day 2, click here). To view the presentations listed below, click on the headings.

 
Tooling/LISA
By Patrick Bellasi and Brendan Jackman (slides)

The presentation started with an introduction to LISA and the motivation behind its development. It is a set of tools and scripts built on top of existing technologies and frameworks. The goal is to understand the effect of changes made to the scheduler and to spot regressions. Everything is available on GitHub so that people can work with common test cases and compare results easily. What is currently available integrates different examples of analysis scripts and plots, making it easy for newcomers to get started quickly. A lot of recipes are also available. Patrick gave plenty of examples of the types of plots already available, and the quality and relevance of those graphs is impressive. The library is powerful and gives good insight into what is happening from different perspectives. It also has good support for latency analysis. Brendan continued the presentation with more specific tools from the library, namely TRAPpy and Jupyter. The former is a Python-based library that provides support for rendering kernelshark-like results, while the latter is a browser-based technology offering an environment where graphs can be plotted from queries formulated by users in real time.
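To give a feel for the kind of workflow these tools automate, here is a hand-rolled sketch (deliberately not using the TRAPpy API, whose details are not covered here) that parses cpu_frequency events out of a textual ftrace dump into a pandas DataFrame and step-plots them, kernelshark-style. The file name and capture method are assumptions.

```python
import re
import pandas as pd
import matplotlib.pyplot as plt

# Extract cpu_frequency events from a textual ftrace dump ("trace.txt",
# e.g. produced with `trace-cmd report > trace.txt`). TRAPpy automates
# exactly this kind of parsing-into-DataFrame step.
PATTERN = re.compile(
    r"\s(?P<time>\d+\.\d+): cpu_frequency: state=(?P<freq>\d+) cpu_id=(?P<cpu>\d+)"
)

rows = []
with open("trace.txt") as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            rows.append((float(m.group("time")),
                         int(m.group("freq")),
                         int(m.group("cpu"))))

df = pd.DataFrame(rows, columns=["time", "freq_khz", "cpu"])

# Step plot of the frequency trace for CPU 1.
cpu1 = df[df.cpu == 1]
plt.step(cpu1.time, cpu1.freq_khz, where="post")
plt.xlabel("time (s)")
plt.ylabel("frequency (kHz)")
plt.show()
```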

The presenters admitted the learning curve is steep but said the results are well worth it. Todd Kjos (Google’s Android kernel team) reported his experience, saying that if what you need is already present then things are easy; otherwise the investment becomes considerable. He also said that the current efforts to improve the documentation are definitely helping. The presentation finished with questions from the audience. The conclusion was that things are bound to change a little as scheduler tracepoints are modified, but the presented tools have no strict dependency on the current trace format.

 
About the Need to Power Instrument the Linux Kernel
By Patrick Titiano (slides)

Patrick started his presentation with a description of the problems he is currently facing, i.e. there is a lack of power data and instrumentation, along with no probe points for power measurement. The hardware currently available is costly and vendors are usually slow to share power numbers. In his opinion, the situation is caused by the false belief that power management is of no interest to people, something that couldn't be further from the truth. To address the problem he suggests introducing a generic power instrumentation framework, allowing power management to be debugged on any board without resorting to expensive hardware. That would help further the modelling of power usage on current systems and help design new generations.

What is needed to achieve something like that? First, a common database cataloguing how much power devices consume (CPU, GPU, RAM, UART, I2C, …), then tools to plot and process the power traces generated by systems. The emphasis would be placed on keeping things generic. We currently have tools like ftrace capable of exporting power-related information, but that data rarely gets out.
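To make the database idea concrete, here is a sketch of what a single entry in such a catalogue might look like; the schema, field names and numbers are all invented for illustration.

```python
# Hypothetical schema for one device entry in a shared power database.
# Field names and numbers are invented for illustration only.
cpu_entry = {
    "soc": "example-soc",
    "device": "cpu0",
    "opp_table": [                     # power per operating point
        {"freq_khz": 500000,  "voltage_mv": 800, "power_mw": 120},
        {"freq_khz": 1000000, "voltage_mv": 950, "power_mw": 350},
    ],
    "idle_states": [                   # power per idle state
        {"name": "WFI",     "power_mw": 20},
        {"name": "cpu-off", "power_mw": 2},
    ],
}
```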

One participant's view was that this idea is dead from the start: there is already a database available for ACPI and it is not being used. Another person stated that manufacturers know the numbers but don't want to share the information. In summary, a lot of the infrastructure is already available; what is needed is some kind of central repository for publishing power consumption data, plus user space tools for plotting and analysis.

 

What are the latest evolutions in PELT and what next
By Vincent Guittot (slides)

Vincent started his presentation by going over the different load tracking mechanisms currently used by the scheduling classes: CFS uses PELT, RT uses the RT average, and deadline tracks the runqueues' active utilisation. So far most of the focus has been on PELT. From there, he proceeded with a couple of graphs: one highlighting various problems with PELT in kernel 4.9 and another with the tip kernel, where fixes for those problems have been included. Notable improvements include a more stable utilisation signal, along with load and utilisation being carried with a task when it is moved. Things remaining to sort out include frequency invariance, the updating of blocked idle loads, and the dropping of the utilisation metric upon DL/RT preemption and migration. On the frequency invariance front, the goal is to make min/max utilisation the same for every frequency and across architectures. That way load becomes invariant and the system responds more quickly to sudden load spikes. That sparked a conversation about what to do when approaching maximum CPU utilisation, i.e. should we go to the maximum OPP directly or approach things from the bottom? The problem is to find the point at which boosting the OPP becomes worthwhile. Regarding the updating of blocked idle loads, Vincent said it needs to happen more frequently, since those loads are used to set shares in task groups and to determine OPPs when schedutil is in use. He also has a prototype that tracks RT utilisation by adding a PELT-like utilisation metric to the root RT runqueue. The presentation ended with an open-ended question about how to evaluate whether a thread is getting all the running time it wants. Knowing how much time tasks spend waiting would help determine when (and by how much) to increase the operating frequency.
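For reference, the geometric decay at the heart of PELT can be illustrated in a few lines. The 1024 µs period and the y^32 = 0.5 half-life match the constants used by the kernel; the code itself is only a model of the signal, not the kernel implementation.

```python
# PELT sums contributions over 1024us periods, each past period decayed
# by y, with y chosen so that y**32 == 0.5 (a task's history halves
# every 32ms).
Y = 0.5 ** (1 / 32)

def pelt_sum(running):
    """Decayed sum over past periods; running[0] is the most recent,
    each entry being the fraction of that period spent running."""
    return sum(r * Y**i for i, r in enumerate(running))

# Dividing by the geometric series' limit normalises into [0, 1]:
MAX_SUM = 1 / (1 - Y)

# A task running 50% of the time converges towards 0.5.
util = pelt_sum([0.5] * 345) / MAX_SUM
print(round(util, 3))   # -> 0.5
```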

 

PELT decay clamping/UTIL_EST
By Morten Rasmussen and Patrick Bellasi

The problem exposed by Morten and Patrick is that periodic tasks with very long sleep periods lose too much of their accumulated utilisation, which leads to wrong estimates at the next wakeup. Tasks that are not clamped face a very large ramp-up after waking and hence a less responsive system. With clamping, more of the task's history is preserved and the ramp-up to higher operating frequencies is shorter. At that point, a participant asked whether long-sleeping tasks could simply be treated as new tasks when they wake up. Morten thought it was a possible avenue, but the problem is determining how long a task needs to sleep before being considered new. Morten has patches implementing clamping; they have a few issues to sort out but are good enough for review. Overall, participants were not keen on the approach. Another idea that came up was to use PELT as an estimator and collect what was learned about tasks' previous activations. This makes it possible to build a new metric on top of PELT for tasks and CPUs. This new metric (namely util_est) can be used to drive OPPs and to better support task placement in the wakeup path. Moreover, it has the advantage of keeping the original PELT signal intact (thus not risking breaking its mathematical properties) while providing a better abstraction for signal consolidation policies.
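A minimal sketch of the util_est idea as discussed, with illustrative names (the eventual patches may differ): sample the PELT signal when a task is dequeued, keep a moving average of those samples, and let consumers read the maximum of the two so that a long sleep cannot drag the estimate down.

```python
# Illustrative sketch of util_est: PELT keeps decaying while a task
# sleeps, but the estimate remembered from past activations does not.
class TaskUtilEstimate:
    def __init__(self, weight=0.25):
        self.ewma = 0.0          # moving average of utilisation at dequeue
        self.weight = weight     # how fast the estimate tracks new samples

    def on_dequeue(self, pelt_util):
        # Sample PELT at the end of an activation, before sleep decays it.
        self.ewma += self.weight * (pelt_util - self.ewma)

    def estimate(self, pelt_util):
        # The PELT signal itself is left untouched; consumers (OPP
        # selection, wakeup placement) read the max of the two signals.
        return max(pelt_util, self.ewma)
```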

 

EAS where we are
By Morten Rasmussen

The presentation started with a short introduction to EAS and why it is important, i.e. the idea is to maximise CPU utilisation and power efficiency. A short overview followed of why scale invariance is needed. Morten noted that while predicting the future based on the past is inherently flawed, the current EAS/PELT energy model is the best we can do. It is possible to set CPU affinity for a task, but doing so overrides all the intelligence and choices made by the energy model. The tradeoff between performance and energy consumption is always very use case specific. Thermal is also a problem, as system thermal throttling often gets in the way and will preempt decisions taken by EAS; there is currently no correlation or communication between the thermal FW and the scheduler (EAS). Regarding energy-aware scheduling, a lot of things like schedutil, capacity-aware task wakeup and PELT group utilisation are already upstream. Discussion is now happening around SCHED_DEADLINE scale invariance, schedtune and device tree capacity values. In the longer term, NOHZ PELT updates, capacity-aware load balancing and the placement of ‘misfit’ tasks are on the radar.

 

Energy model and exotic topologies
By Brendan Jackman (slides)

Brendan started his session with several figures on the EAS energy model concept and its data structures. From there he proceeded to highlight the importance of cluster-level energy data for cluster packing, that is, knowing when tasks should be packed together on a cluster. It is easy to show the effect of cluster packing on scheduler behaviour, but harder to demonstrate energy savings on modern platforms. This was followed by a short overview of ARM's DynamIQ Shared Unit (DSU). The concept involves packing different types of CPU in the same cluster: up to eight CPUs sharing an L3 cache, with all, some or none of the CPUs having their own L2. Simply put, the architectural topology boundaries we have seen so far are no longer congruent with frequency domains, power domains and CPU capacity boundaries. That led to a discussion on the ramifications of an energy model for such heterogeneous topologies.
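As a rough model of why cluster-level energy data matters for packing decisions, here is a sketch of an energy estimate for one frequency domain; the capacity-state table and all power numbers are invented, and the structure only loosely follows the EAS energy model.

```python
# Illustrative energy estimate for one frequency domain; the capacity
# states (capacity, busy power in mW) and other numbers are invented.
cap_states = [(512, 140), (768, 260), (1024, 500)]
idle_power = 10     # mW, idle power while the domain is up
static_power = 50   # mW, cluster overhead while the domain is powered

def domain_energy(util):
    """Estimated power for a domain whose total utilisation is `util`."""
    if util == 0:
        return 0.0  # an empty cluster can be powered off entirely
    # Pick the lowest capacity state able to serve the utilisation.
    cap, busy_power = next((c, p) for c, p in cap_states if c >= util)
    busy = util / cap
    return busy * busy_power + (1 - busy) * idle_power + static_power

# Packing two 200-unit tasks on one domain vs spreading them over two:
print(domain_energy(400) + domain_energy(0))   # packed: ~161.6 mW
print(2 * domain_energy(200))                  # spread: ~221.6 mW
```

In this toy model packing wins purely because the second cluster's static power disappears, which is exactly the cluster-level data the scheduler needs in order to decide.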

 

Schedtune
By Patrick Bellasi (slides)

Patrick started his session with a very good description of the problem he is trying to solve. His goal is to communicate user space task requirements to the kernel so that existing policies for OPP selection and task placement can be improved. Taking the Pixel as a base platform, a set of concepts has been evaluated: boosting of “top applications”, minimum capacity for specific tasks and the introduction of a “prefer_idle” flag for latency-sensitive tasks. Preferably those would be added as extensions to existing concepts. As an example, task boosting could be partially supported by the cpu.shares attribute of the CPU cgroup controller, while OPP biasing (minimum and maximum preferred CPU capacity) and prefer_idle could be handled through new cpu.{min_capacity,max_capacity} flags. Patrick has published a “CPU utilisation clamping” patchset in which the concepts of OPP biasing and negative boosting are implemented with the introduction of new cpu.min_capacity and cpu.max_capacity attributes. Among the advantages of the current proposal are that it is built on top of existing policies and that its runtime overhead is negligible.

From there, audience members expressed concerns about the feasibility of extending the current APIs. The concept of minimum capacity and the proper semantics to make it useful were brought forward, but doubts were raised about whether it is actually required. Participants were of the opinion that other things, such as PELT's underestimation of task requirements, could be improved before we get there. It was also underlined that the current energy models avoid over-provisioning and that efforts to address the situation are still very use case specific. The level of abstraction used to describe a task's requirements was also raised: too coarse and the model becomes inefficient; too detailed and it risks computation overhead and exposure of internal kernel specifics. The presentation concluded with an overview of the current work in progress, such as finishing the task placement feature and completing the integration with the AOSP user space, i.e. cleaning up the current sched_policy and extending task classes to their proper mapping.
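The clamping mechanism itself is conceptually simple. A minimal sketch, assuming utilisation and capacity share the usual 0..1024 scale (the attribute names follow the proposal above, but the code is purely illustrative):

```python
# Illustrative sketch of CPU utilisation clamping: the utilisation a
# cgroup's tasks report to OPP selection is clamped into the range
# configured through the proposed cpu.min_capacity/cpu.max_capacity
# attributes. Values are on a 0..1024 capacity scale.
def clamped_util(util, min_capacity=0, max_capacity=1024):
    return max(min_capacity, min(util, max_capacity))

# A boosted, latency-sensitive group never lets the CPU drop below
# half capacity, however small the measured utilisation:
print(clamped_util(120, min_capacity=512))    # -> 512
# A background group is negatively boosted and capped:
print(clamped_util(900, max_capacity=300))    # -> 300
```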

 

SCHED_DEADLINE and bandwidth reclaiming
By Luca Abeni and Juri Lelli (slides)

The presentation started with Luca talking about the SCHED_DEADLINE class and the general concept behind bandwidth reclaiming. The implemented algorithm is called Greedy Reclamation of Unused Bandwidth (GRUB); it makes it possible to reclaim runtime left unused by some deadline tasks and give other tasks more runtime than originally agreed upon, without breaking deadline guarantees or starving non-deadline tasks. Alternatively, the reclaiming mechanism can be used for power management by lowering the CPU frequency based on the current load. Knowing how much time can be reclaimed also helps take better frequency scaling decisions. The current patchset determines how much to reclaim by tracking how long deadline tasks are inactive. This is currently done on a per-runqueue basis, but another prototype does it globally. Another approach that tracks active utilisation was also considered, but it had too many issues to be pursued further. One of the main hurdles is knowing when to update a task's utilisation metrics: doing so when the task blocks led to too much bandwidth being reclaimed. Instead, the solution considers a blocking task to still be contributing (“blocked active”) until its 0-lag time; at that point its utilisation is subtracted from the total and the task is considered “blocked inactive”. Scheduler maintainer Peter Zijlstra said he would have merged the current patchset had it not been for minor issues. Implementation optimisations still remain, notably around reclaiming bandwidth for non-deadline tasks and the time-consuming iteration over all active runqueues in the root domain when looking for bandwidth to reclaim. The presentation concluded with a patch-by-patch walkthrough of the current patchset and a word or two on the availability of another patchset that tracks inactive utilisation globally, should people be interested.
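A simplified sketch of the GRUB accounting rule described above (the real patchset tracks per-runqueue active utilisation and 0-lag times, and includes further refinements): instead of depleting a running task's budget at wall-clock rate, GRUB depletes it in proportion to the active utilisation, so bandwidth left unused by inactive tasks is effectively redistributed.

```python
# Simplified illustration of GRUB runtime accounting for SCHED_DEADLINE.
# Plain CBS depletes a running task's budget at wall-clock rate:
#     dq = -dt
# GRUB instead depletes it in proportion to the runqueue's active
# utilisation Uact, so unused bandwidth is reclaimed:
#     dq = -Uact * dt
def grub_deplete(runtime_ns, delta_ns, uact):
    """Charge delta_ns of execution against runtime_ns under GRUB."""
    return runtime_ns - int(uact * delta_ns)

# Two tasks were admitted with 0.4 utilisation each, but one is
# currently inactive (past its 0-lag time), so Uact = 0.4: the running
# task burns budget at 40% of wall-clock rate and can run longer.
budget = 10_000_000                        # 10ms of granted runtime
budget = grub_deplete(budget, 5_000_000, uact=0.4)
print(budget)                              # 8_000_000 ns left after 5ms
```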

 
