[PATCH v4] power: introduce library for device-specific OPPs

Paul E. McKenney paulmck at linux.vnet.ibm.com
Fri Sep 24 17:40:03 EDT 2010


On Fri, Sep 24, 2010 at 04:26:21PM -0500, Nishanth Menon wrote:
> Paul E. McKenney had written, on 09/24/2010 02:37 PM, the following:
> [...]
> >
> >Looks like a good start!!!  Some questions and suggestions about RCU
> Thanks for the review.. few comments below..

Back at you!  ;-)

> >usage interspersed below.
> >
> >                                                        Thanx, Paul
> [...]
> >>+
> >>+/**
> >>+ * find_device_opp() - find device_opp struct using device pointer
> >>+ * @dev:     device pointer used to lookup device OPPs
> >>+ *
> >>+ * Search list of device OPPs for one containing matching device. Does a RCU
> >>+ * reader operation to grab the pointer needed.
> >>+ *
> >>+ * Returns pointer to 'struct device_opp' if found, otherwise -ENODEV or
> >>+ * -EINVAL based on type of error.
> >>+ */
> >>+static struct device_opp *find_device_opp(struct device *dev)
> >>+{
> >>+     struct device_opp *tmp_dev_opp, *dev_opp = ERR_PTR(-ENODEV);
> >>+
> >>+     if (unlikely(!dev || IS_ERR(dev))) {
> >>+             pr_err("%s: Invalid parameters being passed\n", __func__);
> >>+             return ERR_PTR(-EINVAL);
> >>+     }
> >>+
> >>+     rcu_read_lock();
> >>+     list_for_each_entry_rcu(tmp_dev_opp, &dev_opp_list, node) {
> >>+             if (tmp_dev_opp->dev == dev) {
> >>+                     dev_opp = tmp_dev_opp;
> >>+                     break;
> >>+             }
> >>+     }
> >>+     rcu_read_unlock();
> >
> >What prevents the structure pointed to by dev_opp from being freed
> >at this point?  We are no longer in an RCU read-side critical section,
> >so RCU grace periods starting during the above RCU read-side critical
> >section can now end.
> 
> dev_opp is never freed in the implementation -> it represents
> domains, only adds with list_add_rcu() is done -> wont the usage be
> safe then? or I being blind?
> 
> the reason why we dont free is coz of the following: dev_opp
> represents voltage domains in opp library. SoC frameworks are
> required to register only those voltage domain opp that are
> required. by allowing a free logic, I knew it'd have complicated the
> implementation way beyond what we needed it to be.

Perhaps I was confusing two different data structures, if so, apologies.

So you are freeing the opp level, but never the dev_opp level, then?

But yes, if you are only adding and never deleting, then it is safe to
pass the pointers out of an RCU read-side critical section.  But please
add a comment saying why you are doing this.  Otherwise, Coccinelle will
cause me to continue complaining about this to you.  ;-)

And the later uses still look buggy to me, please see below.

> >Here is an example sequence of events that I am worried about:
> >
> >o       CPU 1 enters find_device_opp(), and pick up a pointer to
> >        a given device opp.
> >
> >o       CPU 2 executes opp_set_availability(), replacing that same
> >        device opp with a new one.  It then calls synchronize_rcu()
> >        which blocks waiting for CPU 1 to exit its RCU read-side
> >        critical section.
> >
> >o       CPU 1 exits its RCU read-side critical section, arriving at
> >        this point in the code.
> >
> >o       CPU 2's synchronize_rcu() is now permitted to return, executing
> >        the kfree(), which frees up the memory that CPU 1's dev_opp
> >        pointer references.
> >
> >o       This newly freed memory is allocated for some other structure
> >        by CPU 3.  CPU 1 and CPU 3 are now trying to use the same
> >        memory for two different structures, and nothing good can
> >        possibly come of this.  The kernel dies a brutal and nasty
> >        death.
> >
> >One way to fix this is to have the caller do rcu_read_lock() before
> >calling find_device_opp(), and to do rcu_read_unlock() only after the
> >caller has finished using the pointer that find_device_opp() returns.
> >This works well unless the caller needs to do some blocking operation
> >before it gets done using the pointer.
> >
> >Another approach is for find_device_opp() to use a reference count on
> >the structure, and for opp_set_availability() to avoid freeing the
> >structure unless/until the reference counter drops to zero.
> >
> >There are other approaches as well, please feel free to take a look
> >at Documentation/RCU/rcuref.txt for more info on using reference
> >counting and RCU.
> thx. I probably should read yet again if I got my understanding of
> usage right..
> 
> >
> [...]
> >>+
> >>+/**
> >>+ * opp_find_freq_exact() - search for an exact frequency
> >>+ * @dev:             device for which we do this operation
> >>+ * @freq:            frequency to search for
> >>+ * @is_available:    true/false - match for available opp
> >>+ *
> >>+ * Searches for exact match in the opp list and returns pointer to the matching
> >>+ * opp if found, else returns ERR_PTR in case of error and should be handled
> >>+ * using IS_ERR.
> >>+ *
> >>+ * Note: available is a modifier for the search. if available=true, then the
> >>+ * match is for exact matching frequency and is available in the stored OPP
> >>+ * table. if false, the match is for exact frequency which is not available.
> >>+ *
> >>+ * This provides a mechanism to enable an opp which is not available currently
> >>+ * or the opposite as well.
> >>+ *
> >>+ * Locking: RCU reader.
> >>+ */
> >>+struct opp *opp_find_freq_exact(struct device *dev,
> >>+                                  unsigned long freq, bool available)
> >>+{
> >>+     struct device_opp *dev_opp;
> >>+     struct opp *temp_opp, *opp = ERR_PTR(-ENODEV);
> >>+
> >>+     dev_opp = find_device_opp(dev);
> >>+     if (IS_ERR(dev_opp))
> >>+             return opp;
> >>+
> >>+     rcu_read_lock();
> >>+     list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> >>+             if (temp_opp->available == available &&
> >>+                             temp_opp->rate == freq) {
> >>+                     opp = temp_opp;
> >>+                     break;
> >>+             }
> >>+     }
> >>+     rcu_read_unlock();
> >
> >But this one sadly has the same problem that find_device_opp() does.
> is the concern about opp OR about dev_opp here? I am guessing opp..
> >
> >>+     return opp;
> >>+}
> >>+
> >>+/**
> >>+ * opp_find_freq_ceil() - Search for an rounded ceil freq
> >>+ * @dev:     device for which we do this operation
> >>+ * @freq:    Start frequency
> >>+ *
> >>+ * Search for the matching ceil *available* OPP from a starting freq
> >>+ * for a device.
> >>+ *
> >>+ * Returns matching *opp and refreshes *freq accordingly, else returns
> >>+ * ERR_PTR in case of error and should be handled using IS_ERR.
> >>+ *
> >>+ * Locking: RCU reader.
> >>+ */
> >>+struct opp *opp_find_freq_ceil(struct device *dev, unsigned long *freq)
> >>+{
> >>+     struct device_opp *dev_opp;
> >>+     struct opp *temp_opp, *opp = ERR_PTR(-ENODEV);
> >>+
> >>+     if (!dev || !freq) {
> >>+             pr_err("%s: invalid param dev=%p freq=%p\n", __func__,
> >>+                             dev, freq);
> >>+             return ERR_PTR(-EINVAL);
> >>+     }
> >>+
> >>+     dev_opp = find_device_opp(dev);
> >>+     if (IS_ERR(dev_opp))
> >>+             return opp;
> >>+
> >>+     rcu_read_lock();
> >>+     list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> >>+             if (temp_opp->available && temp_opp->rate >= *freq) {
> >>+                     opp = temp_opp;
> >>+                     *freq = opp->rate;
> >>+                     break;
> >>+             }
> >>+     }
> >>+     rcu_read_unlock();
> >
> >And this one also has the same problem that find_device_opp() does.
>
> guessing opp ptr here.. am I right? if it is about device_opp, it is
> not going to be freed as I mentioned above - at least we dont give
> an function to update(hence free) it.

It really does look to me that you are passing a pointer to the thing
being freed out of an RCU read-side critical section.  If you are really
doing this, you do need to do something to fix it.  I outlined some of
the options earlier.

> >>+     return opp;
> >>+}
> >>+
> >>+/**
> >>+ * opp_find_freq_floor() - Search for a rounded floor freq
> >>+ * @dev:     device for which we do this operation
> >>+ * @freq:    Start frequency
> >>+ *
> >>+ * Search for the matching floor *available* OPP from a starting freq
> >>+ * for a device.
> >>+ *
> >>+ * Returns matching *opp and refreshes *freq accordingly, else returns
> >>+ * ERR_PTR in case of error and should be handled using IS_ERR.
> >>+ *
> >>+ * Locking: RCU reader.
> >>+ */
> >>+struct opp *opp_find_freq_floor(struct device *dev, unsigned long *freq)
> >>+{
> >>+     struct device_opp *dev_opp;
> >>+     struct opp *temp_opp, *opp = ERR_PTR(-ENODEV);
> >>+
> >>+     if (!dev || !freq) {
> >>+             pr_err("%s: invalid param dev=%p freq=%p\n", __func__,
> >>+                             dev, freq);
> >>+             return ERR_PTR(-EINVAL);
> >>+     }
> >>+
> >>+     dev_opp = find_device_opp(dev);
> >>+     if (IS_ERR(dev_opp))
> >>+             return opp;
> >>+
> >>+     rcu_read_lock();
> >>+     list_for_each_entry_rcu(temp_opp, &dev_opp->opp_list, node) {
> >>+             if (temp_opp->available) {
> >>+                     /* go to the next node, before choosing prev */
> >>+                     if (temp_opp->rate > *freq)
> >>+                             break;
> >>+                     else
> >>+                             opp = temp_opp;
> >>+             }
> >>+     }
> >>+     if (!IS_ERR(opp))
> >>+             *freq = opp->rate;
> >>+     rcu_read_unlock();
> >
> >As does this one.
>
> guessing opp ptr here.. am I right?

Again, here it looks to me like you are passing a pointer out of an RCU
read-side critical section that could be freed out from under you.  If
so, again, this must be fixed.

> >>+     return opp;
> >>+}
> >>+
> >>+/**
> >>+ * opp_add()  - Add an OPP table from a table definitions
> >>+ * @dev:     device for which we do this operation
> >>+ * @freq:    Frequency in Hz for this OPP
> >>+ * @u_volt:  Voltage in uVolts for this OPP
> >>+ *
> >>+ * This function adds an opp definition to the opp list and returns status.
> >>+ * The opp is made available by default and it can be controlled using
> >>+ * opp_enable/disable functions.
> >>+ *
> >>+ * Locking: RCU, mutex
> >>+ */
> >>+int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
> >>+{
> >>+     struct device_opp *tmp_dev_opp, *dev_opp = NULL;
> >>+     struct opp *opp, *new_opp;
> >>+     struct list_head *head;
> >>+
> >>+     /* Check for existing list for 'dev' */
> >>+     rcu_read_lock();
> >>+     list_for_each_entry_rcu(tmp_dev_opp, &dev_opp_list, node) {
> >>+             if (dev == tmp_dev_opp->dev) {
> >>+                     dev_opp = tmp_dev_opp;
> >>+                     break;
> >>+             }
> >>+     }
> >>+     rcu_read_unlock();
> >>+
> >>+     /* allocate new OPP node */
> >>+     new_opp = kzalloc(sizeof(struct opp), GFP_KERNEL);
> >>+     if (!new_opp) {
> >>+             pr_warning("%s: unable to allocate new opp node\n",
> >>+                     __func__);
> >>+             return -ENOMEM;
> >>+     }
> >>+
> >>+     if (!dev_opp) {
> >>+             /* Allocate a new device OPP table */
> >>+             dev_opp = kzalloc(sizeof(struct device_opp), GFP_KERNEL);
> >>+             if (!dev_opp) {
> >>+                     kfree(new_opp);
> >>+                     pr_warning("%s: unable to allocate device structure\n",
> >>+                             __func__);
> >>+                     return -ENOMEM;
> >>+             }
> >>+
> >>+             dev_opp->dev = dev;
> >>+             INIT_LIST_HEAD(&dev_opp->opp_list);
> >>+             mutex_init(&dev_opp->lock);
> >>+
> >>+             /* Secure the device list modification */
> >>+             mutex_lock(&dev_opp_list_lock);
> >>+             list_add_rcu(&dev_opp->node, &dev_opp_list);
> >>+             mutex_unlock(&dev_opp_list_lock);
> >>+             synchronize_rcu();
> >
> >You do not need to wait for an RCU grace period when adding objects, only
> >between removing them and freeing them.
>
> ouch.. my bad.. thx.. will fix
> 
> >
> >>+     }
> >>+
> >>+     /* populate the opp table */
> >>+     new_opp->dev_opp = dev_opp;
> >>+     new_opp->rate = freq;
> >>+     new_opp->u_volt = u_volt;
> >>+     new_opp->available = true;
> >>+
> >>+     /* make the dev_opp modification safe */
> >>+     mutex_lock(&dev_opp->lock);
> >>+
> >>+     rcu_read_lock();
> >>+     /* Insert new OPP in order of increasing frequency */
> >>+     head = &dev_opp->opp_list;
> >>+     list_for_each_entry_rcu(opp, &dev_opp->opp_list, node) {
> >>+             if (new_opp->rate < opp->rate)
> >>+                     break;
> >>+             else
> >>+                     head = &opp->node;
> >>+     }
> >>+     rcu_read_unlock();
> >>+
> >>+     list_add_rcu(&new_opp->node, head);
> >>+     mutex_unlock(&dev_opp->lock);
> >>+     synchronize_rcu();
> >
> >Ditto.
> thx.. will fix.
> 
> >
> >>+     return 0;
> >>+}
> >>+
> >>+/**
> >>+ * opp_set_availability() - helper to set the availability of an opp
> >>+ * @opp:             Pointer to opp
> >>+ * @availability_req:        availability status requested for this opp
> >>+ *
> >>+ * Set the availability of an OPP with an RCU operation, opp_{enable,disable}
> >>+ * share a common logic which is isolated here.
> >>+ *
> >>+ * Returns -EINVAL for bad pointers, -ENOMEM if no memory available for the
> >>+ * copy operation, returns 0 if no modifcation was done OR modification was
> >>+ * successful.
> >>+ */
> >>+static int opp_set_availability(struct opp *opp, bool availability_req)
> >>+{
> >>+     struct opp *new_opp, *tmp_opp;
> >>+     bool is_available;
> >>+
> >>+     if (unlikely(!opp || IS_ERR(opp))) {
> >>+             pr_err("%s: Invalid parameters being passed\n", __func__);
> >>+             return -EINVAL;
> >>+     }
> >>+
> >>+     new_opp = kmalloc(sizeof(struct opp), GFP_KERNEL);
> >>+     if (!new_opp) {
> >>+             pr_warning("%s: unable to allocate opp\n", __func__);
> >>+             return -ENOMEM;
> >>+     }
> >>+
> >>+     mutex_lock(&opp->dev_opp->lock);
> >>+
> >>+     rcu_read_lock();
> >>+     tmp_opp = rcu_dereference(opp);
> >>+     is_available = tmp_opp->available;
> >>+     rcu_read_unlock();
> >>+
> >>+     /* Is update really needed? */
> >>+     if (is_available == availability_req) {
> >>+             mutex_unlock(&opp->dev_opp->lock);
> >>+             kfree(tmp_opp);
> >>+             return 0;
> >>+     }
> >>+
> >>+     *new_opp = *opp;
> >>+     new_opp->available = availability_req;
> >>+     list_replace_rcu(&opp->node, &new_opp->node);
> >>+
> >>+     mutex_unlock(&opp->dev_opp->lock);
> >>+     synchronize_rcu();
> >
> >If you decide to rely on reference counts to fix the problem in
> >find_device_opp(), you will need to check the reference counts here.
> >Again, please see Documentation/RCU/rcuref.txt.
> 
> Does the original point about not needing to free dev_opp resolve this?

For the dev_opp case, yes.  However, I believe that my point is still
valid for the opp case.

> >>+     kfree(opp);
> >>+
> >>+     return 0;
> >>+}
> >>+
> >>+/**
> >>+ * opp_enable() - Enable a specific OPP
> >>+ * @opp:     Pointer to opp
> >>+ *
> >>+ * Enables a provided opp. If the operation is valid, this returns 0, else the
> >>+ * corresponding error value. It is meant to be used for users an OPP available
> >>+ * after being temporarily made unavailable with opp_disable.
> >>+ *
> >>+ * Locking: RCU, mutex
> >
> >By "Locking: RCU", you presumably don't mean that the caller must do
> >an rcu_read_lock() -- this would result in a synchronize_rcu() being
> >invoked in an RCU read-side critical section, which is illegal.
> aye..thx. I will make it more verbose. Does the following sound right?
> 
> Locking used internally: RCU copy-update and read_lock used, mutex
> 
> and for the readers:
> 
> Locking used internally: RCU read_lock used
> 
> or do we need to go all verbatim about the implementation here?
> 
> I intended the user to know the context in which they can call it,
> for example, since mutex is used, dont think of using this in
> interrupt context. since read_locks are already used, dont need to
> double lock it.. opp library takes care of it's own exclusivity.

I would stick to the constraints on the caller, and describe the internals
elsewhere, for example, near the data-structure definitions.  But tastes
do vary on this.

> >>+ */
> >>+int opp_enable(struct opp *opp)
> >>+{
> >>+     return opp_set_availability(opp, true);
> >>+}
> >>+
> >>+/**
> >>+ * opp_disable() - Disable a specific OPP
> >>+ * @opp:     Pointer to opp
> >>+ *
> >>+ * Disables a provided opp. If the operation is valid, this returns
> >>+ * 0, else the corresponding error value. It is meant to be a temporary
> >>+ * control by users to make this OPP not available until the circumstances are
> >>+ * right to make it available again (with a call to opp_enable).
> >>+ *
> >>+ * Locking: RCU, mutex
> >
> >Ditto.  (And similar feedback applies elsewhere.)
> >
> >>+ */
> >>+int opp_disable(struct opp *opp)
> >>+{
> >>+     return opp_set_availability(opp, false);
> >>+}
> >>+
> >>+#ifdef CONFIG_CPU_FREQ
> >>+/**
> >>+ * opp_init_cpufreq_table() - create a cpufreq table for a device
> >>+ * @dev:     device for which we do this operation
> >>+ * @table:   Cpufreq table returned back to caller
> >>+ *
> >>+ * Generate a cpufreq table for a provided device- this assumes that the
> >>+ * opp list is already initialized and ready for usage.
> >>+ *
> >>+ * This function allocates required memory for the cpufreq table. It is
> >>+ * expected that the caller does the required maintenance such as freeing
> >>+ * the table as required.
> >>+ *
> >>+ * WARNING: It is  important for the callers to ensure refreshing their copy of
> >>+ * the table if any of the mentioned functions have been invoked in the interim.
> >>+ *
> >>+ * Locking: RCU reader
> >>+ */
> >>+void opp_init_cpufreq_table(struct device *dev,
> >>+                         struct cpufreq_frequency_table **table)
> >>+{
> >>+     struct device_opp *dev_opp;
> >>+     struct opp *opp;
> >>+     struct cpufreq_frequency_table *freq_table;
> >>+     int i = 0;
> >>+
> >>+     dev_opp = find_device_opp(dev);
> >>+     if (IS_ERR(dev_opp)) {
> >>+             pr_warning("%s: unable to find device\n", __func__);
> >>+             return;
> >>+     }
> >>+
> >>+     freq_table = kzalloc(sizeof(struct cpufreq_frequency_table) *
> >>+                          (opp_get_opp_count(dev) + 1), GFP_ATOMIC);
> >>+     if (!freq_table) {
> >>+             pr_warning("%s: failed to allocate frequency table\n",
> >>+                        __func__);
> >>+             return;
> >
> >How does the caller tell that the allocation failed?  Should the caller
> >set the pointer passed in through the "table" argument to NULL before
> >calling this function?  Or should this function set *table to NULL
> >before returning in this case?
> 
> Good catch. Thanks. I would rather change the return to int and pass
> proper errors to caller so that they can handle it appropriately.

Works for me!

							Thanx, Paul



More information about the linux-arm-kernel mailing list