[PATCH v8 2/3] CMDQ: Mediatek CMDQ driver

Fri Jun 17 08:57:53 PDT 2016

On 17/06/16 10:28, Horng-Shyang Liao wrote:
> Hi Matthias,
>
> On Tue, 2016-06-14 at 20:07 +0800, Horng-Shyang Liao wrote:
>> Hi Matthias,
>>
>> On Tue, 2016-06-14 at 12:17 +0200, Matthias Brugger wrote:
>>>
>>> On 14/06/16 09:44, Horng-Shyang Liao wrote:
>>>> Hi Matthias,
>>>>
>>>> On Wed, 2016-06-08 at 17:35 +0200, Matthias Brugger wrote:
>>>>>
>>>>> On 08/06/16 14:25, Horng-Shyang Liao wrote:
>>>>>> Hi Matthias,
>>>>>>
>>>>>> On Wed, 2016-06-08 at 12:45 +0200, Matthias Brugger wrote:
>>>>>>>
>>>>>>> On 08/06/16 07:40, Horng-Shyang Liao wrote:
>>>>>>>> Hi Matthias,
>>>>>>>>
>>>>>>>> On Tue, 2016-06-07 at 18:59 +0200, Matthias Brugger wrote:
>>>>>>>>>
>>>>>>>>> On 03/06/16 15:11, Matthias Brugger wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +            smp_mb(); /* modify jump before enable thread */
>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +        cmdq_thread_writel(thread, task->pa_base +
>>>>>>>>>>>>>>>>>>> task->command_size,
>>>>>>>>>>>>>>>>>>> +                   CMDQ_THR_END_ADDR);
>>>>>>>>>>>>>>>>>>> +        cmdq_thread_resume(thread);
>>>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>>>> +    list_move_tail(&task->list_entry, &thread->task_busy_list);
>>>>>>>>>>>>>>>>>>> +    spin_unlock_irqrestore(&cmdq->exec_lock, flags);
>>>>>>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +static void cmdq_handle_error_done(struct cmdq *cmdq,
>>>>>>>>>>>>>>>>>>> +                   struct cmdq_thread *thread, u32 irq_flag)
>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>> +    struct cmdq_task *task, *tmp, *curr_task = NULL;
>>>>>>>>>>>>>>>>>>> +    u32 curr_pa;
>>>>>>>>>>>>>>>>>>> +    struct cmdq_cb_data cmdq_cb_data;
>>>>>>>>>>>>>>>>>>> +    bool err;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +    if (irq_flag & CMDQ_THR_IRQ_ERROR)
>>>>>>>>>>>>>>>>>>> +        err = true;
>>>>>>>>>>>>>>>>>>> +    else if (irq_flag & CMDQ_THR_IRQ_DONE)
>>>>>>>>>>>>>>>>>>> +        err = false;
>>>>>>>>>>>>>>>>>>> +    else
>>>>>>>>>>>>>>>>>>> +        return;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +    curr_pa = cmdq_thread_readl(thread, CMDQ_THR_CURR_ADDR);
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +    list_for_each_entry_safe(task, tmp, &thread->task_busy_list,
>>>>>>>>>>>>>>>>>>> +                 list_entry) {
>>>>>>>>>>>>>>>>>>> +        if (curr_pa >= task->pa_base &&
>>>>>>>>>>>>>>>>>>> +            curr_pa < (task->pa_base + task->command_size))
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What are you checking here? It seems as if you make some implcit
>>>>>>>>>>>>>>>>>> assumptions about pa_base and the order of execution of
>>>>>>>>>>>>>>>>>> commands in the
>>>>>>>>>>>>>>>>>> thread. Is it save to do so? Does dma_alloc_coherent give any
>>>>>>>>>>>>>>>>>> guarantees
>>>>>>>>>>>>>>>>>> about dma_handle?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. Check what is the current running task in this GCE thread.
>>>>>>>>>>>>>>>>> 2. Yes.
>>>>>>>>>>>>>>>>> 3. Yes, CMDQ doesn't use iommu, so physical address is continuous.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, physical addresses might be continous, but AFAIK there is no
>>>>>>>>>>>>>>>> guarantee that the dma_handle address is steadily growing, when
>>>>>>>>>>>>>>>> calling
>>>>>>>>>>>>>>>> dma_alloc_coherent. And if I understand the code correctly, you
>>>>>>>>>>>>>>>> use this
>>>>>>>>>>>>>>>> assumption to decide if the task picked from task_busy_list is
>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>> executing. So I think this mecanism is not working.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't use dma_handle address, and just use physical addresses.
>>>>>>>>>>>>>>>        From CPU's point of view, tasks are linked by the busy list.
>>>>>>>>>>>>>>>        From GCE's point of view, tasks are linked by the JUMP command.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In which cases does the HW thread raise an interrupt.
>>>>>>>>>>>>>>>> In case of error. When does CMDQ_THR_IRQ_DONE get raised?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> GCE will raise interrupt if any task is done or error.
>>>>>>>>>>>>>>> However, GCE is fast, so CPU may get multiple done tasks
>>>>>>>>>>>>>>> when it is running ISR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In case of error, that GCE thread will pause and raise interrupt.
>>>>>>>>>>>>>>> So, CPU may get multiple done tasks and one error task.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think we should reimplement the ISR mechanism. Can't we just read
>>>>>>>>>>>>>> CURR_IRQ_STATUS and THR_IRQ_STATUS in the handler and leave
>>>>>>>>>>>>>> cmdq_handle_error_done to the thread_fn? You will need to pass
>>>>>>>>>>>>>> information from the handler to thread_fn, but that shouldn't be an
>>>>>>>>>>>>>> issue. AFAIK interrupts are disabled in the handler, so we should stay
>>>>>>>>>>>>>> there as short as possible. Traversing task_busy_list is expensive, so
>>>>>>>>>>>>>> we need to do it in a thread context.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Actually, our initial implementation is similar to your suggestion,
>>>>>>>>>>>>> but display needs CMDQ to return callback function very precisely,
>>>>>>>>>>>>> else display will drop frame.
>>>>>>>>>>>>> For display, CMDQ interrupt will be raised every 16 ~ 17 ms,
>>>>>>>>>>>>> and CMDQ needs to call callback function in ISR.
>>>>>>>>>>>>> If we defer callback to workqueue, the time interval may be larger than
>>>>>>>>>>>>> 32 ms.sometimes.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think the problem is, that you implemented the workqueue as a ordered
>>>>>>>>>>>> workqueue, so there is no parallel processing. I'm still not sure why
>>>>>>>>>>>> you need the workqueue to be ordered. Can you please explain.
>>>>>>>>>>>
>>>>>>>>>>> The order should be kept.
>>>>>>>>>>> Let me use mouse cursor as an example.
>>>>>>>>>>> If task 1 means move mouse cursor to point A, task 2 means point B,
>>>>>>>>>>> and task 3 means point C, our expected result is A -> B -> C.
>>>>>>>>>>> If the order is not kept, the result could become A -> C -> B.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Got it, thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think a way to get rid of the workqueue is to use a timer, which gets
>>>>>>>>> programmed to the time a timeout in the first task in the busy list
>>>>>>>>> would happen. Everytime we update the busy list (e.g. because of task
>>>>>>>>> got finished by the thread), we update the timer. When the timer
>>>>>>>>> triggers, which hopefully won't happen too often, we return timeout on
>>>>>>>>> the busy list elements, until the time is lower then the actual time.
>>>>>>>>>
>>>>>>>>> At least with this we can reduce the data structures in this driver and
>>>>>>>>> make it more lightweight.
>>>>>>>>
>>>>>>>>     From my understanding, your proposed method can handle timeout case.
>>>>>>>>
>>>>>>>> However, the workqueue is also in charge of releasing tasks.
>>>>>>>> Do you take releasing tasks into consideration by using the proposed
>>>>>>>> timer method?
>>>>>>>> Furthermore, I think the code will become more complex if we also use
>>>>>>>> timer to implement releasing tasks.
>>>>>>>>
>>>>>>>
>>>>>>> Can't we call
>>>>>>>             clk_disable_unprepare(cmdq->clock);
>>>>>>>             cmdq_task_release(task);
>>>>>>> after invoking the callback?
>
> After I put clk_disable_unprepare(cmdq->clock) into ISR, I encounter
> another BUG.
>
> (Quote some Linux 4.7 source code.)
>
>   605 void clk_unprepare(struct clk *clk)
>   606 {
>   607         if (IS_ERR_OR_NULL(clk))
>   608                 return;
>   609
>   610         clk_prepare_lock();                      // <-- Here
>   611         clk_core_unprepare(clk->core);
>   612         clk_prepare_unlock();
>   613 }
>   614 EXPORT_SYMBOL_GPL(clk_unprepare);
>
>    91 static void clk_prepare_lock(void)
>    92 {
>    93         if (!mutex_trylock(&prepare_lock)) {     // <-- Here
>    94                 if (prepare_owner == current) {
>    95                         prepare_refcnt++;
>    96                         return;
>    97                 }
>    98                 mutex_lock(&prepare_lock);
>    99         }
>   100         WARN_ON_ONCE(prepare_owner != NULL);
>   101         WARN_ON_ONCE(prepare_refcnt != 0);
>   102         prepare_owner = current;
>   103         prepare_refcnt = 1;
>   104 }
>
> So, 'unprepare' can sleep and cannot be put into ISR.
> I also try to put it into a timer, but the error is the same
> since timer callback is executed by softirq.
>
> We need clk_disable_unprepare() since it can save power consumption
> in idle.

We can call clk_prepare in probe and then use clk_enable/clk_disable, 
which don't sleep.

Regards,
Matthias

> Therefore, I plan to
> (1) move releasing buffer and task into ISR,
> (2) move timeout into timer, and
> (3) keep workqueue for clk_disable_unprepare().
>
> What do you think?
>
> Thanks,
> HS
>
>>>>>>
>>>>>> Do you mean just call these two functions in ISR?
>>>>>> My major concern is dma_free_coherent() and kfree() in
>>>>>> cmdq_task_release(task).
>>>>>
>>>>> Why do we need the dma calls at all? Can't we just calculate the
>>>>> physical address using __pa(x)?
>>>>
>>>> I prefer to use dma_map_single/dma_unmap_single.
>>>>
>>>
>>> Can you please elaborate why you need this. We don't do dma, so we
>>> should not use dma memory for this.
>>
>> We need a buffer to share between CPU and GCE, so we do need DMA.
>> CPU is in charge of writing GCE commands into this buffer.
>> GCE is in charge of reading and running GCE commands from this buffer.
>> When we chain CMDQ tasks, we also need to modify GCE JUMP command.
>> Therefore, I prefer to use dma_alloc_coherent and dma_free_coherent.
>>
>> However, if we want to use timer to handle timeout, we need to release
>> memory in ISR.
>> In this case, using kmalloc/kfree + dma_map_single/dma_unmap_single
>> instead of dma_alloc_coherent/dma_free_coherent is an alternative
>> solution, but taking care the synchronization between cache and memory
>> is the expected overhead.
>>
>>>>>> Therefore, your suggestion is to use GFP_ATOMIC for both
>>>>>> dma_alloc_coherent() and kzalloc(). Right?
>>>>>
>>>>> I don't think we need GFP_ATOMIC, the critical path will just free the
>>>>> memory.
>>>>
>>>> I tested these two functions, and kfree was safe.
>>>> However, dma_free_coherent raised BUG.
>>>> BUG: failure at
>>>> /mnt/host/source/src/third_party/kernel/v3.18/mm/vmalloc.c:1514/vunmap()!
>>>
>>> Just a general hint. Please try to evaluate on a recent kernel. It looks
>>> like as if you tried this on a v3.18 based one.
>>
>> This driver should be backward compatible to v3.18 for a MTK project.
>>
>>> Best regards,
>>> Matthias
>>
>> Thanks,
>> HS
>>
>>>> 1512 void vunmap(const void *addr)
>>>> 1513 {
>>>> 1514         BUG_ON(in_interrupt());		// <-- here
>>>> 1515         might_sleep();
>>>> 1516         if (addr)
>>>> 1517                 __vunmap(addr, 0);
>>>> 1518 }
>>>> 1519 EXPORT_SYMBOL(vunmap);
>>>>
>>>> Therefore, I plan to use kmalloc + dma_map_single instead of
>>>> dma_alloc_coherent, and dma_unmap_single + kfree instead of
>>>> dma_free_coherent.
>>>>
>>>> What do you think about the function replacement?
>>>>
>>>>>> If so, I can try to implement timeout by timer, and discuss with you
>>>>>> if I have further questions.
>>>>>>
>>>>>
>>>>> Sounds good :)
>>>>>
>>>>> Thanks,
>>>>> Matthias
>>>>
>>>> Thanks,
>>>> HS
>>>>
>>>>>>> Regrading the clock, wouldn't it be easier to handle the clock
>>>>>>> enable/disable depending on the state of task_busy_list? I suppose we
>>>>>>> can't as we would need to check the task_busy_list of all threads, right?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Matthias
>>>>>>
>>>>>> Thanks,
>>>>>> HS
>>>>>>
>>>>
>>>>
>>
>>
>
>