[Patch v6 03/12] docs: qcom: Add qualcomm minidump guide
Bryan O'Donoghue
bryan.odonoghue at linaro.org
Tue Nov 28 05:15:43 PST 2023
On 24/11/2023 22:19, Mukesh Ojha wrote:
> Add the qualcomm minidump guide for the users which tries to cover
> the dependency, API use and the way to test and collect minidump
> on Qualcomm supported SoCs.
>
> Signed-off-by: Mukesh Ojha <quic_mojha at quicinc.com>
> ---
> Documentation/admin-guide/index.rst | 1 +
> Documentation/admin-guide/qcom_minidump.rst | 272 ++++++++++++++++++++++++++++
> 2 files changed, 273 insertions(+)
> create mode 100644 Documentation/admin-guide/qcom_minidump.rst
General nit-pick on sequencing of your patches. Its good practice (tm)
to put your documentation first in your series.
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 43ea35613dfc..251d070486c2 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -120,6 +120,7 @@ configure specific aspects of kernel behavior to your liking.
> perf-security
> pm/index
> pnp
> + qcom_minidump
> rapidio
> ras
> rtc
> diff --git a/Documentation/admin-guide/qcom_minidump.rst b/Documentation/admin-guide/qcom_minidump.rst
> new file mode 100644
> index 000000000000..b492f2b79639
> --- /dev/null
> +++ b/Documentation/admin-guide/qcom_minidump.rst
> @@ -0,0 +1,272 @@
> +Qualcomm minidump feature
> +=========================
> +
> +Introduction
> +------------
> +
> +Minidump is a best effort mechanism to collect useful and predefined
> +data for first level of debugging on end user devices running on
> +Qualcomm SoCs.
What does "first-level debugging" mean here ? You use the term
"post-mortem" later, stick with one term for consistency throughout your
document.
Suggest:
"Minidump is a best-effort mechanism to collect useful predefined data
for post-mortem debugging on a Qualcomm System on Chips (SoCs)."
Generally better to stick with "Qualcomm SoC" and "Minidump" once you
establish the terms upfront and early in your text - instead of
reverting to "device" and "it".
It is built on the premise that System on Chip (SoC)
> +or subsystem part of SoC crashes, due to a range of hardware and
> +software bugs.
Instead of saying "It" say "Minidump"
Hence, the ability to collect accurate data is only
> +a best-effort. The data collected could be invalid or corrupted, data
> +collection itself could fail, and so on.
"and so on" is a redundancy, drop.
Suggest:
Minidump is built on the premise that a hardware or software component
on the SoC has encountered an unexpected fault. This means that the data
collected by minidump cannot be assumed to be correct or even present.
> +
> +Qualcomm devices in engineering mode provides a mechanism for generating
> +full system RAM dumps for post-mortem debugging.
Stick with the established naming convention
"Qualcomm SoCs in engineering mode provide a mechanism for generating
full system RAM dumps for this post-mortem debugging"
But in some cases it's
> +however not feasible to capture the entire content of RAM. The minidump
> +mechanism provides the means for selected region should be included in
> +the ramdump.
Dont' start a sentence with "But"
You're also not being clear what you mean by "the entire content of RAM"
since you obviously can't capture a full snap-shot of DRAM and store in
DRAM.
Better to say IMO
"Minidump captures specific pre-defined regions of RAM and stores those
regions in a reserved Minidump specific buffer."
"The region of RAM used to store Minidump data shall be referred to as
SMEM throughout this document." [1]
> +::
> +
> + +-----------------------------------------------+
> + | DDR +-------------+ |
Instead of saying "DDR" just mark this as RAM or Memory.
Its a terrible nit-pick from me but DDR = "Double Data Rate Synchronous
Dynamic Random-Access Memory" but that's irrelevant to this
specification. We could be living in a gigantic SRAM for argument sake.
> + | | SS0-ToC| |
> + | +----------------+ +----------------+ | |
> + | |Shared memory | | SS1-ToC| | |
> + | |(SMEM) | | | | |
> + | | | +-->|--------+ | | |
> + | |G-ToC | | | SS-ToC \ | | |
> + | |+-------------+ | | | +-----------+ | | |
> + | ||-------------| | | | |-----------| | | |
> + | || SS0-ToC | | | +-|<|SS1 region1| | | |
> + | ||-------------| | | | | |-----------| | | |
> + | || SS1-ToC |-|>+ | | |SS1 region2| | | |
> + | ||-------------| | | | |-----------| | | |
> + | || SS2-ToC | | | | | ... | | | |
> + | ||-------------| | | | |-----------| | | |
> + | || ... | | |-|<|SS1 regionN| | | |
> + | ||-------------| | | | |-----------| | | |
> + | || SSn-ToC | | | | +-----------+ | | |
> + | |+-------------+ | | | | | |
> + | | | | |----------------| | |
> + | | | +>| regionN | | |
> + | | | | |----------------| | |
> + | +----------------+ | | | | |
> + | | |----------------| | |
> + | +>| region1 | | |
> + | |----------------| | |
> + | | | | |
> + | |----------------|-+ |
> + | | region5 | |
> + | |----------------| |
> + | | | |
> + | Region information +----------------+ |
> + | +---------------+ |
> + | |region name | |
> + | |---------------| |
> + | |region address | |
> + | |---------------| |
> + | |region size | |
> + | +---------------+ |
> + +-----------------------------------------------+
> + G-ToC: Global table of contents
> + SS-ToC: Subsystem table of contents
> + SS0-SSn: Subsystem numbered from 0 to n
You need to be more consistent here
SSX-ToC -> SSX-SSn
Where X is an integer from 0 upwards and similarly n is a integer from 0
upwards.
This name SSX-SSn is not especially descriptive, I'm not sure if this is
a name you are choosing here ? If so then let me suggest a new name like
"Subsystem Memory Segment" SSX-MSn
->
G-ToC: Global table of contents
SSX-ToC: Subsystem X table of contents.
X is an integer in the range of 0 to ?
Is there an upper limit ?
Presumably this is an 8, 16, 32 or 64 bit integer
Please define either the size of the integer or the
valid range of values 0..128, 0..256
SSX-MSn: Subsystem numbered from 0 to n
Same comment for the 'n' here.
> +
> +It depends on SoC where the underlying firmware is keeping the
> +minidump global table taking care of subsystem ToC part for
> +minidump like for above diagram, it is for shared memory sitting
> +in DDR and it is shared among various master however it is possible
> +that this could be implemented via memory mapped regions but the
> +general idea should remain same. Here, various subsystem could be
> +DSP's like ADSP/CDSP/MODEM etc, along with Application processor
> +(APSS) where Linux runs.
DSP minidump gets collected when DSP's goes
> +for recovery followed by a crash. The minidump part of code for
> +that resides in ``qcom_rproc_minidump.c``.
This paragraph is difficult to parse.
What you are describing here is a linked list, I think you should have a
paragraph describing how the memory structure works
->
"Minidump determines which areas of DRAM to capture via a Minidump
defined linked-list structure.
At the top level a Global Table of Contents (GTOC) enumerates a variable
number of SubSystem Table Of Contents (SSTOC) structures.
Each SSTOC contains a list of SubSystem Memory Segements which are named
according to the containing SSTOC hence (SSX-MSn) where "X" denotes the
SystemSystem index of the containing SSX-ToC and "n" denotes an
individual Memory Segment within the SystemSystem. Hence SS0-MS0 belongs
to SS0-ToC whereas SS1-MS0 belongs to SS1-ToC."
Then I think you can describe how the crash dump colleciton works and
which agents of the system - DSP ? is responsible for collecting the
crashdump
->
"The Application Processor SubSystem (APSS) runs the Linux kernel and is
therefore not responsible for assembling Minidump data. One of the other
system agents in the SoC will be responsible for capturing the Minidump
data during system reset.
Typically one of the SoC Digital Signal Processors (DSP) will be used
for this purpose.
During reset the DSP will walk the GTOC, SSX-ToCs and SSX-MSns
populating the Minidump RAM area with the indicated memory"
> +
> +
> +SMEM as backend
> +----------------
> +
> +In this document, SMEM will be used as the backend implementation
> +of minidump.
[1] As per the above link, you need to introduce the term SMEM earlier.
It's fine to expand on its meaning later but, do please define it once
upfront before you use it in your awesome ASCII art.
> +The core of minidump feature is part of Qualcomm's boot firmware code.
> +It initializes shared memory (SMEM), which is a part of DDR and
> +allocates a small section of it to minidump table, i.e. also called
> +global table of contents (G-ToC). Each subsystem (APSS, ADSP, ...) has
> +its own table of segments to be included in the minidump, all
> +references from a descriptor in SMEM (G-ToC). Each segment/region has
> +some details like name, physical address and its size etc. and it
> +could be anywhere scattered in the DDR.
->
"The SoC's bootloader must reserve an area of RAM as SMEM prior to
handing over control to the run-time operating system. The bootloader is
responsible to place the GTOC at the starting address of SMEM."
If you want to give more technical details of size, physical address -
then explicitly define those in the section above which talks about the
linked-list structure.
Please try to avoid use of "etc" or "and so on" since it assumes the
reader already knows how the system works and can fill in the blanks
but, what you are doing here is educating a Minidump novice in how
things work.
> +
> +Qualcomm APSS Minidump kernel driver concept
> +--------------------------------------------
> +
> +Qualcomm APSS minidump kernel driver adds the capability to add Linux
So why "Minidump" and then "minidump" choose one.
> +region to be dumped as part of RAM dump collection.
OK so this really is the "meat" of the system. The bootloader/firmware
populates the GTOC.
The Q this document should probably answer is how the kernel driver
knows how/where to place its data.
Assumed to be parsing the DTB.
At the moment,
> +shared memory driver creates platform device for minidump driver and
> +give a means to APSS minidump to initialize itself on probe.
"At the moment" is another drop.
Just make a clear statement
"The minidump platform driver populates the APSS porition of the GTOC"
more interesting to me is - are there defined numbers, identifiers for
the APSS ? or do we just add new entries to the GTOC ?
ie. is there a reserved index or "type" in the GTOC that identifies
where the APSS needs to insert itself ?
> +This driver provides ``qcom_minidump_region_register`` and
> +``qcom_minidump_region_unregister`` API's to register and unregister
> +APSS minidump region.
Why does it do that ? Is it not the case that the driver knows where the
APSS data goes ?
It also supports registration for the clients
> +who came before minidump driver was initialized. It maintains pending
> +list of clients who came before minidump and once minidump is initialized
> +it registers them in one go.
Don't start sentences with "It" -> "The driver" or "Minidump"
As I read this though, the Minidump driver in Linux isn't just
registering / managing the APSS side of things but also "doing stuff"
for other clients ?
How does the Linux driver know what to register ?
> +
> +To simplify post-mortem debugging, driver creates and maintain an ELF
the driver creates and maintains
> +header as first region that gets updated each time a new region gets
> +registered.
as the first region
So - who is registering these regions ? Linux kernel drivers ? aDSP / cDSP ?
If I write a new driver for Venus or GPU can I define my own region(s)
to be captured ?
Presumably. Please give more detail on this.
> +
> +The solution supports extracting the RAM dump/minidump produced either
> +over USB or stored to an attached storage device.
What provides that functionality ? The bootloader ?
How do you trigger / capture that dump from the bootloader ?
No need to go into super-detail but give some idea.
> +
> +Dependency of minidump kernel driver
> +------------------------------------
> +
> +It is to note that whole of minidump depends on Qualcomm boot firmware
> +whether it supports minidump or not.
You can drop this - you've already stated the bootloader/firmware must
setup the initial table so, you're not providing additional information
with this statement.
> So, if the minidump SMEM ID is
Try not to start sentences with "So"
SMEM ID ? This is your first time using this term - please relate it
back to your ASCII diagram and the description you give with that text.
> +present in shared memory, it indicates that minidump is supported from
> +boot firmware and it is possible to dump Linux (APSS) region as part
> +of minidump collection.
If _which_ SMEM ID ?
It seems to me as if we are missing some important information here -
what are the list of SMEM IDs ?
Are these IDs serial and incrementing across SoC versions or SoC specific ?
You need to define that data.
> +How a kernel client driver can register region with minidump
> +------------------------------------------------------------
Answering yes to my earlier question. A driver I write can make use of
the API you are providing here.
Great. Please give some indication of that earlier, even if its a
reference to this description you give here "See X.Y later in this document"
> +
> +Client driver can use ``qcom_minidump_region_register`` API's to register
> +and ``qcom_minidump_region_unregister`` to unregister their region from
> +minidump driver.
> +
> +Client needs to fill their region by filling ``qcom_minidump_region``
> +structure object which consists of the region name, region's virtual
> +and physical address and its size.
Nit pick. You need a definite article here "A client driver" etc.
> +
> +Below, is one sample client driver snippet which tries to allocate a
> +region from kernel heap of certain size and it writes a certain known
> +pattern.
Good
(that can help in verification after collection that we got
> +the exact pattern, what we wrote) and registers it with minidump.
Not necessary to define this. We are all smart here and by now the
intent of the mechanism is defined..
> +
> + .. code-block:: c
> +
> + #include <soc/qcom/qcom_minidump.h>
> + [...]
> +
> +
> + [... inside a function ...]
> + struct qcom_minidump_region region;
> +
> + [...]
> +
> + client_mem_region = kzalloc(region_size, GFP_KERNEL);
> + if (!client_mem_region)
> + return -ENOMEM;
> +
> + [... Just write a pattern ...]
> + memset(client_mem_region, 0xAB, region_size);
> +
> + [... Fill up the region object ...]
> + strlcpy(region.name, "REGION_A", sizeof(region.name));
> + region.virt_addr = client_mem_region;
> + region.phys_addr = virt_to_phys(client_mem_region);
> + region.size = region_size;
> +
> + ret = qcom_minidump_region_register(®ion);
> + if (ret < 0) {
> + pr_err("failed to add region in minidump: err: %d\n", ret);
> + return ret;
> + }
> +
> + [...]
> +
> +
> +Test
> +----
Testing
> +
> +Existing Qualcomm devices already supports entire RAM dump (also called
> +full dump) by writing appropriate value to Qualcomm's top control and
> +status register (tcsr) in ``driver/firmware/qcom_scm.c`` .
"Existing Qualcomm SoCs already support dumping the entire RAM to the
SMEM area/segment/whatever"
This is 100% counter-intuitive since SMEM lives in RAM, correct ?
Full dump means what, a full dump of the APSS RAM ? What happens if SMEM
cannot accommodate the full APSS RAM dump ?
> +
> +SCM device Tree bindings required to support download mode
> +For example (sm8450) ::
> +
> + / {
> +
> + [...]
> +
> + firmware {
> + scm: scm {
> + compatible = "qcom,scm-sm8450", "qcom,scm";
> + [... tcsr register ... ]
> + qcom,dload-mode = <&tcsr 0x13000>;
> +
> + [...]
> + };
> + };
> +
> + [...]
> +
> + soc: soc at 0 {
> +
> + [...]
> +
> + tcsr: syscon at 1fc0000 {
> + compatible = "qcom,sm8450-tcsr", "syscon";
> + reg = <0x0 0x1fc0000 0x0 0x30000>;
> + };
> +
> + [...]
> + };
> + [...]
> +
> + };
> +
> +User of minidump can pass ``qcom_scm.download_mode="mini"`` to kernel
> +commandline to set the current download mode to minidump.
"A kernel command line parameter is provided
``qcom_scm.download_mode="mini"`` to facilitate ... but you aren't
telling us what "minidump" captures "the current download" ? do you mean
the current state ?
Does the system continue to boot up if you pass
qcom_scm.download_mode="mini ? will additional registrations to
SMEM/Minidump work ?
What happens to the minidump data if there is a _subsequent_ real
crashdump ?
Overwritten ?
Also what happens if SMEM runs out of space ? Say I boot with
``qcom_scm.download_mode="mini"`` and then the system crashes - SMEM has
a limit right ?
So the minidump gets overwritten ?
> +Similarly, ``"full"`` is passed to set the download mode to full dump
> +where entire RAM dump will be collected while setting it ``"full,mini"``
> +will collect minidump along with fulldump.
Still not super-clear what the difference between mini and full is here.
> +
> +Writing to sysfs node can also be used to set the mode to minidump::
> +
> + echo "mini" > /sys/module/qcom_scm/parameter/download_mode
> +
> +Once the download mode is set, any kind of crash will make the device collect
> +respective dump as per set download mode.
Nice.
> +
> +Dump collection
> +---------------
> +::
> +
> + +-----------+
> + | |
> + | | +------+
> + | | | |
> + | | +--+---+ Product(Qualcomm SoC)
> + +-----------+ |
> + |+++++++++++|<------------+
> + |+++++++++++| usb cable
> + +-----------+
> + x86_64 PC
> +
> +The solution supports a product running with Qualcomm SoC (where minidump)
> +is supported from the firmware) connected to x86_64 host PC running PCAT
> +tool.
It supports downloading the minidump produced from product to the
> +host PC over USB or to save the minidump to the product attached storage
> +device(UFS/eMMC/SD Card) into minidump dedicated partition.
It would be a good idea to reference this section earlier.
> +
> +By default, dumps are downloaded via USB to the attached x86_64 PC running
> +PCAT (Qualcomm tool) software. Upon download, we will see a set of binary
> +blobs starting with name ``md_*`` in PCAT configured directory in x86_64
> +machine, so for above example from the client it will be ``md_REGION_A.BIN``.
> +This binary blob depends on region content to determine whether it needs
> +external parser support to get the content of the region, so for simple
> +plain ASCII text we don't need any parsing and the content can be seen
> +just opening the binary file.
> +
> +To collect the dump to attached storage type, one needs to write appropriate
> +value to IMEM register, in that case dumps are collected in rawdump
> +partition on the product device itself.
> +
> +One needs to read the entire rawdump partition and pull out content to
> +save it onto the attached x86_64 machine over USB. Later, this rawdump
> +can be passed to another tool (``dexter.exe`` [Qualcomm tool]) which
> +converts this into the similar binary blobs which we have got it when
> +download type was set to USB, i.e. a set of registered regions as blobs
> +and their name starts with ``md_*``.
> +
> +Replacing the ``dexter.exe`` with some open source tool can be added as future
> +scope of this document.
---
bod
More information about the Linux-mediatek
mailing list