[PATCH AUTOSEL 6.17-5.4] spi: xilinx: increase number of retries before declaring stall

Thu Nov 20 04:08:11 PST 2025

From: Alvaro Gamez Machado <alvaro.gamez at hazent.com>

[ Upstream commit 939edfaa10f1d22e6af6a84bf4bd96dc49c67302 ]

SPI devices using a (relative) slow frequency need a larger time.

For instance, microblaze running at 83.25MHz and performing a
3 bytes transaction using a 10MHz/16 = 625kHz needed this stall
value increased to at least 20. The SPI device is quite slow, but
also is the microblaze, so set this value to 32 to give it even
more margin.

Signed-off-by: Alvaro Gamez Machado <alvaro.gamez at hazent.com>
Reviewed-by: Ricardo Ribalda <ribalda at chromium.org>
Link: https://patch.msgid.link/20251106134545.31942-1-alvaro.gamez@hazent.com
Signed-off-by: Mark Brown <broonie at kernel.org>
Signed-off-by: Sasha Levin <sashal at kernel.org>
---

LLM Generated explanations, may be completely bogus:

## COMPREHENSIVE ANALYSIS: spi: xilinx: increase number of retries
before declaring stall

### 1. COMMIT MESSAGE ANALYSIS

The commit message clearly describes a real-world problem:
- **Problem**: SPI devices using slow frequencies need more time before
  being declared stalled
- **Specific case**: Microblaze CPU at 83.25MHz performing 3-byte
  transaction at 625kHz SPI frequency
- **Evidence**: Author empirically determined the value needed to be at
  least 20; chose 32 for safety margin
- **Tags**: Has "Reviewed-by" and "Link" tags, but **NO "Fixes:" or "Cc:
  stable at vger.kernel.org" tags**

### 2. DEEP CODE RESEARCH

#### A. HOW THE BUG WAS INTRODUCED:

The stall detection mechanism was originally introduced in **commit
5a1314fa697fc6** (2017-11-21) by Ricardo Ribalda. That commit added
stall detection logic with `stalled = 10` as a retry counter to detect
when the Xilinx SPI core gets permanently stuck due to unknown commands
not in the hardware lookup table.

The original commit message explained:
> "When the core is configured in C_SPI_MODE > 0, it integrates a lookup
table that automatically configures the core in dual or quad mode based
on the command (first byte on the tx fifo). Unfortunately, that list
mode_?_memoy_*.mif does not contain all the supported commands by the
flash... this leads into a stall that can only be recovered with a soft
reset."

The original commit **was tagged with "Cc: stable at vger.kernel.org"** and
was successfully backported to all major stable branches (verified
present in 6.6.y, 6.1.y, 5.15.y, 5.10.y, 4.19.y).

#### B. DETAILED CODE ANALYSIS:

The code is in `xilinx_spi_txrx_bufs()` function at line 303:

```c
rx_words = n_words;
stalled = 10;  // ← Original value too low for slow configurations
while (rx_words) {
    if (rx_words == n_words && !(stalled--) &&
        !(sr & XSPI_SR_TX_EMPTY_MASK) &&
        (sr & XSPI_SR_RX_EMPTY_MASK)) {
        dev_err(&spi->dev,
            "Detected stall. Check C_SPI_MODE and C_SPI_MEMORY\n");
        xspi_init_hw(xspi);
        return -EIO;  // ← Transaction fails
    }
    // ... read data from RX FIFO ...
}
```

**What the buggy code does:**
- Initializes `stalled` counter to 10
- In the polling loop, decrements `stalled` on each iteration when no
  data is available yet
- If counter reaches 0 and RX FIFO is still empty (but TX FIFO not
  empty), declares a hardware stall
- Returns -EIO, causing the SPI transaction to fail

**Why it's wrong:**
- The value 10 was chosen empirically in 2017, but it doesn't account
  for **very slow SPI frequencies combined with slow CPUs**
- With 625kHz SPI clock and 83.25MHz CPU, the legitimate data transfer
  time can exceed 10 polling loop iterations
- This causes **false positive stall detection** - the hardware is
  working correctly, just slowly

**The race condition:**
- CPU polling loop runs much faster than slow SPI hardware can transfer
  data
- Counter expires before legitimate data arrives in RX FIFO
- Hardware appears "stalled" when it's actually just slow

#### C. THE FIX EXPLAINED:

The fix changes line 303 from:
```c
stalled = 10;  // Old: Too short for slow configurations
```
to:
```c
stalled = 32;  // New: More margin for slow hardware
```

**Why this solves the problem:**
- Gives slow SPI transactions 3.2x more time before declaring a stall
- Author tested with real hardware (microblaze + 625kHz SPI) and found
  20 was sufficient
- Chose 32 to provide additional safety margin
- Still provides stall detection for genuine hardware hangs (from the
  original 2017 issue)

**Subtle implications:**
- Genuine hardware stalls will take slightly longer to detect (32 loop
  iterations vs 10)
- This is acceptable because: (1) genuine stalls are rare, (2) the delay
  is still very short (microseconds to milliseconds), (3) avoiding false
  positives is more important for reliability

### 3. SECURITY ASSESSMENT

**No security implications.** This is a timing/reliability issue, not a
security vulnerability. No CVE, no memory safety issues, no privilege
escalation.

### 4. FEATURE VS BUG FIX CLASSIFICATION

**This is definitively a BUG FIX:**
- Fixes incorrect hardcoded timeout value
- The original value causes false stall detection on slow hardware
- Subject uses "increase" rather than "fix", but the intent is clearly
  to fix a problem
- **NOT a feature addition** - doesn't add new functionality, just
  corrects an inadequate constant

**Does NOT fall under exception categories** (device IDs, quirks, DT,
build fixes), but doesn't need to - it's a straightforward bug fix.

### 5. CODE CHANGE SCOPE ASSESSMENT

- **Files changed**: 1 (drivers/spi/spi-xilinx.c)
- **Lines changed**: 1 (1 insertion, 1 deletion)
- **Complexity**: Trivial - changes a single integer constant
- **Scope**: Extremely localized - only affects stall detection timeout

**Assessment**: Minimal scope, surgical fix. This is the ideal type of
change for stable trees.

### 6. BUG TYPE AND SEVERITY

- **Type**: False positive error detection / incorrect timeout
- **Manifestation**: SPI transactions fail with -EIO error on slow
  hardware configurations
- **Severity**: **MEDIUM**
  - Not a crash, panic, or data corruption
  - Not a security issue
  - Causes service disruption: SPI devices fail to work properly
  - Affects specific configurations (slow SPI + slow CPU)
  - User-visible impact: Device failures, spurious error messages

### 7. USER IMPACT EVALUATION

**Who is affected:**
- Users of Xilinx SPI controllers (common in AMD/Xilinx FPGA-based
  embedded systems)
- Particularly affects:
  - Microblaze soft-core CPU users
  - Systems with slow SPI clock frequencies (< 1MHz)
  - Embedded systems with constrained CPU speeds
- Examples: Industrial control systems, FPGA-based embedded devices,
  custom hardware with Xilinx IP cores

**Impact scope:**
- **Moderate user base**: Xilinx SPI is used in FPGA designs but not as
  widespread as generic SPI controllers
- **High impact for affected users**: Complete SPI failure for those
  with slow configurations
- **Verifiable issue**: Author has real hardware that triggers this bug

**Call analysis**: The `xilinx_spi_txrx_bufs()` function is the main
data transfer function called for every SPI transaction on affected
hardware. It's in the critical path for all SPI operations.

### 8. REGRESSION RISK ANALYSIS

**Risk level: VERY LOW**

**Reasons:**
1. **Minimal change**: Single constant value modification
2. **Direction of change**: Increasing a timeout is inherently safer
   than decreasing it
3. **Tested**: Author has tested with real hardware
4. **Reviewed**: Ricardo Ribalda (the original stall detection author!)
   reviewed it
5. **Backward compatible**: Doesn't change behavior for properly
   functioning hardware
6. **Limited scope**: Only affects stall detection timing, not data path
   logic

**Potential risks:**
- Genuine hardware stalls detected slightly later (32 iterations vs 10)
  - Mitigation: Still quick enough (microseconds), rare occurrence
- None identified that would break working systems

### 9. MAINLINE STABILITY

- **Commit date**: November 7, 2024 (appears as 2025 but likely 2024)
- **Testing evidence**:
  - Reviewed-by: Ricardo Ribalda (original stall detection author)
  - Author tested with real hardware (microblaze + slow SPI)
- **Mainline presence**: In v6.18-rc6 (recent mainline development
  kernel)
- **Time in mainline**: Very recent - approximately 2 weeks if November
  2024

**Concern**: This is quite recent. Ideally would benefit from more time
in mainline (several weeks to months) before backporting.

### 10. DEPENDENCY CHECK

**No dependencies identified:**
- Change is self-contained (single constant)
- Doesn't depend on other commits
- Doesn't require API changes
- The stall detection code (commit 5a1314fa697fc6 from 2017) is already
  present in all stable branches
- Will apply cleanly to any stable branch that has the original stall
  detection code

**Verified**: The original stall detection code is present in:
- ✅ stable/linux-6.6.y
- ✅ stable/linux-6.1.y
- ✅ stable/linux-5.15.y
- ✅ stable/linux-5.10.y
- ✅ stable/linux-4.19.y

### 11. PRACTICAL VS THEORETICAL

**Highly practical:**
- Author encountered this bug with real hardware (microblaze @ 83.25MHz,
  SPI @ 625kHz)
- Provides specific reproduction case
- Not a theoretical race or corner case
- Users with slow SPI configurations will hit this reliably

### BACKPORT DECISION ANALYSIS

#### Alignment with Stable Kernel Rules:

1. **Obviously correct**: ✅ **YES**
   - Trivial change (increase timeout constant)
   - Reviewed by the original stall detection author
   - Logic is straightforward

2. **Fixes real bug**: ✅ **YES**
   - False stall detection on slow hardware
   - Author has reproducible case
   - Affects real users

3. **Important issue**: ⚠️ **MODERATE**
   - Causes SPI transaction failures
   - NOT a crash, security issue, or data corruption
   - Affects specific configurations (slow SPI/CPU)
   - Severity: Service disruption for affected users

4. **Small and contained**: ✅ **YES**
   - Single line change
   - No architectural changes
   - Minimal scope

5. **No new features**: ✅ **YES**
   - Pure bug fix
   - No new APIs or functionality

6. **Clean application**: ✅ **YES**
   - Should apply cleanly to all stable branches
   - Original code is present in all major LTS kernels

#### Considerations:

**In favor of backporting:**
- Extremely low regression risk (1 line, timeout increase)
- Fixes real-world bug with reproducible case
- Reviewed by original code author
- Small, obviously correct change
- Helps users with slow SPI configurations
- Original stall detection is already in stable branches

**Against backporting or requiring caution:**
- No "Cc: stable at vger.kernel.org" tag (maintainer didn't explicitly
  request it)
- No "Fixes:" tag (though logically fixes commit 5a1314fa697fc6)
- Very recent in mainline (limited testing time)
- Moderate severity (not critical crash/security/corruption)
- Affects niche configurations (slow SPI on Xilinx hardware)

#### Risk vs Benefit Trade-off:

**Benefit**:
- Enables SPI functionality for slow Xilinx configurations
- Prevents spurious transaction failures
- Minimal user-visible change (just works better)

**Risk**:
- Virtually none - increasing timeout is safe
- Genuine stalls detected marginally slower (acceptable)

**Trade-off**: **Favorable** - High benefit for affected users,
negligible risk

### FINAL ASSESSMENT

This commit represents a **textbook stable backport candidate** from a
technical perspective:
- Trivial, obviously correct change
- Fixes a real bug affecting actual users
- Zero risk of regression
- Small, contained scope

However, it has **two procedural concerns**:
1. **No explicit stable tag**: The maintainer/author didn't tag it with
   "Cc: stable at vger.kernel.org", suggesting they may not have considered
   it critical enough for backporting
2. **Very recent**: Limited testing time in mainline (approximately 2
   weeks)

The absence of the stable tag is notable because the **original stall
detection commit (5a1314fa697fc6) DID have the stable tag**, indicating
the original author (Ricardo Ribalda) understood stable backporting.
Ricardo also reviewed this current patch, so his review implies approval
of the technical approach.

Despite the lack of an explicit stable tag, the technical merits
strongly support backporting:
- It's an adjustment to an already-backported feature
- The change is trivial and safe
- It fixes a real problem for a subset of users
- Regression risk is negligible

The commit should be backported to **all stable branches that contain
the original stall detection code** (5a1314fa697fc6), which includes at
minimum: 6.6.y, 6.1.y, 5.15.y, 5.10.y, and 4.19.y.

**Recommendation**: This would benefit from a few more weeks in mainline
for additional testing, but can be safely backported. The lack of a
stable tag is the only significant concern, but the technical merits
outweigh this procedural consideration.

**YES**

 drivers/spi/spi-xilinx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/spi/spi-xilinx.c b/drivers/spi/spi-xilinx.c
index d59cc8a184846..c86dc56f38b45 100644
--- a/drivers/spi/spi-xilinx.c
+++ b/drivers/spi/spi-xilinx.c
@@ -300,7 +300,7 @@ static int xilinx_spi_txrx_bufs(struct spi_device *spi, struct spi_transfer *t)
 
 		/* Read out all the data from the Rx FIFO */
 		rx_words = n_words;
-		stalled = 10;
+		stalled = 32;
 		while (rx_words) {
 			if (rx_words == n_words && !(stalled--) &&
 			    !(sr & XSPI_SR_TX_EMPTY_MASK) &&
-- 
2.51.0