[PATCH v4 4/4] thermal: Add Tegra SOCTHERM thermal management driver

Tue Aug 19 07:09:55 PDT 2014

> This adds support for the Tegra SOCTHERM thermal sensing and management
> system found in the Tegra124 system-on-chip. This initial driver supports
> temperature polling for four thermal zones.

I have several comments about this patch. Overall, the code looks clean,
way cleaner than NVIDIA's internal soc_therm driver. I adopted your code
to run on the internal firmware of a Tegra124 SoC. Additionally, I
tested the code as-is on a Jetson TK1.

My test shows that the temperature readings look sane and do vary with
load, so the code seems to work. However, I have some concerns about the
calibration values calculated by this code and the error handling of the
code.

Originally, I thought the fuse offsets were incorrect but as it turns
out, the official Linux kernel starts counting the fuses at a different
location than NVIDIA's internal kernel.

[snip]
> +static int calculate_shared_calibration(struct tsensor_shared_calibration *r)
> +{
> +	u32 val;
> +	u32 shifted_cp, shifted_ft;
> +	int err;
> +
> +	err = tegra_fuse_readl(FUSE_TSENSOR8_CALIB, &val);
> +	if (err)
> +		return err;
> +	r->base_cp = val & FUSE_TSENSOR8_CALIB_CP_TS_BASE_MASK;
> +	r->base_ft = (val & FUSE_TSENSOR8_CALIB_FT_TS_BASE_MASK)
> +		>> FUSE_TSENSOR8_CALIB_FT_TS_BASE_SHIFT;
> +
> +	err = tegra_fuse_readl(FUSE_SPARE_REALIGNMENT_REG_0, &val);
> +	if (err)
> +		return err;
> +	shifted_cp = sign_extend32(val, 5);
> +	val = ((val & FUSE_SPARE_REALIGNMENT_REG_SHIFT_FT_MASK)
> +		>> FUSE_SPARE_REALIGNMENT_REG_SHIFT_FT_SHIFT);
> +	shifted_ft = sign_extend32(val, 4);
> +
> +	r->actual_temp_cp = 2 * NOMINAL_CALIB_CP_T124 + shifted_cp;
> +	r->actual_temp_ft = 2 * NOMINAL_CALIB_FT_T124 + shifted_ft;
> +
> +	return 0;
> +}
> +
> +static int calculate_tsensor_calibration(
> +	struct tegra_tsensor *sensor,
> +	struct tsensor_shared_calibration shared,
> +	u32 *calib
> +)
> +{
> +	u32 val;
> +	s32 actual_tsensor_ft, actual_tsensor_cp;
> +	s32 delta_sens, delta_temp;
> +	s32 mult, div;
> +	s16 therma, thermb;
> +	int err;
> +
> +	err = tegra_fuse_readl(sensor->calib_fuse_offset, &val);
> +	if (err)
> +		return err;
> +
> +	actual_tsensor_cp = (shared.base_cp * 64) + sign_extend32(val, 12);
> +	val = (val & FUSE_TSENSOR_CALIB_FT_TS_BASE_MASK)
> +		>> FUSE_TSENSOR_CALIB_FT_TS_BASE_SHIFT;
> +	actual_tsensor_ft = (shared.base_ft * 32) + sign_extend32(val, 12);
> +
> +	delta_sens = actual_tsensor_ft - actual_tsensor_cp;
> +	delta_temp = shared.actual_temp_ft - shared.actual_temp_cp;
> +
> +	mult = t124_tsensor_config.pdiv * t124_tsensor_config.tsample_ate;
> +	div = t124_tsensor_config.tsample * t124_tsensor_config.pdiv_ate;
> +
> +	therma = div_s64((s64) delta_temp * (1LL << 13) * mult,
> +		(s64) delta_sens * div);
> +	thermb = div_s64(((s64) actual_tsensor_ft * shared.actual_temp_cp) -
> +		((s64) actual_tsensor_cp * shared.actual_temp_ft),
> +		(s64) delta_sens);
> +
> +	therma = div_s64((s64) therma * sensor->fuse_corr_alpha,
> +		(s64) 1000000LL);
> +	thermb = div_s64((s64) thermb * sensor->fuse_corr_alpha +
> +		sensor->fuse_corr_beta,
> +		(s64) 1000000LL);
> +
> +	*calib = ((u16)(therma) << SENSOR_CONFIG2_THERMA_SHIFT) |
> +		((u16)thermb << SENSOR_CONFIG2_THERMB_SHIFT);
> +
> +	return 0;
> +}
> +
> +static int enable_tsensor(struct tegra_soctherm *tegra,
> +			  struct tegra_tsensor *sensor,
> +			  struct tsensor_shared_calibration shared)
> +{
> +	void * __iomem base = tegra->regs + sensor->base;
> +	unsigned int val;
> +	u32 calib;
> +	int err;
> +
> +	err = calculate_tsensor_calibration(sensor, shared, &calib);
> +	if (err)
> +		return err;
> +
> +	val = 0;
> +	val |= t124_tsensor_config.tall << SENSOR_CONFIG0_TALL_SHIFT;
> +	writel(val, base + SENSOR_CONFIG0);
> +
> +	val = 0;
> +	val |= (t124_tsensor_config.tsample - 1) <<
> +		SENSOR_CONFIG1_TSAMPLE_SHIFT;
> +	val |= t124_tsensor_config.tiddq_en << SENSOR_CONFIG1_TIDDQ_EN_SHIFT;
> +	val |= t124_tsensor_config.ten_count << SENSOR_CONFIG1_TEN_COUNT_SHIFT;
> +	val |= SENSOR_CONFIG1_TEMP_ENABLE;
> +	writel(val, base + SENSOR_CONFIG1);
> +
> +	writel(calib, base + SENSOR_CONFIG2);
> +
> +	return 0;
> +}

This code writes different values to SENSOR_CONFIG2 registers than what
the NVIDIA's internal soc_therm driver does. Because the calibration
value calculation should be based on the same fuses that NVIDIA's
internal driver reads, I believe the value calculated and eventually
written to SENSOR_CONFIG2 should be identical with NVIDIA's internal
driver. That is not the case now.

I first loaded the NVIDIA's internal soc_therm driver and then loaded
code adopted from your driver running on Tegra's firmware. Here's a log
of how the CONFIG2 values are changed by this code to different values
than NVIDIA's internal soc_therm driver originally sets them to:

CONFIG2: 5b0f813 -> 5c6f7fb
CONFIG2: 5e8f7cd -> 5fff7b4
CONFIG2: 5bdf7ce -> 5d3f7b5
CONFIG2: 596f831 -> 5aaf81a
CONFIG2: 675f6b5 -> 68cf698
CONFIG2: 6d6f641 -> 6eff623
CONFIG2: 641f72b -> 659f710
CONFIG2: 590f861 -> 5a5f84a

On the left, there's the value set by NVIDIA's internal soc_therm driver
and on the right, there's the value set by code adopted from your
driver. My modifications to the code are minor, and I don't think they
could explain why the CONFIG2 values set are different.

If you want them, I can supply you the values of the fuses on my
development board.

The temperature readings look sane and do vary with load, but I think
they're a bit different than what NVIDIA's internal soc_therm driver
gives. I'm not entirely sure if the calibration values set by your
driver or the calibration values set by NVIDIA's internal soc_therm
driver are correct, but it seems to me that only one of these drivers
can be correct.

The values written to CONFIG1 and CONFIG0 do seem sane.

Since there are divisions being done, some sort of explicit protection
from divisions by zero should perhaps be considered, although this can
happen only if the fuses for some reason read all zeroes. I'm not sure
if that's possible with real hardware.

Where does this code come from? It does not definitely come from
NVIDIA's internal soc_therm driver, as it looks entirely different. And
it calculates different calibration values. If you wrote the code, you
should consider commenting it since there are a lot of complex
calculations going on that are not obvious to the reader. For example,
what do "cp" and "ft" mean? Are they acronyms? If so, they should be
explained.

[snip]
> +	for (i = 0; i < ARRAY_SIZE(tegra->thermctl_tzs); ++i) {
> +		struct tegra_thermctl_zone *zone =
> +			devm_kzalloc(&pdev->dev, sizeof(*zone), GFP_KERNEL);
> +		if (!zone) {
> +			err = -ENOMEM;

Shouldn't this one have a --i line like the next IS_ERR block?

> +			goto unregister_tzs;
> +		}
> +
> +		zone->temp_reg = tegra->regs + thermctl_temp_offsets[i];
> +		zone->temp_shift = thermctl_temp_shifts[i];
> +
> +		tz = thermal_zone_of_sensor_register(
> +			&pdev->dev, i, zone, tegra_thermctl_get_temp, NULL);
> +		if (IS_ERR(tz)) {
> +			err = PTR_ERR(tz);
> +			dev_err(&pdev->dev, "failed to register sensor: %d\n",
> +				err);
> +			--i;
> +			goto unregister_tzs;
> +		}
> +
> +		tegra->thermctl_tzs[i] = tz;
> +	}