
NVIDIA Issues Hotfix for GPU Driver’s Overheating Issue

Yesterday, Nvidia rushed out a critical hotfix to contain the fallout from an earlier driver release that had alarmed the AI and gaming communities by reporting falsely safe GPU temperatures, even as cooling demands quietly climbed toward potentially critical levels.

In Nvidia's official post about the hotfix release, although it is only third in the list of fixes, the problem is described as 'GPU monitoring tools can stop reporting the GPU temperature after PC wakes up from sleep'.

Shortly after the affected Game Ready driver 576.02 was rolled out, a thread in the Stable Diffusion subreddit, titled 'Read to save your GPU!', became a hub for anecdotal problems and updates reported by users of the new driver. A rough timeline of the emerging problems can be pieced together from these and other reports around the internet.

The first Reddit report of the bug seems to have appeared late on Friday afternoon UTC, in the ZephyrusG14 subreddit, where the user fricy81 quoted a post on the Nvidia forums (archived):

A user on the Nvidia forums describes problems after updating to 576.02. Source: https://www.nvidia.com/en-us/Force/Forums/Game-ready-drivers/13/563010/geforce-Grd-57602-feedback-threatreased-41625/3524072/

The user on the Nvidia forums reported that after installing the driver update, tools such as MSI Afterburner and in-game monitors such as the one in Call of Duty (which generally have access to native system values, as does the GPU panel in Windows Task Manager) stopped updating GPU temperature values, which froze at around 35-36°C.

Restarting the monitoring software had no effect, the user explained, and only a full system restart would restore accurate readings. Tools such as HWiNFO and Nvidia's own monitoring app continued to report temperatures correctly. The user emphasized that the problem occurred during normal use, not only after the system had been woken from sleep.
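The symptom described above, a readout that sits flat at 35-36°C regardless of what the GPU is doing, is something a monitoring script can check for. The following is a minimal Python sketch rather than anything taken from the tools mentioned: the function name `looks_frozen` and its thresholds are illustrative assumptions, and the samples are assumed to come from whichever monitoring source the user already has.

```python
def looks_frozen(temps_c, load_pct, min_samples=10,
                 temp_spread_c=0.5, busy_load_pct=50.0):
    """Return True if the reported temperature stays flat while the GPU is busy.

    temps_c  -- reported temperatures in degrees Celsius, oldest first
    load_pct -- GPU utilization percentages sampled at the same moments
    All thresholds are hypothetical defaults, not vendor-specified values.
    """
    # Not enough data, or mismatched series: make no claim either way.
    if len(temps_c) < min_samples or len(temps_c) != len(load_pct):
        return False

    recent_temps = temps_c[-min_samples:]
    recent_load = load_pct[-min_samples:]

    # A healthy sensor under load drifts by more than a fraction of a degree.
    flat = max(recent_temps) - min(recent_temps) <= temp_spread_c
    busy = sum(recent_load) / len(recent_load) >= busy_load_pct
    return flat and busy


# A reading stuck at 36°C while utilization sits at 95% is flagged.
print(looks_frozen([36.0] * 12, [95.0] * 12))
```

A flat reading on an idle GPU is not flagged, since a genuinely idle card can hold a steady temperature; the check only fires when the flat reading coincides with sustained load.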

User feedback across various forums pointed to a general disruption of normal fan curve behavior and a change in the core's thermal regulation, resulting in idle GPUs running at unexpectedly high temperatures, and in alarming overheating under what would usually be considered standard operational loads, as detailed in this comment:

‘I could tell that something was off. The weather outside was probably about 55°F / 12°C, but I was cooking alive in my room. My window was open, and yet I couldn’t feel a difference. All fans were running at maximum and the temps looked fine at first, going after a while from 68°C to 72°C.

‘In the beginning it seemed normal, until the next morning, when I realized [kicking].

‘I had done some AI overclocking after fixing a few things recently, so I wasn’t sure whether the values were just too high. It happened once before, when installing ASUS AI Suite 3; the BIOS settings would not even work properly with it.

‘Anyway, I went back to an older driver for the time being.’

Suboptimal

The official release notes PDF for the 576.02 driver update offers some clues about changes that may have contributed to the new problems. In section 5.5, Nvidia acknowledges that GPU temperature may be reported incorrectly on Nvidia Optimus systems, specifically as zero degrees when no applications are running.

Section 5.5 of the official 576.02 release notes addresses temperature monitoring problems that seem to have affected a wider range of systems than Optimus systems alone. Source: https://us.download.nvidia.com/windows/576.02/576.02-win11-win10-Rease-notes.pdf

The release notes state:

5.5 GPU Temperature Reported Incorrectly on Optimus Systems

5.5.1 Issue

On Optimus systems, temperature-reporting tools such as Speccy or GPU-Z report that the NVIDIA GPU temperature is zero when no applications are running.

5.5.2 Explanation

On Optimus systems, when the NVIDIA GPU is not being used, it is put into a low-power state. This causes temperature-reporting tools to return incorrect values. Waking up the GPU to query the temperature would result in meaningless measurements, because the GPU temperature changes as a result.

These tools will report accurate temperatures only when the GPU is awake and running.

NVIDIA Optimus is a GPU-switching technology that switches between integrated and discrete graphics based on application demands, automatically balancing performance against power consumption in order to preserve battery life. For tasks such as gaming or HD video playback, Optimus activates the discrete GPU for better performance; during lighter activities such as web browsing, it falls back to the integrated (built-in) graphics.

The update seems to have extended a behavior previously limited to Optimus systems, so that affected GPUs can enter a low-power state while idle even when they are not hosted on an Optimus system, which in turn disrupts temperature reporting in third-party tools.
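Since the release notes describe a sleeping GPU as producing a reading of zero rather than a real temperature, a script that logs or reacts to temperatures may want to treat such values as 'no reading' instead of as a cool GPU. Below is a small Python sketch of that idea; the function name `sanitize_temp` and the plausibility bounds are assumptions chosen for illustration, not values published by Nvidia.

```python
from typing import Optional


def sanitize_temp(raw_c: Optional[float],
                  min_plausible_c: float = 1.0,
                  max_plausible_c: float = 120.0) -> Optional[float]:
    """Return the temperature in Celsius, or None if the value is not credible.

    A raw value of zero is what the release notes say monitoring tools
    show when the GPU is in its low-power state, so it is mapped to None.
    The bounds are hypothetical and should be adjusted per system.
    """
    if raw_c is None:
        return None
    if raw_c < min_plausible_c or raw_c > max_plausible_c:
        return None
    return float(raw_c)


# A zero from a sleeping GPU becomes "no reading" rather than "0 degrees".
print(sanitize_temp(0))     # None
print(sanitize_temp(64))    # 64.0
```

Returning `None` forces the calling code to handle the missing value explicitly, rather than letting a bogus zero flow into averages, alerts, or fan control.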

Risk Adjustment

In most scenarios it is fair to say that the graphics card's VBIOS would probably have prevented permanent GPU damage. The VBIOS enforces thermal and power limits at the firmware level, regardless of the driver.

Therefore, even if a driver caused incorrect fan behavior or misreported temperatures, the VBIOS should still throttle performance, ramp up fan activity, or else shut down the GPU to prevent hardware failure.

That does not mean the risk was trivial: sustained high temperatures can degrade performance over time or stress adjacent components. In addition, absent a general awareness that an updated driver has caused a problem (not least on systems where drivers update 'silently'), an issue of this nature can mislead a large share of affected users into chasing solutions to non-existent problems, or even into damaging their systems with irrelevant 'fixes'.

The erratic behavior caused by update 576.02 was particularly alarming for those involved in artificial intelligence workflows, in which high-performance hardware is routinely pushed to its thermal limits for extended periods.

The problematic 576.02 driver prompted a broader wave of complaints after its mid-April release, despite initial reports that it offered some useful performance improvements. Notwithstanding the release of the hotfix, and the level of disruption that 576.02 appears to have caused, it remains available for download* on the Nvidia site at the time of writing.

Afterglow

In terms of the consequences of the faulty update, various kinds of damage and/or inconvenience have been reported: user Frankie_T9000 reported that their GPU crashed on startup due to heat build-up under the faulty update, and only stabilized after underpowering. He noted that the card did not appear to be permanently damaged, but that the heat build-up had further aged the old thermal paste, so he was replacing the paste and pads as soon as possible, with the pads due to arrive on Wednesday.

Yesterday another user in the same thread stated: ‘I use a modified fan curve with MSI Afterburner, and it kept showing that my GPU temps were constant at 27°C, so the fans did not kick in, which led to overheating problems. I thought it was a "me" problem, but after installing the previous driver it all worked fine again. The temps were also not displayed correctly in Task Manager.’
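The mechanism in that report, a custom fan curve fed by a temperature reading that never moves, can be illustrated with a short Python sketch. This is a generic piecewise-linear fan curve, not MSI Afterburner's actual implementation; the function name `fan_percent` and the curve points in `CURVE` are hypothetical.

```python
def fan_percent(temp_c, curve):
    """Map a reported temperature to a fan duty cycle by linear interpolation.

    curve -- list of (temperature_c, fan_percent) points with strictly
             increasing temperatures; the points used here are made up.
    """
    points = sorted(curve)
    # Below the first point or above the last, clamp to the end values.
    if temp_c <= points[0][0]:
        return float(points[0][1])
    if temp_c >= points[-1][0]:
        return float(points[-1][1])
    # Otherwise interpolate between the two surrounding points.
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)


# Hypothetical curve: fans off below 30°C, full speed at 85°C.
CURVE = [(30, 0), (50, 30), (70, 60), (85, 100)]

reported_c = 27   # the stuck value quoted by the user
actual_c = 80     # an assumed real temperature under load

print(fan_percent(reported_c, CURVE))  # 0.0: the fans never start
print(round(fan_percent(actual_c, CURVE), 1))
```

Because the curve only ever sees the reported value, a reading stuck at 27°C keeps the output at zero no matter how hot the GPU really is, which matches the behavior the user describes.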

Although NVIDIA (as noted in every hotfix release) often provides hotfixes for particular video games or platforms, the risk of heat damage to or around a GPU is higher for AI practitioners than for gamers, because intensive machine learning processes such as training or sustained inference place a GPU under consistent long-term load. This is a condition that is likely to be triggered only periodically in a game, which may surge into high usage for a boss battle or a particularly demanding map section, but which is otherwise designed as a compromise between GPU utilization and system stability.

* Archive: https://archive.ph/ylvr1

First published Tuesday, April 22, 2025
