GPU and CPU idle during training

By DamienL38
Created on 2022-09-20 09:12:05 (edited on 2024-09-04 13:04:56) in AI and Machine Learning OVHcloud

Hello,
I've been using AI Training for several large training runs, and every time the training takes abnormally long (and is billed accordingly). A run that should take no more than 5 days takes 30 days.
The Job Monitoring interface shows that both CPU and GPU are idle most of the time.
[image]
I use a custom image based on python:3.9 (which is based on Debian buster).
The data is located in a mounted object storage volume (RW:cache). (Note that the problem was the same without the cache.) Outputs are stored in the same storage.
I suspect an IO or journaling problem, or a problem related to the use/sync of object storage, but I cannot investigate it because IO monitoring is not available.
Am I doing something wrong?
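
Since IO monitoring is not available in the interface, one way to check the IO hypothesis from inside the job might be to time raw reads from the mounted volume. The following is only a minimal sketch: the function name `measure_read_throughput` and the path `/workspace/data` are hypothetical, not part of AI Training, and the mount point must be replaced with the real one. Note that a second run over the same files may be served from a local cache and look much faster than the first.

```python
import os
import time


def measure_read_throughput(root, max_bytes=1 << 30, chunk_size=1 << 20):
    """Read files under `root` sequentially, stopping after `max_bytes`.

    Returns (bytes_read, elapsed_seconds). Hypothetical helper for
    illustration; it only uses the standard library.
    """
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Read in fixed-size chunks until the file or the budget ends.
                while total < max_bytes:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
            if total >= max_bytes:
                return total, time.perf_counter() - start
    return total, time.perf_counter() - start


# Example usage (the path is an assumption, adjust to the actual mount point):
# n, secs = measure_read_throughput("/workspace/data")
# print(f"{n / 1e6:.1f} MB in {secs:.2f} s = {n / 1e6 / max(secs, 1e-9):.1f} MB/s")
```

A throughput far below what the training loop needs to consume per second would be consistent with the CPU and GPU sitting idle while waiting for data.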


1 reply (latest reply on 2022-09-26 09:54:25 by FabL)

Hello @DamienL38,

If the problem is still occurring, please provide more details and/or describe the tests you have performed so that the community can help.

^FabL
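
One test that could supply this kind of detail is to split the training loop's wall-clock time into "waiting for the next batch" and "running the training step". The sketch below is framework-agnostic and makes assumptions: `profile_loop`, `batches`, and `train_step` are hypothetical names standing in for the real data iterator and step function. With asynchronous GPU execution, the step may return before the GPU work has finished, so the split is only indicative.

```python
import time


def profile_loop(batches, train_step):
    """Return (seconds spent waiting for data, seconds spent in train_step).

    `batches` is any iterable of batches and `train_step` any callable
    taking one batch; both are placeholders for the real training code.
    """
    data_time = 0.0
    step_time = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading / IO wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent here is compute (as seen from Python)
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time
```

If most of the time lands in the data-wait bucket, that would point toward storage or the input pipeline rather than the model itself.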

Replies are currently disabled for this question.