Hello,
I've been using AI Training for several large training runs, and every time the training time is abnormally long (and billed accordingly). A run that should take no more than 5 days takes 30 days.
The Job Monitoring interface shows that both the CPU and the GPU are idle most of the time.
I use a custom image based on python:3.9 (which is based on Debian buster).
The data is located in an object storage volume mounted with RW:cache. (Note that the problem was the same without the cache.) Outputs are stored in the same storage.
I suspect an I/O or journaling problem, or a problem related to the use/sync of object storage, but I cannot investigate it because I/O monitoring is not available.
Am I doing something wrong?
GPU and CPU idle during training
Hello @DamienL38,
If the issue is still occurring, please share more details and/or the tests you have run so that the community can give you feedback.
^FabL
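
One test that might help narrow this down, since I/O monitoring is not available in the interface: measure the read throughput of the mounted volume from inside the job itself. The sketch below is only an illustration and uses nothing beyond the Python standard library. The function name `measure_read_throughput` and the mount path `/workspace/data` are assumptions made for the example, not anything provided by AI Training, so adjust the path to wherever the volume is mounted. It reads files sequentially up to roughly `max_bytes` and reports the rate in MB/s.

```python
import os
import time


def measure_read_throughput(root, max_bytes=1 << 30, chunk_size=1 << 20):
    """Sequentially read files under `root` and return (bytes_read, seconds, MB/s).

    Hypothetical helper for illustration only. Reading stops once roughly
    `max_bytes` have been read, so the test stays short on large datasets.
    """
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Read in fixed-size chunks until the file ends or the budget is spent.
                while total < max_bytes:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
            if total >= max_bytes:
                break
        if total >= max_bytes:
            break
    elapsed = time.perf_counter() - start
    mb_per_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mb_per_s


if __name__ == "__main__":
    # Assumed mount point; replace with the actual path of the mounted volume.
    data_dir = "/workspace/data"
    if os.path.isdir(data_dir):
        n_bytes, seconds, rate = measure_read_throughput(data_dir)
        print(f"read {n_bytes} bytes in {seconds:.2f} s ({rate:.1f} MB/s)")
    else:
        print(f"{data_dir} not found; set data_dir to the mounted volume path")
```

Keep in mind that a second run over the same files may be served from the operating system's page cache and look much faster than the first. If the first-run rate is far below what the training loop needs to keep the GPU busy, that would be consistent with the data loading being the bottleneck, and posting the numbers here would give the community something concrete to work with.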