We have planned a short Hydra maintenance window for this Monday 16th of December from 9:00. The job scheduler has been programmed to ensure that no jobs will be running by Monday 16th. Pending jobs will be started once the maintenance window is completed. We expect to complete the maintenance by 16:00. Two operations will be carried out:
If you have any questions, please contact us at email@example.com or firstname.lastname@example.org
The HPC team is pleased to announce a new series of training sessions, in collaboration with the Vlaams Supercomputer Centrum (VSC) and the Consortium des Équipements de Calcul Intensif (CECI), focusing on young and/or early stage researchers (Doctoral students, Master thesis students). Full-day hands-on sessions are organized for both Linux and HPC. See our Events page for more information and registration.
The Hydra maintenance is over. We have successfully finished all scheduled tasks. All users can now submit and run jobs again. Some important considerations:
If you find anything that doesn't work, please let us know at email@example.com / firstname.lastname@example.org.
Hydra will be shut down for its annual maintenance on September 9th at 9 AM. The maintenance will last the whole week, until the end of September 13th. A reservation is already in place to make sure that all jobs finish before the maintenance starts. New jobs will still be accepted, but they will only start if their walltime does not overlap with the maintenance period. During this maintenance we will remove all remaining software related to the old magnycours nodes (no longer in production). Please check that your scripts do not rely on files or executables located in /apps/brussel/CO7/magnycours. While Hydra is down there is no way to access your storage space, and thus access to other VSC sites will be blocked. We apologise for the inconvenience.
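The rule for when a job can still start can be sketched as a simple date check. This is only a minimal illustration, not the scheduler's actual logic: the function name `can_start_before_maintenance` is hypothetical, and the year 2019 is an assumption made for the example.

```python
from datetime import datetime, timedelta

# Start of the maintenance window; the year is an assumption for this example.
MAINTENANCE_START = datetime(2019, 9, 9, 9, 0)


def can_start_before_maintenance(start, walltime, maintenance_start=MAINTENANCE_START):
    """Return True if a job starting at `start` with the requested
    `walltime` would end no later than the start of the maintenance."""
    return start + walltime <= maintenance_start


# A 48-hour job starting on 6 September at 12:00 would end on 8 September at 12:00.
print(can_start_before_maintenance(datetime(2019, 9, 6, 12, 0), timedelta(hours=48)))
# A 72-hour job with the same start time would end on 9 September at 12:00.
print(can_start_before_maintenance(datetime(2019, 9, 6, 12, 0), timedelta(hours=72)))
```

In practice this means that requesting a shorter, realistic walltime gives a job a better chance of starting before the maintenance.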
On Monday 12th of August at 6:45 AM, a power cut partially affected the infrastructure of our datacenter. All nodes of Vega and all Skylake nodes of Hydra (37% of total) rebooted, causing the loss of running jobs. The storage was not affected by the power cut. If your job saves any data during its run, everything saved up to the moment of the power failure should be available in its usual location. The power cut only lasted a few minutes and at 9:30 AM the situation was back to normal. Currently, all nodes are operational. Our apologies for the inconvenience. If you have any questions or encounter problems, don't hesitate to contact us at email@example.com or firstname.lastname@example.org.
Today (July 25th) the cooling system of Hydra and Vega suffered a critical failure caused by the extreme heatwave. As a consequence, our first course of action was to pause the queues of both clusters to avoid overheating the system and negatively affecting running jobs. Unfortunately, at 1:00 PM the temperature inside the data centre reached 40 °C, which forced us to requeue all running jobs on Hydra and put both clusters in standby mode. We managed to keep the storage and the login node of Hydra active, so it is still possible for our users to retrieve their data. Moreover, running jobs that were close to finishing, with less than 24h of remaining time, have been kept alive and should finish normally. Requeued jobs will automatically restart as soon as the queues are resumed. No further action will be needed on the user's side. Currently, the repair of the cooling system is still ongoing. We hope to partially reactivate Hydra tomorrow, Friday 26th. We apologize for the inconvenience.
Hydra is getting a new pair of login nodes on July 8th 2019. The hardware powering these new login nodes is a big step forward in performance to ensure a stable and smooth user experience:
The address of the login nodes has not changed and they are accessible at login.hpc.ulb.ac.be. Your data is not affected by the upgrade and will stay in the same location. There are some differences with the old login nodes regarding the fair distribution of resources between the users, a new pre-processor of job scripts and a graphical environment, among others. See the SISC HPC Documentation FAQs for more information.
We had a cooling issue at the SISC datacenter (DC) this evening between 17:00 and 18:00. The situation was quickly handled and the cooling system was operational again from 18:00. The temperatures in the DC are now (20:00) close to normal. Only two Hydra nodes stopped automatically; the remaining compute nodes that were running jobs were not stopped. We have decided to keep ~20% of the compute nodes offline for the coming days as a precautionary measure. The job scheduler Moab will be on pause for the night and will be restarted tomorrow morning. This implies that jobs can be submitted but will remain queued. Once Moab is started again, the jobs will become eligible for execution. A decision on restarting the stopped nodes will be taken once the conditions are favourable, and communication will follow.
We have applied on both Hydra (3 June) and Vega (12 June) an important bug fix to the OpenBLAS module OpenBLAS/0.3.1-GCC-7.3.0-2.30. The bug was impacting matrix operations (dot product and/or multiplication) using OpenMP: if the product of the matrix size and the number of cores used for the computation is above ~10000, the result could be incorrect in some cases. Jobs on a single core were not impacted by this bug. See https://github.com/xianyi/OpenBLAS/issues/1673 for more details. Users who have previously run multi-core jobs with software relying on the foss/2018b or fosscuda/2018b toolchains should check the correctness of their results. We advise you to sample your results by rerunning a few jobs. If even one new result obtained with the patched module differs from the old one, all jobs executed with the problematic module should be re-submitted. Contact us at email@example.com or firstname.lastname@example.org for more information or help. More info on the fix: https://github.com/xianyi/OpenBLAS/commit/b14f44d2adbe1ec8ede0cdf06fb8b09f3c4b6e43 and https://github.com/easybuilders/easybuild-easyconfigs/pull/8396/files
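Comparing an old result with the result of a rerun can be done with a small script. The following is only a sketch: the function name `results_match` and the default tolerance are assumptions, and the appropriate tolerance depends on your application.

```python
import math


def results_match(old_values, new_values, rel_tol=1e-9, abs_tol=0.0):
    """Compare two sequences of numbers element by element.

    Returns False if the lengths differ or if any pair of values
    differs by more than the given tolerances (assumed defaults).
    """
    if len(old_values) != len(new_values):
        return False
    return all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(old_values, new_values)
    )


# Values from a job run with the old module and from its rerun (made-up numbers).
old_run = [1.0, 2.5, -3.75]
rerun = [1.0, 2.5, -3.75]
print(results_match(old_run, rerun))
```

If the function returns False for any sampled job, the results produced with the old module should be treated as suspect.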
The memory upgrade on the four storage controller nodes of Hydra completed successfully this Monday 3rd of June at 11:00. The job scheduler is active again and queued jobs are being started.
We'll add extra memory to the four controller nodes of the Hydra storage this Monday 3rd of June. The job scheduler will be placed on pause before the operation is executed, meaning that running jobs will not be impacted. The operation should be over by the afternoon.
Due to the popularity of the training sessions (all seats booked in less than 24 hours), we have decided to organise another one: a hands-on session on Linux is planned for January 14th and a hands-on session on HPC for January 15th 2019. See our Events page for more information.
Works on the ULB high voltage power grid are completed. Hydra and Vega HPC clusters are back online and jobs are running.
There will be works carried out on the ULB high voltage power grid on Monday 19 and Tuesday 20 of November. The works will involve two power cuts which will impact the Hydra and Vega HPC clusters. We have therefore planned a downtime for both clusters from 19/11 15:00 until 20/11 morning. As soon as the works on the power grid are completed, we'll put the clusters back online and send a notification. A reservation has been placed to ensure that no jobs are running during the maintenance window. This also implies that jobs with a walltime running into the maintenance window will be started only after the maintenance.
The HPC team is pleased to announce a new series of training sessions, in collaboration with the Vlaams Supercomputer Centrum (VSC) and the Consortium des Équipements de Calcul Intensif (CECI), focussing on young and/or early stage researchers (doctoral students, Master thesis students). This time we'll offer the usual Linux and HPC hands-on sessions plus a new session on Grid computing. See our Events page for more information and registration.
A planned firmware upgrade of one of the Hydra switches led to a short access interruption from the compute nodes to the Hydra storage. The outage took place between 8:40 and 8:50 this morning. The redundancy at the network and storage levels clearly did not work. Some jobs have been impacted and were lost. Users should check the output of terminated jobs for possible errors. The issue will be investigated and hopefully a fix will be found. We apologize for any inconvenience caused.
Works on the power grid of the ULB/VUB SISC datacenter are over. Hydra is now back at 100% capacity. Some nodes that were supposed to remain powered were stopped, which caused the loss of some jobs. Users with jobs completed this Friday morning should check the outputs for possible execution interruption. Sorry for any inconvenience caused.
Works on the power grid of the ULB/VUB SISC datacenter are over. Vega is now back online.
Works will take place on the power grid of the ULB/VUB SISC datacenter on Friday 3rd of August between 8:00 and 12:00. Some electric lines powering Hydra will be offline. We have therefore prepared the cluster to run with a reduced number of nodes. Access to Hydra and the data will remain possible and jobs will continue to run. The works should be completed during Friday morning and Hydra will be fully operational again in the afternoon.
Works will take place on the power grid of the ULB/VUB SISC datacenter on Friday 3rd of August between 8:00 and 12:00. Electric lines powering Vega will be offline. We have therefore prepared the cluster for a shutdown. No job will be running at the time of the shutdown (a reservation on all the nodes is in place) and therefore no job loss is expected. Queued jobs will be maintained as well. Once the works are completed, Vega will be powered back on and will be online at the beginning of the afternoon, or sooner.
This morning at 9:20, a routine operation performed by VUBNET on the Hydra network took down connectivity between the Hydra storage and the compute nodes. The problem was quickly identified and connectivity restored. Some operations followed to recover pending data not written to the storage (no data loss is expected) and the Hydra storage was fully recovered at 10:00. Unfortunately, almost all jobs crashed during the storage outage. Please check your completed jobs of this morning for possible error messages.
We have completed the maintenance works on Hydra. It took slightly longer than expected. Our apologies for that. The cluster is accessible again and is running jobs. The nodes' operating system was updated, data was reorganised on the Hydra storage (nothing changed on the users' side), Torque was updated to the latest version and the maximum walltime was reduced to 5 days.
We have completed the maintenance works: Vega is accessible again and queued jobs are now running. The nodes' operating system was updated and Slurm was upgraded to version 17.11.7.
Vega will be on maintenance from 25/06 to 27/06 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Vega core services. The maintenance will not kill jobs on the cluster or remove queued jobs. We have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before the 25/06 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
Hydra will be on maintenance from 25/06 to 29/06 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Hydra core services. We will change the maximum walltime from 14 days to 5 days. Pending jobs still in the queue and with a walltime above 5 days will be updated accordingly. The maintenance will not kill jobs on the cluster or remove queued jobs. As for previous planned Hydra downtimes, we have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before the 25/06 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
We have identified jobs on Hydra involved in the central storage NFS service crash. There were ~100 identical jobs running a short loop with an open-write-close action on the same file located in the Hydra home directory. This led to 1) massive IOPS on the NFS server and 2) an NFS lock race condition that eventually led to the NFS service crash. We will continue to work with the storage vendor to help them figure out why their NFS service crashed. This incident is also a good reminder for all to avoid using the Hydra home directory for IO intensive jobs. We have deployed a dedicated storage on Hydra capable of sustaining massive IO loads and you can use it via your work directory. Hydra and Vega clusters are fully operational again.
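As an illustration of a gentler IO pattern, the sketch below opens the output file once and writes all records through the same file handle, instead of opening and closing the file in every iteration. The environment variable `WORKDIR` and the file name `results.txt` are assumptions made for this example; use the actual path of your work directory.

```python
import os
import tempfile


def write_results_once(lines, directory):
    """Write all lines through a single open file handle.

    Opening the file once lets Python buffer the writes, instead of
    issuing one open-write-close cycle per line.
    """
    path = os.path.join(directory, "results.txt")
    with open(path, "w") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path


# WORKDIR is an assumed variable name; fall back to a temporary directory
# if it is not set or does not point to an existing directory.
work_dir = os.environ.get("WORKDIR", "")
if not os.path.isdir(work_dir):
    work_dir = tempfile.mkdtemp()

path = write_results_once((f"step {i}" for i in range(1000)), work_dir)
print(path)
```

When many jobs run at the same time, giving each job its own output file also avoids lock contention on a single shared file.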
After further investigation of the NFS issue on the SISC central storage, it turned out that when specific private networks are made accessible to the central storage, the crash follows a few hours later. Hydra and Vega access the NFS via private networks involved in the NFS crash phenomenon. To preserve other critical services relying on, but not impacting, the central storage (like the email service), the Hydra and Vega private networks have been cut off from the central storage. At this stage we still don't know if it is a rogue process or a network-related issue. We continue to investigate in collaboration with SISC colleagues and the network teams.
The SISC central storage is encountering issues making the Hydra home directories unavailable. Login to Hydra will be impossible and running jobs relying on files stored on the home directories will fail. The job scheduler has been stopped to prevent new jobs starting. The issue is being investigated and further information will be posted once the issue has been resolved.
The VSC Users Day 2018 will take place on 22nd of May 2018 at the Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten. See https://www.vscentrum.be/en/education-and-trainings/detail/vsc-user-day-22052018
The 10th CÉCI Scientific Meeting will take place on 4th of May 2018 at UNamur. See http://www.ceci-hpc.be/scientificmeeting.html
PRACE has issued the 17th call for Proposals.
Deadline: 2nd May 2018, 10:00 CET.
Stake: single-year and multi-year proposals starting 2nd October 2018.
Resources: Joliot-Curie, Hazel Hen, JUWELS, Marconi, MareNostrum IV, Piz Daint and SuperMUC.
We replaced the dying switch with a new one (thanks VUBNET team!). The cluster is back online. Note that jobs that were running at the time of the switch failure have been lost. We kept Slurm stopped to prevent other job losses. Jobs that were in the queue are now running. New jobs can be submitted now.
It seems that one of the Gbps switches on Vega is dying (ports going down and random restarts). We are investigating the issue and will replace the switch if this is a hardware issue. The cluster is currently offline and will stay offline a bit longer if the switch must be replaced.
Broken disks have been replaced on Vega (no impact on data availability). A deep cleaning also recovered 20 TB of storage space.
We have installed Gaussian version 16 on Hydra. To use this version, simply load the right module: module load gaussian16 For Gaussian 09 users: we recommend rerunning your latest jobs with G16 and comparing the results.
We have completed the maintenance works: Hydra is accessible again and queued jobs are now running. Works summary: 1) The entire Hydra Ethernet network has been rebuilt from scratch with new switches and new cabling. All network communications have therefore been improved, for globally increased performance in data management and transfers. 2) Four new GPGPU nodes have been added, each with 2x Tesla P100. 3) The storage capacity has been increased to 800 TB. 4) The usual OS/software updates and upgrades have been made, including the installation of security patches.
We are planning a maintenance window on Hydra from 11/12 to 15/12 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Hydra core services including a complete network upgrade (physical and logical levels), 4 new GPGPU nodes and an increase of the storage capacity to 750 TB. The maintenance will not kill jobs on the cluster or remove queued jobs. As for previous planned Hydra downtimes, we have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before the 11/12 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
We are planning a maintenance window on Hydra on 25/09 with a scheduled downtime. The downtime should last a few hours. A global reservation on Hydra has been created to make sure that no job will be running on Sept. 25. Jobs that can complete before that date will be executed and those with a walltime extending beyond Sept. 25 will be kept in the queue.
We are redirecting all the Vega logs to our ELK platform. The ELK analytic capabilities will allow us to better spot issues with the compute nodes and with the submitted/running jobs. Our first objective: spot jobs wasting CPU or memory resources.
Long-standing hardware issues with some Vega compute nodes are being tackled. We'll remove the problematic nodes from Vega, fix them and place them back in Vega. The total number of available compute nodes will vary in the coming days.
We are happy to announce the hire of a new CECI logisticien: Ariel Lozano. Welcome Ariel!
We are planning a maintenance window on Hydra from 03/07 to 06/07 with a scheduled downtime. A reservation will be placed on the cluster from 19/06. Jobs that can complete before the 03/07 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.
The cooling system is now back to normal. The Vega and Hydra clusters are fully operational too.
This morning at ~9:30 the datacentre cooling system failed. Given the temperature increase in the datacentre we had to switch off the Hydra and Vega compute nodes. Once the cooling system has been repaired, we'll switch on the compute nodes. Jobs that were running will be lost but those in the queue have been kept. All data are safe on the Hydra storage. We'll send another notification once the clusters are back online.
The VSC has a vacancy for an account manager VSC and industry. You will be employed by the Fonds Wetenschappelijk Onderzoek - Vlaanderen. Closing date for applications is June 12, 2017. Given the nature of the job, the vacancy is only available in Dutch. See the detailed announcement at https://www.vscentrum.be/en/news/detail/vacancy-account-manager-vsc-and-industry
This user day will take place on June 2, 2017 in Brussels. For more information please see https://www.vscentrum.be/events/userday-2017
UCL professors Gian-Marco Rignanese and Jean-Christophe Charlier have been granted 20 million core-hours on Marconi KNL (CINECA, Italy). Congratulations to them!
The ninth CÉCI scientific meeting day is organised in Louvain-la-Neuve on April 21st. More information and registration: see http://www.ceci-hpc.be/scientificmeeting.html
The CÉCI common filesystem is fully available on the 6 CÉCI clusters. For example, the partition /CECI/home/ is directly accessible from all the login and compute nodes on all the clusters. Make sure to try it out! More information on https://support.ceci-hpc.be/doc/_contents/ManagingFiles/TheCommonFilesystem.html
For the next four years, KU Leuven will host the VSC Tier-1 supercomputer. We would like to present this supercomputer to you. See https://www.vscentrum.be/en/news/detail/breniac-the-new-tier-1-supercomputer-in-flanders