HPC News

  • 11/12/2019 December

    Storage migration on December 16th

    We have planned a short Hydra maintenance window for this Monday 16th of December from 9:00. The job scheduler has been programmed to ensure that no jobs will be running by Monday 16th. Pending jobs will be started once the maintenance window is completed. We expect to complete the maintenance by 16:00. Two operations will be carried out:

    1. Switch the home directories for NetID accounts from the SISC central storage to the Hydra storage. SISC is replacing the old central storage with a new storage solution. Many services at SISC are therefore subject to a data migration process and, in that context, it has been decided to move the NetID home directories to the Hydra storage. On December 16th, we will finish the NetID home directories migration (run a final data synchronization) and then switch the home directory mount points to the Hydra local storage, as sketched after this list. The operation will be completely transparent: no action is needed on the user side.
    2. Switch from VUB NetIDs to VSC accounts. Following our previous message on December 2nd, we would like to remind you that from December 16th all VUB users requiring access to Hydra must have a VSC account. All VUB NetIDs will be disabled. Several VUB users have already migrated their accounts and we invite all the others to do so in the coming days. The migration process is fully automated from the VSC account creation step onwards. The detailed procedure is described on our documentation page about the migration.
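
    For the curious, the cut-over on our side boils down to a final incremental synchronization followed by a mount point switch. A purely illustrative sketch (paths, options and hostnames are assumptions, not the exact commands we run):

        # final incremental sync from the old SISC central storage (illustrative paths)
        rsync -aH --delete /sisc-central/netid-homes/ /hydra-storage/netid-homes/
        # switch the mount point so $HOME now resolves to the Hydra storage
        umount /home && mount hydra-storage:/netid-homes /home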

    If you have any questions, please contact us at hpc@vub.ac.be or hpc@ulb.ac.be.

  • 27/09/2019 September

    A new series of training sessions is available

    The HPC team is pleased to announce a new series of training sessions, in collaboration with the Vlaams Supercomputer Centrum (VSC) and the Consortium des Équipements de Calcul Intensif (CÉCI), focusing on young and/or early-stage researchers (doctoral students, Master's thesis students). Full-day hands-on sessions are organized for both Linux and HPC. See our Events page for more information and registration.

  • 16/09/2019 September

    Annual Maintenance of Hydra is complete

    The Hydra maintenance is over. We have successfully finished all scheduled tasks. All users can now submit and run jobs again. Some important considerations:

    • Software in /apps/brussel/CO7/magnycours has been deleted.
    • The hostname of many nodes has changed from 'nic*' to 'node*'. Don't worry if you see these changes; they won't affect your calculations.
    • The private nodes and the himem node are still offline. They will be switched on as soon as possible.

    If you find anything that doesn't work, please let us know at hpc@vub.ac.be / hpc@ulb.ac.be.

    Happy computing!

  • 02/09/2019 September

    Annual Maintenance of Hydra

    Hydra will be shut down for its annual maintenance on September 9th at 9 AM. The maintenance will be carried out over the whole week, until the end of September 13th. A reservation is already in place to make sure that all jobs finish on time before the maintenance starts. New jobs will still be accepted, but they will only start if their walltime does not fall into the maintenance period. During this maintenance we will remove all remaining software related to the old magnycours nodes (no longer in production). Please check that your scripts do not rely on files/executables located in /apps/brussel/CO7/magnycours. While Hydra is down there is no way to access your storage space, and access to other VSC sites will therefore be blocked as well. We apologise for the inconvenience.
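
    A quick way to verify your scripts is a recursive grep for the old path. A minimal sketch, assuming your job scripts live under $HOME and end in .sh (adapt the location and file pattern to your own setup):

        # list job scripts still referencing the software tree to be removed
        grep -rl '/apps/brussel/CO7/magnycours' "$HOME" --include='*.sh'

    Any file listed should be switched to the corresponding modules available for the current nodes.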

  • 12/08/2019 August

    Power cut at SISC

    On Monday 12th of August at 6:45 AM, a power cut partially affected the infrastructure of our datacenter. All nodes of Vega and all Skylake nodes of Hydra (37% of the total) rebooted, causing the loss of running jobs. The storage was not affected by the power cut: if your job saved any data during its run, everything written up to the moment of the power failure should be available in its usual location. The power cut only lasted a few minutes and at 9:30 AM the situation was back to normal. Currently, all nodes are operational. Our apologies for the inconvenience. If you have any questions or encounter problems, don't hesitate to contact us at hpc@vub.ac.be or hpc@ulb.ac.be.

  • 29/07/2019 July

    Cooling issue resolved

    Hydra and Vega are back at full capacity. The job scheduler is active and jobs are being processed as usual. Requeued jobs will automatically restart as nodes become available. If you have any problem, please contact our support at hpc@ulb.ac.be or hpc@vub.ac.be.

  • 25/07/2019 July

    Cooling issue in the SISC datacenter

    Today (July 25th) the cooling system of Hydra and Vega suffered a critical failure caused by the extreme heatwave. As a consequence, the first course of action was to pause the queues of both clusters to avoid overheating the system and negatively affecting running jobs. Unfortunately, at 1:00 PM the temperature inside the data centre reached 40°C, which forced us to requeue all running jobs on Hydra and put both clusters in standby mode. We managed to keep the storage and the login node of Hydra active, so it is still possible for our users to retrieve their data. Moreover, running jobs that were close to finishing (less than 24h of remaining time) have been kept alive and should finish normally. Requeued jobs will automatically restart as soon as the queues are resumed; no further action is needed on the user's side. The repair of the cooling system is still ongoing. We hope to partially reactivate Hydra tomorrow, Friday 26th. We apologize for the inconvenience.

  • 08/07/2019 July

    New login nodes for Hydra

    Hydra is getting a new pair of login nodes on July 8th 2019. The hardware powering these new login nodes is a big step forward in performance to ensure a stable and smooth user experience:

    • Intel Skylake (Xeon Gold 6126) - 24 cores in total
    • 96GB of memory (RAM)
    • 10Gb Ethernet network connection
    • Infiniband EDR connection to the storage (exception: the $HOME directory of NetID users)

    The address of the login nodes has not changed: they remain accessible at login.hpc.vub.ac.be and login.hpc.ulb.ac.be. Your data is not affected by the upgrade; it stays in the same location. Compared with the old login nodes there are some differences, among others a fair distribution of resources between the users, a new pre-processor for job scripts and a graphical environment. See the SISC HPC Documentation FAQs for more information.

  • 25/06/2019 June

    Cooling issue in the SISC datacenter

    We had a cooling issue at the SISC datacenter (DC) this evening between 17:00 and 18:00. The situation was quickly handled and the cooling system was operational again from 18:00. The temperatures in the DC are now (20:00) close to normal. Only two Hydra nodes stopped automatically; the remaining compute nodes that were running jobs were not stopped. We have decided to keep ~20% of the compute nodes offline for the coming days as a precautionary measure. The job scheduler Moab will be paused for the night and restarted tomorrow morning. This implies that jobs can be submitted but will remain queued; once Moab is started again, the jobs will become eligible for execution. A decision on restarting the stopped nodes will be taken once conditions are favourable, and a communication will follow.

  • 13/06/2019 June

    Fixed a bug in OpenBLAS/0.3.1-GCC-7.3.0-2.30 module

    We have applied an important bug fix to the OpenBLAS module OpenBLAS/0.3.1-GCC-7.3.0-2.30, on both Hydra (3 June) and Vega (12 June). The bug impacted matrix operations (dot product and/or multiplication) using OpenMP: if the product of the matrix size and the number of cores used for the computation was above ~10000, the result could be incorrect in some cases. Jobs on a single core were not impacted by this bug. See https://github.com/xianyi/OpenBLAS/issues/1673 for more details. Users who have previously run multi-core jobs with software relying on the foss/2018b or fosscuda/2018b toolchains should check the correctness of their results. We advise sampling your results by rerunning a few jobs: if any new result differs with the patched module, all jobs executed with the problematic module should be re-submitted. Contact us at hpc@vub.ac.be or hpc@ulb.ac.be for more information or help. More info on the fix:

    • https://github.com/xianyi/OpenBLAS/commit/b14f44d2adbe1ec8ede0cdf06fb8b09f3c4b6e43
    • https://github.com/easybuilders/easybuild-easyconfigs/pull/8396/files
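
    To locate candidate jobs for rechecking, a recursive grep of your job scripts for the affected toolchain versions can help. A minimal sketch, assuming your scripts are kept under $HOME/jobs (an illustrative location; adapt it to your own layout):

        # find job scripts that loaded software built with the affected toolchains
        grep -rl -e 'foss/2018b' -e 'fosscuda/2018b' "$HOME/jobs"

    Rerun a few of the matching jobs with the patched module and compare the new outputs against the originals.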

  • 03/06/2019 June

    Memory upgrade on Hydra storage completed

    The memory upgrade on the four storage controller nodes of Hydra completed successfully this Monday 3rd of June at 11:00. The job scheduler is active again and queued jobs are being started.

  • 03/06/2019 June

    Memory upgrade on Hydra storage

    We'll add extra memory to the four controller nodes of the Hydra storage this Monday 3rd of June. The job scheduler will be placed on pause before executing the operation, meaning that running jobs will not be impacted. The operation should be over by the afternoon.

  • 10/12/2018 December

    An extra series of training sessions is available

    Due to the popularity of the training sessions (all seats booked in less than 24 hours), we have decided to organise another one: a hands-on session on Linux is planned for January 14th and a hands-on session on HPC for January 15th 2019. See our Events page for more information.

  • 20/11/2018 November

    Works on the ULB high voltage power grid completed

    Works on the ULB high voltage power grid are completed. Hydra and Vega HPC clusters are back online and jobs are running.

  • 07/11/2018 November

    Works on the ULB high voltage power grid

    Works will be carried out on the ULB high voltage power grid on Monday 19 and Tuesday 20 of November. The works will involve two power cuts which will impact the Hydra and Vega HPC clusters. We have therefore planned a downtime for both clusters from 19/11 15:00 until the morning of 20/11. As soon as the works on the power grid are completed, we'll put the clusters back online and send a notification. A reservation has been placed to ensure that no jobs are running during the maintenance window. This also implies that jobs with a walltime running into the maintenance window will only be started after the maintenance.

  • 03/10/2018 October

    A new series of training sessions is available

    The HPC team is pleased to announce a new series of training sessions, in collaboration with the Vlaams Supercomputer Centrum (VSC) and the Consortium des Équipements de Calcul Intensif (CÉCI), focusing on young and/or early-stage researchers (doctoral students, Master's thesis students). This time we'll offer the usual Linux and HPC hands-on sessions plus a new session on Grid computing. See our Events page for more information and registration.

  • 09/08/2018 August

    Short data access interruption

    A planned firmware upgrade of one of the Hydra switches led to a short access interruption from the compute nodes to the Hydra storage. The outage took place between 8:40 and 8:50 this morning. The redundancy at the network and storage levels clearly did not work. Some jobs have been impacted and were lost. Users should check the output of terminated jobs for possible errors. The issue will be investigated and hopefully a fix will be found. We apologize for any inconvenience caused.

  • 03/08/2018 August

    Hydra back at 100%

    Works on the power grid of the ULB/VUB SISC datacenter are over. Hydra is now back at full capacity. Some nodes that were supposed to remain powered were stopped, which caused the loss of some jobs. Users with jobs completed this Friday morning should check the outputs for possible execution interruptions. Sorry for any inconvenience caused.

  • 03/08/2018 August

    Vega is back online

    Works on the power grid of the ULB/VUB SISC datacenter are over. Vega is now back online.

  • 30/07/2018 July

    Limited resources available on Hydra

    Works will take place on the power grid of the ULB/VUB SISC datacenter on Friday 3rd of August between 8:00 and 12:00. Some electric lines powering Hydra will be offline. We have therefore prepared the cluster to run with a reduced number of nodes. Access to Hydra and the data will remain possible and jobs will continue to run. The works should be completed during Friday morning and Hydra will be fully operational again in the afternoon.

  • 30/07/2018 July

    Vega stopped on Friday 3rd of August between 8:00 and 12:00

    Works will take place on the power grid of the ULB/VUB SISC datacenter on Friday 3rd of August between 8:00 and 12:00. The electric lines powering Vega will be offline. We have therefore prepared the cluster for a shutdown. No job will be running at the time of the shutdown (a reservation on all the nodes is in place), so no job loss is expected. Queued jobs will be maintained as well. Once the works are completed, Vega will be powered back on and should be online by early afternoon, or sooner.

  • 12/07/2018 July

    Hydra storage temporarily unavailable

    This morning at 9:20, a routine operation on the Hydra network performed by VUBNET took down connectivity between the Hydra storage and the compute nodes. The problem was quickly identified and connectivity restored. Some operations followed to recover pending data not yet written to the storage (no data loss is expected) and the Hydra storage was fully recovered at 10:00. Unfortunately, almost all jobs crashed during the storage outage. Please check your jobs completed this morning for possible error messages.

  • 30/06/2018 June

    Maintenance completed successfully

    We have completed the maintenance works on Hydra. It took slightly longer than expected; our apologies for that. The cluster is accessible again and is running jobs. The nodes' operating system was updated, data was reorganised on the Hydra storage (nothing changed on the users' side), Torque was updated to the latest version and the maximum walltime was reduced to 5 days.

  • 28/06/2018 June

    Maintenance completed successfully

    We have completed the maintenance works: Vega is accessible again and queued jobs are now running. The nodes' operating system was updated and Slurm was upgraded to version 17.11.7.

  • 25/06/2018 June

    Planned maintenance from 25/06/2018 to 27/06/2018

    Vega will be in maintenance from 25/06 to 27/06 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Vega core services. The maintenance will not kill jobs on the cluster or remove queued jobs. We have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before 25/06 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.

  • 25/06/2018 June

    Planned maintenance from 25/06/2018 to 29/06/2018

    Hydra will be in maintenance from 25/06 to 29/06 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Hydra core services. We will change the maximum walltime from 14 days to 5 days; pending jobs still in the queue with a walltime above 5 days will be updated accordingly. The maintenance will not kill jobs on the cluster or remove queued jobs. As for previous planned Hydra downtimes, we have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before 25/06 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed, as illustrated below.
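
    In practice, a job can still start before the downtime as long as its requested walltime fits in the remaining window. A minimal sketch (script name and requested resources are illustrative):

        # requesting 3 days of walltime: this job only starts if 3 full days
        # remain before the maintenance reservation begins
        qsub -l walltime=72:00:00 myjob.sh

    A job requesting more time than remains before the window simply stays queued until the maintenance is over.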

  • 15/05/2018 May

    Hydra & Vega back online

    We have identified the jobs on Hydra involved in the crash of the central storage NFS service. There were ~100 identical jobs running a short loop with an open-write-close action on the same file located in the Hydra home directory. This led to 1) massive IOPS on the NFS server and 2) an NFS lock race condition that eventually made the NFS service crash. We will continue to work with the storage vendor to help them figure out why their NFS service crashed. This incident is also a good reminder for all to avoid using the Hydra home directory for IO intensive jobs: we have deployed a dedicated storage on Hydra capable of sustaining massive IO loads, which you can use via your work directory. Hydra and Vega are again fully operational.
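
    For illustration, the problematic pattern looked roughly like the following (a hypothetical reconstruction, not the actual user script): many concurrent jobs each appending to the same file in $HOME in a tight loop, where every append is an open-write-close cycle on NFS:

        # anti-pattern: hundreds of jobs hammering one shared NFS-backed file
        while true; do
            echo "step done" >> "$HOME/progress.log"
        done

    Keeping such files in your work directory, on the dedicated IO-capable storage, avoids stressing the NFS home directories.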

  • 12/05/2018 May

    Hydra & Vega still unavailable

    Further investigation of the NFS issue on the SISC central storage showed that when specific private networks are made accessible to the central storage, the crash follows a few hours later. Hydra and Vega access the NFS via private networks involved in this crash phenomenon. To preserve other critical services relying on, but not impacting, the central storage (like the email service), the Hydra and Vega private networks have been cut off from the central storage. At this stage we still don't know if it is a rogue process or a network-related issue. We continue to investigate in collaboration with the SISC colleagues and the network teams.

  • 09/05/2018 May

    Hydra home directories unavailable

    The SISC central storage is encountering issues, making the Hydra home directories unavailable. Logging in to Hydra is impossible and running jobs relying on files stored in the home directories will fail. The job scheduler has been stopped to prevent new jobs from starting. The issue is being investigated and further information will be posted once it has been resolved.

  • 03/05/2018 May

    VSC Users Day 2018

    The VSC Users Day 2018 will take place on 22nd of May 2018 at the Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten. See https://www.vscentrum.be/en/education-and-trainings/detail/vsc-user-day-22052018

  • 18/04/2018 April

    10th CÉCI Scientific Meeting

    The 10th CÉCI Scientific Meeting will take place on 4th of May 2018 at UNamur. See http://www.ceci-hpc.be/scientificmeeting.html

  • 19/03/2018 March

    PRACE Call for Proposal

    PRACE has issued its 17th Call for Proposals.
    Deadline: 2nd May 2018, 10:00 CET.
    Scope: single-year and multi-year proposals starting 2nd October 2018.
    Resources: Joliot-Curie, Hazel Hen, JUWELS, Marconi, MareNostrum IV, Piz Daint and SuperMUC.

    See http://www.prace-ri.eu/prace-project-access

  • 27/02/2018 February

    Switch failure: solved

    We replaced the dying switch with a new one (thanks VUBNET team!). The cluster is back online. Note that jobs that were running at the time of the switch failure have been lost. We kept Slurm stopped to prevent further job losses. Jobs that were in the queue are now running, and new jobs can be submitted again.

  • 26/02/2018 February

    Switch failure

    It seems that one of the Gbps switches on Vega is dying (ports going down and random restarts). We are investigating the issue and will replace the switch if this is a hardware problem. The cluster is currently offline and will stay offline a bit longer if the switch must be replaced.

  • 30/01/2018 January

    Storage fixes

    Broken disks have been replaced on Vega (no impact on data availability). A deep cleaning also recovered 20 TB of storage space.

  • 24/01/2018 January

    Gaussian version 16

    We have installed Gaussian version 16 on Hydra. To use this version, simply load the right module: module load gaussian16. For Gaussian 09 users: we recommend rerunning your latest jobs with G16 and comparing the results, as sketched below.
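
    A minimal job script using the new module could look like this (the input file name and requested resources are illustrative):

        #!/bin/bash
        #PBS -l walltime=24:00:00
        #PBS -l nodes=1:ppn=4
        # load the Gaussian 16 module and run an input deck
        module load gaussian16
        g16 < molecule.com > molecule.log

    The g16 executable replaces g09; input files are generally compatible, which makes the recommended G09-vs-G16 comparison straightforward.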

  • 14/12/2017 December

    Maintenance completed successfully

    We have completed the maintenance works: Hydra is again accessible and queued jobs are now running. Works summary:

    • The entire Hydra Ethernet network has been rebuilt from scratch with new switches and new cabling, improving the overall performance of data management and transfers.
    • Four new GPGPU nodes have been added, each with 2x Tesla P100.
    • The storage capacity has been increased to 800 TB.
    • The usual OS/software updates and upgrades have been made, including the installation of security patches.

  • 24/11/2017 November

    Planned maintenance from 11/12/2017 to 15/12/2017

    We are planning a maintenance window on Hydra from 11/12 to 15/12 with a scheduled downtime. The maintenance will be used to perform several updates/upgrades on Hydra core services, including a complete network upgrade (physical and logical levels), 4 new GPGPU nodes and an increase of the storage capacity to 750 TB. The maintenance will not kill jobs on the cluster or remove queued jobs. As for previous planned Hydra downtimes, we have placed a reservation on the cluster two weeks before the maintenance date, aligned with the maximum walltime. Jobs that can complete before 11/12 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.

  • 23/09/2017 September

    Planned maintenance on 25/09/2017

    We are planning a maintenance window on Hydra on 25/09 with a scheduled downtime. The downtime should last a few hours. A global reservation on Hydra has been created to make sure that no job will be running on Sept. 25. Jobs that can complete before that date will be executed and those with a walltime going beyond Sept. 25 will be kept in the queue.

  • 18/09/2017 September

    Advanced monitoring

    We are redirecting all the Vega logs to our ELK platform. Its analytic capabilities will allow us to better spot issues with the compute nodes and with submitted/running jobs. Our first objective: spotting jobs that waste CPU or memory resources.

  • 11/09/2017 September

    Compute nodes reparation

    Long-standing hardware issues with some Vega compute nodes are being tackled. We'll remove the problematic nodes from Vega, fix them and place them back in the cluster. The total number of available compute nodes will vary in the coming days.

  • 11/09/2017 September

    Ariel Lozano is the new CÉCI logisticien

    We are happy to announce the hiring of a new CÉCI logisticien: Ariel Lozano. Welcome Ariel!

  • 06/06/2017 June

    Planned maintenance - 03/07 to 06/07

    We are planning a maintenance window on Hydra from 03/07 to 06/07 with a scheduled downtime. A reservation will be placed on the cluster from 19/06. Jobs that can complete before 03/07 will (continue to) run and those submitted with a walltime bumping into the maintenance window will be kept pending until the maintenance is completed.

  • 09/05/2017 May

    Cooling system operational

    The cooling system is now back to normal. The Vega and Hydra clusters are fully operational too.

  • 09/05/2017 May

    Air-co failure: Hydra and Vega stopped

    This morning at ~9:30 the datacentre cooling system failed. Given the temperature increase in the datacentre, we had to switch off the Hydra and Vega compute nodes. Once the cooling system has been repaired, we'll switch the compute nodes back on. Jobs that were running are lost, but those in the queue have been kept. All data are safe on the Hydra storage. We'll send another notification once the clusters are back online.

  • 26/04/2017 April

    Vacancy account manager VSC and industry

    The VSC has a vacancy for an account manager VSC and industry. You will be employed by the Fonds Wetenschappelijk Onderzoek - Vlaanderen. The closing date for applications is June 12, 2017. Given the nature of the job, the vacancy is only available in Dutch. See the detailed announcement at https://www.vscentrum.be/en/news/detail/vacancy-account-manager-vsc-and-industry

  • 26/04/2017 April

    VSC Users Day 2017

    This user day will take place on June 2, 2017 in Brussels. For more information please see https://www.vscentrum.be/events/userday-2017

  • 02/04/2017 April

    20 000 000 core-hours on a PRACE cluster allocated to two CÉCI users

    UCL professors Gian-Marco Rignanese and Jean-Christophe Charlier have been granted 20 million core-hours on Marconi KNL (CINECA, Italy). Congratulations to them!

  • 21/03/2017 March

    Register for the Ninth CÉCI Scientific Meeting!

    The ninth CÉCI scientific meeting day is organised in Louvain-la-Neuve on April 21st. More information and registration: see http://www.ceci-hpc.be/scientificmeeting.html

  • 02/03/2017 March

    The CÉCI common filesystem is ready

    The CÉCI common filesystem is fully available on the 6 CÉCI clusters. For example, the partition /CECI/home/ is directly accessible from all the login and compute nodes on all the clusters. Make sure to try it out! More information on https://support.ceci-hpc.be/doc/_contents/ManagingFiles/TheCommonFilesystem.html
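
    As a minimal illustration (the directory layout under /CECI/home/ is an assumption; see the linked documentation for your actual path):

        # on cluster A: stage data on the common filesystem
        cp results.tar.gz /CECI/home/your_dir/
        # on cluster B: the same file is directly accessible
        ls -lh /CECI/home/your_dir/results.tar.gz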

  • 21/10/2016 October

    BrENIAC: The new Tier-1 supercomputer in Flanders

    For the next four years, KU Leuven will host the VSC Tier-1 supercomputer. We would like to present this supercomputer to you. See https://www.vscentrum.be/en/news/detail/breniac-the-new-tier-1-supercomputer-in-flanders