During the Ironic sessions at the OpenStack PTG in Dublin in spring 2018, there were discussions on adding a further burn-in step to the state machine of the OpenStack Bare Metal project (Ironic). The notes summarising the sessions were reported to the openstack-dev list. This blog post covers the CERN burn-in process for systems delivered to the data centers, as one example of how OpenStack Ironic users could benefit from a set of open source tools that burn in newly delivered servers as a stage within the Ironic workflow.
CERN hardware procurement follows a formal process compliant with public procurements. Following a market survey to identify potential companies in CERN's member states, a tender specification is sent to the companies asking for offers based on technical requirements.
Server burn in goals
Following the public procurement processes at CERN, large hardware deliveries occur once or twice a year and smaller deliveries several times a year. The overall resource management at CERN was covered in a previous blog. One of the steps before production is the burn in of new servers. The goals are:
- Ensure that the hardware delivered complies with CERN Technical Specifications
- Find systematic issues with all machines in a delivery such as bad firmware
- Identify failed components in single machines
- Provoke early failure of weak components through high load during stress testing
Depending on the hardware configuration, the burn-in tests take around two weeks on average, but the duration varies significantly (e.g. for systems with large amounts of memory, the memory tests alone can take up to two weeks). This has been found to be a reasonable balance between achieving the goals above and delaying the production use of the machines with further testing which may not find more errors.
The tender documents require successful completion of the CERN burn-in process before invoicing is completed.
Workflow
The CERN hardware follows a lifecycle from procurement to retirement as outlined below. The parts marked in red are the ones currently being implemented as part of the CERN Bare Metal deployment.
As part of the evaluation, test systems are requested from the vendor and these are used to validate compliance with the specifications. The results are also retained to ensure that the bulk equipment deliveries correspond to the initial test system configurations and performance.
Preliminary Checks
CERN requires that the Purchase Order ID and a unique System Serial Number are set in the NVRAM of the Baseboard Management Controller (BMC), in the Field Replaceable Unit (FRU) fields Product Asset Tag (PAT) and Product Serial (PS) respectively:
# ipmitool fru print 0 | tail -2
Product Serial : 245410-1
Product Asset Tag : CD5792984
The Product Asset Tag is set to the CERN delivery number and the Product Serial is set to the unique serial number for the system unit.
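The FRU check above lends itself to automation. The following is a minimal sketch, not the CERN tooling: it assumes the text of `ipmitool fru print 0` has already been captured, and the function names `parse_fru` and `check_fru` are hypothetical. It only verifies that both fields are present and non-empty.

```python
def parse_fru(output: str) -> dict:
    """Turn 'Key : Value' lines from `ipmitool fru print` into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_fru(fields: dict) -> list:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for name in ("Product Serial", "Product Asset Tag"):
        if not fields.get(name):
            problems.append(f"{name} is missing or empty")
    return problems


# Sample text copied from the ipmitool output shown in this post.
sample = """\
Product Serial : 245410-1
Product Asset Tag : CD5792984
"""

fields = parse_fru(sample)
print(fields["Product Serial"], fields["Product Asset Tag"], check_fru(fields))
```

On a real machine the text would come from running ipmitool, for instance via `subprocess.run`, rather than from a literal string.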
Likewise, certain BIOS fields have to be set correctly such as booting from network before disk to ensure the systems can be easily commissioned.
Once these basic checks have been done, the burn in process can start. A configuration file containing the burn-in tests to be run is created according to the information stored in the PAT and PS FRU fields. The tests enabled in the configuration file then start automatically.
Burn in
The burn in process itself is highlighted in red in the workflow above and consists of the following steps:
- Memory
- CPU
- Storage
- Benchmarking
- Network
Memory
The memtest stress tester is used for validation of the RAM in the system. Details of the tool are available at http://www.memtest.org/.
CPU
The CPU is tested using a set of burn tools: burnK7 or burnP6, and burnMMX. These tools not only test the CPU itself but are also useful for finding cooling issues such as broken fans, since the processors draw significant power while running these tests.
Disk
Disk burn ins are intended to create the conditions for early drive failure. Following the bathtub curve model of failure rates, the aim is to make drives that are prone to early failure fail before they enter production.
With this aim, we use the badblocks tool to repeatedly read and write the disks. SMART counters are then checked to see whether there are significant numbers of reallocated bad blocks, and the CERN tenders require disk replacement if the error rate is high.
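The SMART check described above can be sketched as follows. This is an illustration under stated assumptions rather than the CERN implementation: it assumes the attribute table printed by `smartctl -A` for an ATA disk, reads the raw value of the `Reallocated_Sector_Ct` attribute, and compares it against a threshold. The threshold value and the sample rows are hypothetical; the actual limits are set in the tender documents.

```python
MAX_REALLOCATED = 10  # hypothetical limit, not taken from the CERN tenders


def reallocated_sectors(smartctl_output: str):
    """Return the raw Reallocated_Sector_Ct value, or None if not reported."""
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(parts) >= 10 and parts[1] == "Reallocated_Sector_Ct":
            return int(parts[9])
    return None


def needs_replacement(smartctl_output: str, limit: int = MAX_REALLOCATED) -> bool:
    """True when the disk reports more reallocated sectors than the limit."""
    count = reallocated_sectors(smartctl_output)
    return count is not None and count > limit


# Illustrative rows in the layout of `smartctl -A`; the values are made up.
healthy = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"
suspect = "  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       24"

print(needs_replacement(healthy), needs_replacement(suspect))
```

A disk that does not report the attribute at all (some SSDs and NVMe devices) returns `None` here and would need a separate check.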
We still use this process although the primary disk storage for the operating system has now changed to SSD. There may be a case for minimising the writing on an SSD to maximise the life cycle of the units.
Benchmarking
Many of the CERN hardware procurements are based on the price for the total compute capacity needed. Given the nature of most physics processing, the total throughput of the compute farm is more important than individual processor performance. Thus, the highest total performance may be achieved by choosing processors which are slightly slower but less expensive.
CERN currently measures the CPU performance using a set of benchmarks based on a subset of the SPEC 2006 suite. The subset, called HEPSpec06, is run in parallel on each of the cores in the server to determine the total throughput from the system. Details are available at the HEPiX Benchmarking Working Group web site.
Since the offers include the expected benchmark performance, the results of the benchmarking process are used to validate the technical questionnaire submitted by the vendors. All machines in the same delivery would be expected to produce similar results so variations between different machines in the same batch are investigated.
CPU benchmarking can also be used to find problems where there is significant difference across a batch, such as incorrect BIOS settings on a particular system.
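One simple way to spot such variations is to compare each machine against the median score of its batch. The sketch below assumes per-host benchmark scores are already collected in a dictionary; the function name `find_outliers`, the 5% tolerance, the host names and the scores are all hypothetical and not taken from the CERN process.

```python
from statistics import median


def find_outliers(scores: dict, tolerance: float = 0.05) -> dict:
    """Return {host: relative deviation} for hosts that differ from the
    batch median by more than `tolerance` (a fraction, e.g. 0.05 = 5%)."""
    mid = median(scores.values())
    outliers = {}
    for host, score in scores.items():
        deviation = (score - mid) / mid
        if abs(deviation) > tolerance:
            outliers[host] = deviation
    return outliers


# Made-up scores for four machines of the same delivery.
batch = {"node01": 402.0, "node02": 398.5, "node03": 400.0, "node04": 341.2}

for host, deviation in find_outliers(batch).items():
    print(f"{host}: {deviation:+.1%} from batch median")
```

The median is used rather than the mean so that one badly configured machine does not shift the reference value for the rest of the batch.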
Disk performance is checked using a reference fio access suite. A minimum performance level in I/O is also required in the tender documents.
Networking
Networking interfaces are difficult to burn in compared to disks or CPU. To do a reasonable validation, at least two machines are needed. With batches of 100s of servers, a simple test against a single end point will produce unpredictable results.
Using a network broadcast, the test finds other machines running the stress test. The machines then pair up and run a number of tests:
- iperf3 is used for bandwidth, reversed bandwidth, udp and reversed udp
- iperf for full duplex testing (currently missing from iperf3)
- ping is used for congestion testing
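The pairing and the test list above could be sketched as below. This is a hedged illustration only: the broadcast discovery is not shown, the helper names `pair_up` and `client_commands` are hypothetical, and the host names and test duration are placeholders. The commands are built as argument lists (suitable for `subprocess.run`) using the standard iperf3 options `-c`, `-t`, `-R` and `-u`, and the iperf2 option `-d` for the dual (full duplex) test; each assumes a matching server is already listening on the peer.

```python
def pair_up(hosts: list):
    """Pair hosts in order; with an odd count the last host is left over."""
    pairs = [(hosts[i], hosts[i + 1]) for i in range(0, len(hosts) - 1, 2)]
    leftover = hosts[-1] if len(hosts) % 2 else None
    return pairs, leftover


def client_commands(server: str, seconds: int = 60) -> list:
    """Build the client-side command lines to run against `server`."""
    base = ["iperf3", "-c", server, "-t", str(seconds)]
    return [
        base,                                             # TCP bandwidth
        base + ["-R"],                                    # reversed bandwidth
        base + ["-u"],                                    # UDP
        base + ["-u", "-R"],                              # reversed UDP
        ["iperf", "-c", server, "-t", str(seconds), "-d"],  # iperf2 full duplex
    ]


# Placeholder host names standing in for machines found via broadcast.
pairs, leftover = pair_up(["node01", "node02", "node03"])
for client, server in pairs:
    for command in client_commands(server, seconds=30):
        print(client, "->", " ".join(command))
print("unpaired:", leftover)
```

An unpaired machine would have to wait for another peer to appear, and UDP tests would normally also set a target rate with `-b`, which is omitted here.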
Looking forward
CERN is currently deploying Ironic into production for bare metal management of machines. Integrating the burn in and retirement stages into the bare metal management states would bring easy visibility of the current state as the deliveries are processed.
The retirement stage is also of interest to ensure that no CERN configuration (such as Ironic BMC credentials or IP addresses) remains in the servers. CERN has often donated retired servers to other high energy physics sites such as SESAME in Jordan and Morocco, and this requires a full server factory reset before dismounting. This retirement step would be a more extreme cleaning followed by complete removal from the cloud.
Discussions with other scientific laboratories such as SKA through the OpenStack Scientific special interest group have shown interest in extending Ironic to automate the server on-boarding and retirement processes, as described in the session at the OpenStack Sydney summit. We'll be following up on these discussions at Vancouver.
Acknowledgements
- CERN IT department - http://cern.ch/it
- CERN Ironic and Rework Contributors
- Alexandru Grigore
- Daniel Abad
- Mateusz Kowalski
References
- Computing in High Energy Physics procurement paper (CHEP 2013) - http://iopscience.iop.org/article/10.1088/1742-6596/513/6/062003/pdf
- Automatic server registration and burn-in framework (HEPiX Fall 2013) - https://indico.cern.ch/event/247864/contributions/1570349/attachments/426708/592287/Automatic_Registration_Burn_in.ppt
- Backblaze failure report - https://www.backblaze.com/blog/hard-drive-stats-for-2017/
- Resource Management at CERN - http://openstack-in-production.blogspot.fr/2016/04/resource-management-at-cern.html