Berkeley Lab's National Energy Research Scientific Computing Center (NERSC) has an opening for a Site Reliability Engineer within the Operations area. The Operations team manages the NERSC HPC Data Center to ensure resources are available to 7000 global users on a 24x7 basis. The team also manages a data warehouse and notification infrastructure that must be available to continuously collect or queue data from heterogeneous data sources throughout the NERSC computational facility.
In this shift-based role, you will provide a variety of engineering support services in a 24x7 environment for the primary scientific computational facility for the Office of Science in the US Department of Energy (DOE) to ensure that NERSC is accessible, reliable, secure, and available to our scientific users. Additionally, you will work with teams to provide solutions on the ServiceNow platform and to implement and deliver applications and integrations on open source platforms in a fast-paced, agile, project-based environment.
What You Will Do:
Management of the Data Center:
Work 5 shifts per week to manage the NERSC HPC Facility. Some shifts may be onsite and some offsite; the schedule will be determined by staffing needs.
Review and respond to alerts from computer systems, storage, network, and other data center/facility related systems by triaging or calling appropriate on-call staff.
Respond to alerts from the OMNI cluster (data warehouse) to ensure that the system continues to collect data 24x7 and provide real-time information for diagnosis.
Management of the NOW platform:
Develop solutions to address general updates and configuration changes/requests.
Data Analysis and Visualization:
Use Kibana and Grafana to analyze and diagnose the health of HPC systems using plots and data analysis.
Create new plots and alerting schemes as new data sets become available.
What is Required:
Bachelor's Degree in Computer Science or a similar discipline and 8 years of relevant experience or an equivalent combination of work experience, education and certifications.
Hands-on experience as a Linux (or similar type of operating system) system administrator or system engineer in a customer-facing environment supporting data clusters, managing the replacement of hardware, and ensuring its continued availability to the user community. This can include assisting in the deployment of new nodes and internal switches into production, resolving ticket incidents, and working with vendors on hardware warranty replacements.
Hands-on application software development in the NOW framework or similar platform. Must understand ITOM processes such as Incident Management, Change Management and Problem Management within the NOW framework.
Demonstrated experience in a UNIX or Linux environment, with enough understanding of systems, storage, and network administration to respond to data center facility issues and to alerts from the systems mentioned above.
Demonstrated experience as a site reliability engineer or similar position with demonstrated skills in the following:
container management like Kubernetes.
virtualization technologies like oVirt.
systems monitoring software like Prometheus.
a data warehouse management system like the Elastic stack or VictoriaMetrics.
Demonstrated skills in visualization software such as Kibana (part of the ELK stack) and Grafana, with the knowledge to help other groups create plots or analyses of their data.
Hands-on experience with developing and maintaining diagnostic tools using programming languages like C, C++, Python, Java, or Perl, using knowledge of standard software development practices.
Networking: understanding of network theory and concepts such as TCP/IP, UDP, ICMP (networking protocols in general), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
Experience with network security such as configuring/maintaining ACLs and knowledge of firewalls.
NOW platform certification.
Knowledge of AJAX, HTML, CSS, and SOAP.
Knowledge of AngularJS.
Network programming or a network certification.
A certification in a system administration area.
This is a full-time career appointment, exempt (monthly paid) from overtime pay.
This position may be subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
Due to COVID-19, this position will tentatively be remote at first, but limited to individuals residing in the United States. Once the Bay Area shelter-in-place restrictions are lifted, work will be primarily performed at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA.
Equal Employment Opportunity: Berkeley Lab is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status. Berkeley Lab is in compliance with the Pay Transparency Nondiscrimination Provision under 41 CFR 60-1.4.
Internal Number: 92829
About Lawrence Berkeley National Laboratory
In the world of science, Lawrence Berkeley National Laboratory (Berkeley Lab) is synonymous with excellence. Thirteen scientists associated with Berkeley Lab have won the Nobel Prize. Fifty-seven Lab scientists are members of the National Academy of Sciences (NAS), one of the highest honors for a scientist in the United States. Thirteen of our scientists have won the National Medal of Science, our nation's highest award for lifetime achievement in fields of scientific research. Eighteen of our engineers have been elected to the National Academy of Engineering, and three of our scientists have been elected into the Institute of Medicine. In addition, Berkeley Lab has trained thousands of university science and engineering students who are advancing technological innovations across the nation and around the world.
Berkeley Lab is a member of the national laboratory system supported by the U.S. Department of Energy through its Office of Science. It is managed by the University of California (UC) and is charged with conducting unclassified research across a wide range of scientific disciplines. Located on a 200-acre site in the hills above the UC Berkeley campus that offers spectacular views of the San Francisco Bay, Berkeley Lab employs approximately 4,200 scientists, engineers, support staff and students. Its budget for 2011 is $735 million, with an additional $101 million in funding from the American Recovery and Reinvestment Act, for a total of $836 million.
A recent study estimates the Laboratory's overall economic impact through direct, indirect and induced spending on the nine counties that make up the San Francisco Bay Area to be nearly $700 million annually. The Lab was also responsible for creating 5,600 jobs locally and 12,000 nationally. The overall economic impact on the national economy is estimated at $1.6 billion a year. Technologies developed at Berkeley Lab have generated billions of dollars in revenues, and thousands of jobs. Savings as a result of Berkeley Lab developments in lighting and windows, and other energy-efficient technologies, have also been in the billions of dollars.
Berkeley Lab was founded in 1931 by Ernest Orlando Lawrence, a UC Berkeley physicist who won the 1939 Nobel Prize in physics for his invention of the cyclotron, a circular particle accelerator that opened the door to high-energy physics. It was Lawrence's belief that scientific research is best done through teams of individuals with different fields of expertise, working together. His teamwork concept is a Berkeley Lab legacy that continues today.