CSEP Gitlab
SCEC's CSEP project is coordinating an earthquake and ground motion forecasting project with national and international partners. Project members want to use GitLab to support the CSEP open-source scientific software community, including code development, distributed version control, and an automated software testing environment. The expected system requirements and preliminary estimates of data creation and transfer are given below.
CSEP CI/CD System Overview Diagram
Data Usage Estimates
These data usage estimates combine the two use cases for the GitLab repository: (1) code storage and (2) storage of large data files and experiment results. They are based on data from the current CSEP project, such as the number of project members, the storage used in current and old repositories, and rough estimates of expected data usage for past and future experiments.
- amount generated internally by the system per year, most of which can be left on AWS storage: 5TB/year
- amount of data transferred into AWS: 1TB/year
- amount of data transferred out of AWS: 2.5TB/year
Data Estimates Breakdown
Project Info:
CSEP Project Members: 20
CSEP1 Repo Size (in GB): 1.2
Storage per experiment (in GB): 10
(1) Code Storage
Total data stored: 25 GB (high-end estimate including models, codes, and benchmark data sets)
Data transferred: [50 GB - 500 GB]
(2) Data/Catalog Storage
Total Storage: 250 GB
Data transferred: [100 GB - 1 TB]
(3) Experiment Results
Total Storage: 4 TB
Data transferred: [100 GB - 1 TB]
Overall Estimates:
Total stored: ~4.5 TB
Transferred: [250 GB - 2.5 TB]
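The overall figures can be reproduced by summing the three categories. The following is a minimal sketch, not part of the original estimate: the dictionary and function names are hypothetical, and it assumes 1 TB = 1000 GB.

```python
# Figures copied from the breakdown above, all in GB.
# Assumption: 1 TB is treated as 1000 GB.
ESTIMATES_GB = {
    "code": {"stored": 25, "transfer": (50, 500)},
    "data_catalogs": {"stored": 250, "transfer": (100, 1000)},
    "experiment_results": {"stored": 4000, "transfer": (100, 1000)},
}


def totals(estimates):
    """Return (total stored, low transfer, high transfer) in GB."""
    stored = sum(e["stored"] for e in estimates.values())
    low = sum(e["transfer"][0] for e in estimates.values())
    high = sum(e["transfer"][1] for e in estimates.values())
    return stored, low, high


stored_gb, low_gb, high_gb = totals(ESTIMATES_GB)
print(f"Total stored: {stored_gb} GB")
print(f"Transferred: {low_gb} GB to {high_gb} GB")
```

The storage sum comes to 4275 GB, which the overall estimate rounds up to roughly 4.5 TB. The transfer range of 250 GB to 2500 GB matches the overall figure directly.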
AWS Estimation Webpage
Gitlab Basic System Requirements
- Recent Linux distribution (Ubuntu, CentOS, ...)
- Generate SSH keys
- Configure SMTP Server
- 8GB RAM is the recommended minimum memory size for all installations and supports up to 100 users. 16GB RAM supports up to 500 users
- Databases: PostgreSQL
- Redis/Sidekiq: Redis stores all user sessions and the background task queue; Sidekiq processes the background jobs with a multithreaded process
- Prometheus and its exporters
- Avoid installing GitLab Runner on the same machine where GitLab is installed.
- GitLab needs JavaScript enabled in browsers to support features such as Issue Boards.
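The memory guidance above can be read as a simple lookup from user count to RAM size. The sketch below is illustrative only: the function and table names are hypothetical, and it encodes just the two tiers listed on this page rather than any official GitLab sizing tool.

```python
# (max_users, ram_gb) tiers taken from the requirements list above.
RAM_TIERS = [(100, 8), (500, 16)]


def recommended_ram_gb(users):
    """Return the smallest listed RAM size (GB) that supports `users`.

    Returns None when the user count exceeds the tiers listed on this page.
    """
    for max_users, ram_gb in RAM_TIERS:
        if users <= max_users:
            return ram_gb
    return None


# The CSEP project currently lists 20 members.
print(recommended_ram_gb(20))
```

With the 20 project members listed under Project Info, the lookup falls in the first tier, so the 8GB minimum applies.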
CSEP Computer and Storage Inventory
- CSEP_Computers
- CSEP_Hardware_Inventory
- Github Docs
- SCEC CSEP Storage Summaries:
- SCEC Storage Summary: Usage Report
- Sorted by Username: Sorted Usage Report
Overview and Gitlab Installation
- Gitlab integration software.
- Gitlab Software Options
- GitLab and Runner Installation Information
- GitLab Installation using Docker
- GitLab Runner Installation with Docker
AWS Estimate
Here is follow-up information for you to look over to get a better idea of the setup, general information, and pricing of the services.
- Running GitLab on AWS:
- GitLab has fantastic documentation to get you set up and running. Using their Omnibus package or one of their Marketplace listings is a quick and easy way to get started. GitLab has different Marketplace listings based on which license is used; the community version is linked.
- It does require setup and maintenance on your end; it’s not a fully managed service. You would provision the GitLab instance and the Runner instance and complete their respective software installation and setup.
- Pricing GitLab on AWS:
- Factors to consider:
- Storage:
- S3 (Simple Storage Service) is a great service for storing your data. It is highly scalable, durable, and available object storage where you don’t need to provision anything. It’s significantly cheaper than holding the data in volumes attached to your instances.
- The EBS (Elastic Block Store) volumes are charged based on their provisioned size and are resizable, so you can start off smaller and resize if needed.
- Compute:
- EC2 Instance Savings Plans can greatly reduce your compute cost, with savings of roughly 30% to 60% over the on-demand instances shown in the price estimates. The savings vary with the commitment term (1 or 3 years) and the payment option (no upfront to full upfront).
- GitLab has a guide on autoscaling Runners on AWS: https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/
- Pricing estimates:
- The wiki setup estimate: https://calculator.aws/#/estimate?id=a540551f0239aef761c25ec52c9b87f5c0571563
- $8500/year
- This has two instances, the GitLab instance with 2vCPU and 8GB of RAM, and the Runner instance with 4vCPU and 16GB of RAM. Each has 4TB of hard drive storage. This estimate does not include backups.
- The lion’s share of the cost from this estimate comes from the 8TB of provisioned storage attached to the instances, the majority of which can be held in S3.
- Moving storage to S3 estimate: https://calculator.aws/#/estimate?id=636f64ec1dff1f2f2e8aacfe50793480e3dc2160
- $5700/year
- The same setup as above but with the majority of storage (5TB) offloaded into S3. The GitLab instance has 50GB of SSD storage and the Runner instance has 250GB of SSD storage, each with daily backups. Storage performance for the instances will be better due to using SSDs instead of hard drives.
- Including EC2 Instance Savings Plans: https://calculator.aws/#/estimate?id=7f6a08fbe7e1feba574cd8daf15145773f6fcb27
- $4500/year
- Same as the previous estimate but with one-year EC2 Instance Savings Plans paid upfront.
- Serverless Managed CI/CD Solution, CodePipeline:
- Fully managed CI/CD pipeline solution where you would just manage the users.
- Source Control: AWS CodeCommit is a private Git service that can securely store your source code, binaries, and application assets.
- Build: AWS CodeBuild allows you to build and test your application in preconfigured or custom environments.
- Deploy: AWS CodeDeploy automates software deployments to your AWS or on-premises servers.
- Usage-based pricing; there are no servers to provision or maintain.
- Free tier eligible so you can test out the services for free.
- Natively integrates with other AWS services and control plane.
- More information about CodePipeline: https://aws.amazon.com/codepipeline/
- Pricing estimates:
- $2500/year
- I’ve attached a spreadsheet with modifiable red input numbers to get an idea of how much this solution will cost.
- The 5TB of storage on S3 with 2.5TB of data transfer out per year ($1821/year) would be in addition to CodePipeline costs. For example, if the spreadsheet outputs $650/year, the total with S3 data storage, transfer, and CI/CD would be $2471/year.
Configurable Excel Spreadsheet
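The yearly figures quoted in the AWS estimate can be compared with a few lines of arithmetic. This is a rough sketch using only the dollar amounts stated above: the names are hypothetical, and the $650 CodePipeline figure is the example spreadsheet output from the text, not a measured cost.

```python
# Yearly estimates (USD) quoted in the AWS estimate above.
ANNUAL_ESTIMATES_USD = {
    "EC2 with attached storage": 8500,
    "EC2 with storage in S3": 5700,
    "EC2 with S3 and savings plans": 4500,
}

# 5TB stored in S3 plus 2.5TB transferred out per year, as quoted above.
S3_STORAGE_AND_TRANSFER_USD = 1821


def codepipeline_total(spreadsheet_output_usd):
    """Add the S3 storage and transfer cost to the CI/CD spreadsheet output."""
    return spreadsheet_output_usd + S3_STORAGE_AND_TRANSFER_USD


def savings_vs_baseline(estimates, baseline):
    """Percentage saved by each option relative to the baseline option."""
    base = estimates[baseline]
    return {
        name: round(100 * (base - cost) / base, 1)
        for name, cost in estimates.items()
    }


savings = savings_vs_baseline(ANNUAL_ESTIMATES_USD, "EC2 with attached storage")
for name, pct in savings.items():
    print(f"{name}: {pct}% below the attached-storage estimate")
print(f"CodePipeline example total: ${codepipeline_total(650)}/year")
```

Changing the argument to `codepipeline_total` mirrors editing the red input numbers in the spreadsheet: the S3 storage and transfer cost stays fixed while the CI/CD portion varies with usage.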