Deploying Updated Repositories

Deployment of microservices is handled with a tool called “AWS Copilot”, which is designed to simplify deployment by setting up much of the underlying AWS configuration for you.

With copilot you can deploy the repo as it currently exists on your machine (whatever branch you have checked out) by running copilot svc deploy from the root of the directory that defines your container. Select the service (e.g. frontend or backend) and the environment (e.g. prod or development) you wish to push to. Copilot will build your container image, push it to the image registry in the AWS cloud, and then restart your containers with the new version.
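
For example, to skip the interactive prompts you can name the service and environment explicitly (a sketch using the copilot CLI flags):

copilot svc deploy --name backend --env development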

If something in the deployment fails (e.g. a circuit breaker is triggered or a health check fails) then copilot will automatically (albeit very slowly) roll back to the prior stable version. As a rule of thumb, always deploy to development first and confirm that it works before deploying to prod. It’s ok for a deployment to development to fail; assess what went wrong, make adjustments if necessary, and then try again.

Potential Improvement: Set up a CI/CD pipeline within GitHub. This could automatically push deployments from specific branches so that you don’t have to trigger new deployments manually.

Parameters

There are various parameters and permissions set up for both the frontend and backend services that must be stored securely. These are things like API keys and secrets that grant access to external tools (e.g. AWS and Mailgun).

Parameters and permissions must be configured for every deployed environment (prod and development). These parameters are defined in AWS and pulled into each container on launch, so the environment-appropriate values are automatically available inside the container. This is much more secure than stashing the values in a local .env file and pushing it with the rest of your repo to production.

Within the copilot/COMPONENT/manifest.yml file in the article_pipeline repo you will see various variables and secrets defined. Variables do not need to be secured and are stored directly in the file. Secrets need to be secured. They look like this…

  • COMPOUND_BUCKET_V1: /copilot/${COPILOT_APPLICATION_NAME}/${COPILOT_ENVIRONMENT_NAME}/secrets/COMPOUND_BUCKET_V1

Here COMPOUND_BUCKET_V1 becomes the environment variable in your repo (the same as if it were in the .env file) and the specified path is where the value can be found in the AWS Parameter Store.

This process ultimately creates the environment variables that the container uses in a deployed environment.

Updating / Adding Parameters

Parameters can be viewed / updated / managed using the AWS Parameter Store which can be found in Systems Manager -> Parameter store.

You can add new parameters directly in the interface or via the AWS CLI from your local terminal. If you’re going to create them by hand, be very careful to meet all of the requirements listed below.

Values must be stored in very specific ways to work with copilot.

  • All parameter entries MUST be stored as “secure strings” which just means that they are protected by encryption (the default settings are fine). If adding manually make sure this option is selected.

  • All parameter entries MUST have two specific TAGS associated with them or copilot will not have sufficient privilege to access them. If adding manually always verify these tags exist and are correct.
    • key: copilot-application, value: article-pipeline (copilot application name)
    • key: copilot-environment, value: development (must match the copilot environment that will pull your variable)
  • All parameter entries MUST match the names defined in the copilot manifest.yml files for the matching component (backend, frontend, etc.). For this project standardized naming is used as such… /copilot/COPILOT-APPLICATION/COPILOT-ENVIRONMENT/secrets/VARIABLE-NAME

  • If a parameter should be “empty” for a given environment (e.g. the development server doesn’t push to slack) then you should still create a Parameter Store entry for it and set its value to the string “None”. The code explicitly checks for this value and treats it as undefined when encountered. The values cannot be empty strings since the Parameter Store does not allow empty values.

Parameters can be freely updated. If you do update them you MUST restart or re-deploy the containers that use them for the changes to take effect.
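
A sketch of adding and updating a parameter from the AWS CLI (the parameter name follows the convention above; the values are placeholders):

# create a new SecureString parameter with the required copilot tags
aws ssm put-parameter \
  --name "/copilot/article-pipeline/development/secrets/COMPOUND_BUCKET_V1" \
  --value "YOUR_SECRET_VALUE" \
  --type SecureString \
  --tags Key=copilot-application,Value=article-pipeline Key=copilot-environment,Value=development

# update an existing parameter (note: --tags cannot be combined with --overwrite)
aws ssm put-parameter \
  --name "/copilot/article-pipeline/development/secrets/COMPOUND_BUCKET_V1" \
  --value "NEW_SECRET_VALUE" \
  --type SecureString \
  --overwrite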

Local Parameter Setup

Take the .env.template files in the settings folders of the frontend and backend. Copy the contents to your own .env file in the same directory as the .env.template file. Replace all keys and values with local alternatives. Upon launching, the frontend or backend should automatically pull in your custom-defined values.
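
For example (a sketch; the exact location of the settings folder depends on the repo layout):

# from the directory containing .env.template (e.g. the backend settings folder)
cp .env.template .env
# then edit .env and replace the keys / values with your local alternatives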

For local setup you will need to configure some of your own external resources to test with. The Dockerfile should automatically create a database, but for services that rely on cloud-based external systems, consider setting up a separate account (perhaps your own) to host an S3 bucket that you and you alone use for testing. This matters because your local database and your S3 bucket must stay in sync with one another. In some cases, such as the mailgun or slack webhooks, it’s fine to use the dev versions locally since they can be shared across testing instances without issue.

Deployment of a fresh instance

These are deployment instructions assuming you are starting from a fresh AWS account.

Create an NP-MRD Specific VPC

In AWS you must set up a private network whose components can talk to each other but can only be reached from the public internet in very specific ways. This is called a Virtual Private Cloud (VPC). One account can have many. For NP-MRD all resources share the same VPC.

The public internet reaches the VPC via an internet gateway which directs traffic to a “Load Balancer”. The load balancer should only expose the webpage-generating infrastructure, as we don’t want components like our database to be directly accessible from the internet.

If you are working from a new AWS account you should create a new VPC specifically for use with NP-MRD. Currently development and prod share the same VPC.

VPC Setup

Go to the VPC Panel and create an entry with the following settings…

  • Resources to create: VPC and more
  • Name tag auto-generation: True
  • Name-tag: Something that works (currently np-deposition)
  • IPv4 CIDR block: Default (10.0.0.0/16)
  • IPv6 CIDR block: No IPv6 CIDR block
  • Tenancy: Default
  • Number of Availability Zones: 2
  • First / Second Availability Zones: Use default values (i.e. us-west-2a and us-west-2b)
  • Number of Public Subnets: 2
  • Number of Private Subnets: 2
  • Customizable subnet CIDR blocks (it should be fine to rearrange these)
    • Public subnet CIDR block in us-west-2a: 10.0.2.0/24
    • Public subnet CIDR block in us-west-2b: 10.0.0.0/24
    • Private subnet CIDR block in us-west-2a: 10.0.3.0/24
    • Private subnet CIDR block in us-west-2b: 10.0.1.0/24
  • NAT gateways: None
  • VPC Endpoints: None. Note: it would probably be better to set up a dedicated S3 gateway endpoint.
  • Enable DNS hostnames: True
  • Enable DNS resolution: True
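
Once the wizard finishes you can sanity-check the VPC from the AWS CLI (a sketch; the Name tag filter assumes the np-deposition naming above):

aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=np-deposition*" \
  --query "Vpcs[].{VpcId:VpcId,Cidr:CidrBlock}" \
  --output table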

Create an AWS RDS (relational-database-services) instance

RDS Instance

Go to the RDS panel and configure an RDS instance. Here are the configuration settings. There are likely better settings that could be chosen, but these are what is currently used.

  • Full Configuration
  • PostgreSQL database
  • Engine Version: Default option should be fine
  • Templates: Production
  • Availability and durability: Single-AZ DB Instance (consider a larger option if need-be)
  • master username: postgres
  • Credentials management: Self management (save the password)
  • Instance config: Burstable classes / db.t3.small
  • Storage Type: General Purpose SSD (gp2)
  • Allocated Storage: 20 GiB (may eventually need to increase as the database grows, but the total size of the prod db is only ~400 MB as of Dec 2025)
  • Storage Autoscaling: Enable
  • Max Storage Threshold: 1000 GiB

Connectivity

  • Compute Resource: Don’t connect to an EC2 compute resource
  • VPC: Use the VPC you created earlier (in prod this is np-deposition-vpc)
  • DB subnet group
    • Create a new DB Subnet Group. This SHOULD automatically connect to the subnets setup in the VPC as long as they were configured as previously outlined. Here’s a list of the requirements…
      • At least two subnets
      • They are in different Availability Zones
      • They are PRIVATE subnets
      • They have valid IPv4 CIDRs
      • They belong to the selected VPC
  • Public Access: No
  • Certificate Authority: Default
  • Database Port: 5432

Database authentication

  • Database authentication: Password authentication

Monitoring

  • Database Insights: Standard
  • Enable Performance Insights: True (default)
  • Retention Period: 7 days (default)
  • AWS KMS key: aws/rds (default)
  • Enhanced Monitoring: True (default)
  • OS metrics granularity: 60 seconds (default)
  • Monitoring role for OS metrics: default
  • Log exports: All False (default)
  • DevOps Guru: False

Additional Configuration

  • No need to create an initial database
  • DB Parameter Group: Newest compatible (default.postgres17 on prod)
  • Enable Automated Backup: True
  • Backup Retention period: 7 days
  • Backup window: choose one and set duration to 0.5 hours
  • Copy tags to snapshots: True
  • Backup replication: False
  • Enable encryption: True
  • AWS KMS key: Default
  • Enable auto minor version upgrade: True
  • Maintenance window: Set 30 minutes after the backup window, duration = 0.5 hours
  • Enable Deletion Protection: True
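
Once the instance reports “Available” you can sanity-check it from the AWS CLI (a sketch, assuming your CLI profile is already configured — see the AWS CLI Config section below):

aws rds describe-db-instances \
  --query "DBInstances[].{Id:DBInstanceIdentifier,Status:DBInstanceStatus,Engine:EngineVersion}" \
  --output table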

IMPORTANT: Add Private VPC Access Point

By default my RDS instance only seemed to have a single public access point. In order for other apps inside the VPC to access it you need to set up an inbound rule in the VPC’s security group to allow this. I chose to make a generic rule that allows access from any app in the VPC, but the better practice is probably to set up individual rules for each app. Regardless, here’s how to set up the generic one…

  • Go to Aurora and RDS
  • Click on your Database Instance
  • Click on Proxies
  • Click on the VPC Security Group instance
  • Check the instance
  • Select “Inbound rules” at the bottom and click Edit Inbound rules
  • Create a new rule as such and save it…
    • Type: PostgreSQL
    • Source: Custom
    • Add your VPC’s IPv4 CIDR to the box (should default to “10.0.0.0/16”)
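
Alternatively, the same inbound rule can be added from the AWS CLI (a sketch; the security group ID is a placeholder for the one attached to your RDS instance):

aws ec2 authorize-security-group-ingress \
  --group-id sg-XXXXXXXXXXXX \
  --protocol tcp \
  --port 5432 \
  --cidr 10.0.0.0/16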

Creating a Deposition System Database

The easiest way to do this would likely be to use a client like pgAdmin and connect to your AWS RDS instance directly so that you can run commands to create a database. However, I’ve instead chosen to set up an OpenVPN EC2 instance inside the VPC, which serves as a machine you can launch as needed and SSH into for command line access. Because that machine lives inside your VPC, you can freely create and manage databases with the full suite of postgres commands.

Launching an OpenVPN EC2 Instance

First, create an OpenVPN EC2 instance you can use. AWS Marketplace makes this super easy.

Go to AWS Marketplace -> Discover Products and search for “OpenVPN self hosted”, then select the top match. Click view purchase options. Purchase a “usage based” instance. Then click subscribe. Once that’s completed click “Launch your Software”.

In Setup make sure that you use the service “Amazon EC2” and then select “Launch from EC2 Console” (don’t use the one-click launch) using the latest stable version and the correct region. Then follow this config…

Instance Type

  • Instance type: Something small like t2.micro should be fine for our uses
  • Key Pair: Create a new one and save the .pem file on your local computer, or use an existing key pair if you already have its .pem file

Network Settings

  • Select create security group.
  • Allow SSH Traffic from: Anywhere
  • Subnet: One of the public subnets
  • Auto-assign public IP: True
  • Create a new security group and remove the HTTPS rule. Leave all other settings default (there’s probably a better way to do this).

Configure Storage

  • Setup 25 GiB of gp3

The instance will launch automatically. IMPORTANT: Make sure to shut it down when you don’t need to use it directly so it doesn’t rack up a bill. You can shut it down / launch it in EC2 -> Instances where you can use the Instance state to either start or stop the instance.

Now take the .pem file you created and put it in a directory on your local machine that you will remember. Open that directory in your terminal and run the following commands to SSH into the instance. Note that this is also a useful way to manage your database in other situations (e.g. dumping it).

Click on the instance and check its Public IPv4 address to be able to connect to it.

chmod 400 YOUR_PEM_FILE.pem
ssh -i YOUR_PEM_FILE.pem openvpnas@PUBLIC_IPV4_ADDRESS # use the instance's Public IPv4 address

Use all the default settings. Create a password and write it down in case you need it later.

From here I suggest moving to a folder that you have read/write access to.

cd /usr/local/src

Installing the Postgres Client

Once you’ve connected, install the postgres client like so…

IMPORTANT: Make sure your installed psql version matches the postgres version in RDS, which you can check by going to the configuration page for your RDS instance under “Engine Version”.

sudo apt update
sudo apt install -y postgresql-client
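
After installing, confirm that the client major version matches the RDS engine version (if the default package is too old you may need a newer client from the PostgreSQL apt repository):

psql --version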

Now you can connect to the RDS postgres database with this command. You can find the database proxy by going to Aurora and RDS -> Selecting the database, clicking on the linked “proxy” and then copy pasting the “proxy endpoint”.

psql -h DATABASE_PROXY_ENDPOINT -U POSTGRES_USERNAME

Install AWS CLI in the OpenVPN Instance

Create a folder to hold the package and install it.

sudo mkdir -p /usr/local/src/awscli
cd /usr/local/src/awscli
# Install AWS CLI
sudo curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt update
sudo apt install unzip -y
sudo unzip awscliv2.zip
sudo ./aws/install
aws --version

Finally, Creating the Article Pipeline Database

Now that everything is configured first connect to your postgres instance…

psql -h DATABASE_PROXY_ENDPOINT -U POSTGRES_USERNAME

Then create a database…

CREATE DATABASE new_db_name;

You will then need to make sure that the database URI is added to your parameter store so that your deployed applications can access this new database.

Determining Your Database URL (URI)

Once you’ve created a database you will need a connection string that directs the backend to it. This only works from a machine inside your VPC since the client must be on the same network (i.e. a deployed backend or the OpenVPN EC2 instance). You will need to determine the IP of your database, which you can do like this…

openvpnas@openvpnas3:~$ nc -vz DATABASE_PROXY_ENDPOINT 5432

RESPONSE:
Connection to DATABASE_PROXY_ENDPOINT (DATABASE_IP_ADDRESS) 5432 port [tcp/postgresql] succeeded!

Then take the DATABASE_IP_ADDRESS and use it to determine your postgres URI like such…

postgresql://POSTGRES_USERNAME:POSTGRES_PASSWORD@DATABASE_IP_ADDRESS:5432/DATABASE_NAME

This must then be added to your parameter store under the name “DATABASE_URL” so that your backend service can actually connect to your database.
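
A sketch of adding it for the development environment (the value is a placeholder following the URI format above):

aws ssm put-parameter \
  --name "/copilot/article-pipeline/development/secrets/DATABASE_URL" \
  --value "postgresql://POSTGRES_USERNAME:POSTGRES_PASSWORD@DATABASE_IP_ADDRESS:5432/DATABASE_NAME" \
  --type SecureString \
  --tags Key=copilot-application,Value=article-pipeline Key=copilot-environment,Value=development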

Initialize Copilot

AWS CLI Config

First, you will need an IAM user with CLI privileges. If you do not have an access key on AWS you can go to IAM -> Users (yourself) -> Create Access Key -> Create a new access key. Write down your access key and secret.

You need to have the AWS CLI installed. On a Mac you can install it with brew. If you’re on Windows or Linux look online for the equivalent.

brew install awscli

After that, in your local terminal you will need to configure credentials that will allow you to connect to the hosted AWS account. You can do this by running the following command…

aws configure --profile PROFILE-NAME

Then you will need to enter your access key and secret. This will allow you access to the AWS account from your terminal.

Copilot App Initialization

Locally, navigate your terminal to the repo of the application you will be deploying.

Make sure that you’ve selected your new profile

export AWS_PROFILE=PROFILE-NAME

Then initialize copilot within it.

copilot app init

This will…

  • Create a StackSet admin assumed by CloudFormation to manage region-specific stacks
  • Create an IAM role to create / access various AWS resources (ECR, S3, etc.)

Create S3 Buckets

Before deploying any applications you will need to ensure that the S3 buckets that serve the NP-Deposition service have been created. You will need to create the following…

  • COMPOUND_BUCKET_V3: The bucket to hold deposited NMR data.
    • prod: article-pipeline-deposition-prod
    • dev: article-pipeline-deposition-development
    • Settings…
      • Object ownership: ACLs disabled
      • Block all public access
      • Bucket Versioning: enabled on prod, disabled on development
      • Encryption type: Server-side encryption with Amazon S3 managed keys (SSE-S3) and Enable bucket key
      • Object Lock: Disable
    • In the prod bucket, after it’s created select it and go to “Management”. Here you should click “Create lifecycle rule”
      • Rule scope: Apply to all objects in the bucket
      • Lifecycle rule actions:
        • Rule name: TrashRetention
        • Permanently delete noncurrent versions of objects: Enabled
        • Days after objects become noncurrent: 30
        • Number of newer versions to retain: Leave blank
  • S3_RSS_INGESTION_ARCHIVE_URL: Holds an archive of ingested publication information (prod only)
    • prod: article-pipeline-npa-ingestion-rss-archive
    • dev:
    • Object ownership: ACLs disabled
    • Enable blocking for only… (this could probably be adjusted)
      • Block public access to buckets and objects granted through new access control lists (ACLs)
      • Block public access to buckets and objects granted through new public bucket or access point policies
    • Bucket Versioning: disable
    • Encryption type: Server-side encryption with Amazon S3 managed keys (SSE-S3) and Disable bucket key
    • Object Lock: Disable
  • AWS_STORAGE_BUCKET_NAME: Holds static files for the Django admin system (note: Django requires this exact variable name for static files, which is why it’s so generic). Currently this bucket is shared between the development and prod deployments since it simply passes the same static config files to both.
    • prod/dev: article-pipeline-django-static
    • Object ownership: ACLs disabled
    • DO NOT Block all public access
    • Bucket Versioning: disable
    • Encryption type: Server-side encryption with Amazon S3 managed keys (SSE-S3) and Disable bucket key
    • Object Lock: Disable

    Once created you will need to add a specific bucket policy. This can be added like such…

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "Statement1",
                  "Effect": "Allow",
                  "Principal": "*",
                  "Action": [
                      "s3:PutObject",
                      "s3:PutObjectAcl",
                      "s3:GetObject",
                      "s3:GetObjectAcl",
                      "s3:DeleteObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::BUCKET_NAME", # REPLACE BUCKET NAME
                      "arn:aws:s3:::BUCKET_NAME/static", # REPLACE BUCKET NAME
                      "arn:aws:s3:::BUCKET_NAME/*" # REPLACE BUCKET NAME
                  ]
              }
          ]
      }
    
  • DEPOSITON_CHART_BUCKET: A bucket to hold deposition charts (prod only).
    • prod: article-pipeline-deposition-charts-prod
    • dev:
    • Object ownership: ACLs disabled
    • DO NOT Block all public access
    • Bucket Versioning: disable
    • Encryption type: Server-side encryption with Amazon S3 managed keys (SSE-S3) and Disable bucket key
    • Object Lock: Disable

Create Copilot Application / Specific Environment

Create an application in copilot. This will hold all resources that are used by the deposition system (frontend, backend, etc.).

Make sure you’re in the ROOT DIRECTORY of the article_pipeline repo on your local machine.

Create the copilot application…

copilot app init article-pipeline

Now create environments for prod and development. Note that the names are specifically “prod” and “development” because that’s how they were mistakenly initialized, and changing them now would be too involved.

copilot env init --name development

copilot env init --name prod

Note that we want to utilize the VPC we’ve already set up (prod and development share it in the current deployment). As such, select No, I'd like to import existing resources when deploying. Then make sure you select…

  • VPC: The VPC you created (select it by name)
  • Public Subnets: Press “right” on your keyboard to select BOTH, as they were previously configured
  • Private Subnets: Again, select both.
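
If you prefer to skip the interactive prompts, copilot env init also accepts import flags (a sketch; the VPC and subnet IDs are placeholders from your account):

copilot env init --name development \
  --import-vpc-id vpc-XXXXXXXX \
  --import-public-subnets subnet-PUBLIC1,subnet-PUBLIC2 \
  --import-private-subnets subnet-PRIVATE1,subnet-PRIVATE2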

Add Parameters to Parameter Store

Variables that need to be kept secure (e.g. API keys, passwords) but that our services need access to are stored in the AWS Parameter Store, which is the standard approach with copilot.

The parameter store can be found in the Systems Manager -> Parameter Store.

For a fresh deployment it’s probably easiest to push variables directly from the command line. Consult the “Parameters” section above for how to define all of your values. Make sure you define a parameter value for every “secret” in each copilot application / environment combination (e.g. backend/development) before you try to deploy it.
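
To confirm every secret is present before deploying, you can list what is stored under the expected path (a sketch for the development environment):

aws ssm get-parameters-by-path \
  --path "/copilot/article-pipeline/development/secrets" \
  --recursive \
  --query "Parameters[].Name"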

Copilot Deployment

Copilot will automatically read the files in the copilot/ directory of the article_pipeline repo. They will be used to guide it on how to build all of the infrastructure to complete a successful deployment. For np deposition the most important components are backend and frontend, which will deploy a functional version of the website. There are other sub-services that need to be deployed that will be discussed later.

Fresh Deployment

You will need to initialize the following to deploy your copilot application…

  • App (article-pipeline)
  • Env (prod, development, etc.)
  • Services (backend, frontend, etc.)

App

First you must initialize the app and provide the correct name. You will also need to associate the domain name with your application, so make sure to do that as well.

copilot app init --domain npdeposition.org

Run it again and confirm the domain association is correct.

Env

Now you must initialize your development and prod environments. This lets you deploy to them independently, so you can test changes in a deployed environment before they go live on prod.

copilot env init --name development
copilot env init --name prod

And then deploy these envs to AWS.

copilot env deploy --name development
copilot env deploy --name prod

Services (svc)

Now you must initialize a service for the “backend” and “frontend”, as well as any other microservices that may be used.

You may deploy each of them as a Load Balanced Web Service since that is how they are configured to work.

copilot svc init # Run for backend, frontend, etc.

Finally you will be able to actually deploy your application and push your repo to AWS, where it will be built and deployed into appropriate services.

Make sure your local Docker Desktop is open and running, then run this. You will need to deploy each combination of svc (e.g. backend / frontend) and env (e.g. development / prod) separately.

It is recommended that you first deploy the backend and then the frontend since they have problems talking to each other if deployed in the other order.

After the initial deployment is complete you should be ok to push both at the same time whenever you make new deployments.

copilot svc deploy # Once for each svc/env combo
  • Note: the frontend is prone to network issues, so if it ever fails to deploy, simply trying again will often solve the issue, especially if you get an error message like this one…
ERROR: failed to build: failed to solve: process "/bin/sh -c yarn install --frozen-lockfile" did not complete successfully: exit code: 1
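
Once the initial deployment exists, later pushes for every svc/env combination can be scripted (a sketch; adjust the service and environment names to what you actually run):

for env in development prod; do
  for svc in backend frontend; do
    copilot svc deploy --name "$svc" --env "$env"
  done
done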

Backups (AWS Backups)

Remember that there are two separate components that store core deposition information: a database (structured information, think spreadsheets) and S3 buckets that store files. Amazon Relational Database Service (RDS) has built-in systems to periodically back up data, as outlined in the database setup section. However, for S3 buckets a separate system must be configured.

This is achieved using the AWS Backup service to store backups of the production deposition bucket (which stores deposited NMR data). There are probably better ways to do this, but this is a currently valid configuration. It backs up the entire bucket periodically and places it into “cold storage”, which is essentially a cheap way to store large volumes of data, with the caveat that a restore will take a long time.

Go to AWS Backup -> Vaults -> Create Vault

  • Give it a clear name
  • Setup a “Backup Vault”
  • Use a default encryption key

Now go back to the AWS Backup home -> Backup plans -> Create a new backup plan

  • Use informative names for the backup plan and rules
  • Setup the backup to occur every month at least
  • Pick a time in the middle of the night for the backup to initialize and use default settings for “start within” and “complete within”.
  • I chose not to use point-in-time recovery, however, this may be a useful (if slightly more expensive) option
  • Check “Move backups from warm to cold storage” and set time in warm storage to 2 days
  • Set retention period to something reasonable (I chose 180 days)
  • No need for backup indexes

Assign Resources

  • Use the default IAM role
  • Choose “include specific resource types”
  • Select “S3” and “RDS”, then back up the database and the production NMR S3 bucket. This way you’ll have backups of both that are in sync with one another.

Now, for the S3 bucket backup to work, specific permissions need to be enabled. Go to the policy of any bucket you’ll be backing up and configure it like this…

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAWSBackupToReadBucket",
      "Effect": "Allow",
      "Principal": {
        "Service": "backup.amazonaws.com"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetBucketVersioning"
      ],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
    },
    {
      "Sid": "AllowAWSBackupToReadObjects",
      "Effect": "Allow",
      "Principal": {
        "Service": "backup.amazonaws.com"
      },
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion"
      ],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
    }
  ]
}
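
If you’d rather apply the policy from the CLI, save it to a file and attach it (a sketch; replace the bucket name and file path):

aws s3api put-bucket-policy \
  --bucket YOUR_BUCKET_NAME \
  --policy file://backup-bucket-policy.json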

Deploying the NER tool

Note this service exists within the article_pipeline repo.

This should be very straightforward. I recommend sticking to just deploying on prod and directing traffic from your development services to your prod system so they can share the same resource. If the NER tool eventually needs to undergo frequent changes or utilize A/B testing then you could think about swapping to an alternate configuration.

For deployment simply run…

copilot svc init

Make sure to select a “Load Balanced Web Service”. Let the initiation complete then simply run…

copilot svc deploy

Select nertool and make sure you select prod for your environment. The deployment should go live at the url that you specify in copilot/nertool/manifest.yml. That’s it.

Deploying the ML Tool

Note this service exists within the article_pipeline repo.

Again, this should be very straightforward. Simply run the same copilot svc init and copilot svc deploy commands but make sure that you select the ML tool. Only a prod deployment should be necessary.

The copilot configuration can be found in copilot/nertool/manifest.yml

Note that due to Fargate compatibility requirements a specific, CPU-only build of PyTorch is included in requirements.txt.
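
For reference, a CPU-only PyTorch wheel can be installed like this (a sketch; the exact pin in requirements.txt is what actually governs the deployed build):

pip install torch --index-url https://download.pytorch.org/whl/cpu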

Literature Proxy Service

Note this service exists in its own repo entirely ()