
Amazon S3

  • Object-based storage
  • Maximum object size is 5 TB
  • Uploads larger than 5 GB must use multipart upload (a single PUT is capped at 5 GB)
  • Objects also carry metadata, tags, and an optional version ID
  • Private by default
  • Strong read-after-write consistency
  • 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix

S3 vs EBS

(image: S3 vs EBS comparison table)

Requester Pays

  • The requester pays the request and data transfer costs; the bucket owner still pays for storage
  • Use it to share large data sets with other accounts
  • The requester must be an authenticated AWS identity (anonymous access is not allowed)
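
A minimal CLI sketch (bucket, key, and file names are placeholders): the owner enables Requester Pays, after which requesters must explicitly acknowledge the charges on each request.

    # enable Requester Pays (run by the bucket owner)
    aws s3api put-bucket-request-payment \
        --bucket DOC-EXAMPLE-BUCKET \
        --request-payment-configuration Payer=Requester

    # requesters must acknowledge the charges on every request
    aws s3api get-object \
        --bucket DOC-EXAMPLE-BUCKET \
        --key datasets/large.csv \
        --request-payer requester large.csv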

S3 event notifications

  • Trigger an action (e.g., a Lambda function) on an API request to the bucket; a sketch of wiring this up follows below
  • Can also use EventBridge for more targets and filters
    • EventBridge capabilities - archive, replay events, reliable delivery
    • Multiple destinations - Step Functions, Kinesis Streams / Firehose
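
A hedged sketch of wiring a bucket to a Lambda function (the function ARN and account ID are placeholders; the Lambda must separately grant s3.amazonaws.com permission to invoke it):

    aws s3api put-bucket-notification-configuration \
        --bucket DOC-EXAMPLE-BUCKET \
        --notification-configuration '{
          "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:on-upload",
            "Events": ["s3:ObjectCreated:*"]
          }]
        }'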

Targets

  • SNS topics
  • SQS queues
  • Lambda functions

S3 Access Logs

  • Monitor request traffic to an S3 bucket
  • Can be used for access and security audits
  • Helps you understand your S3 bill

MFA-protected API Access

  • enforce MFA for access to S3 resources

MFA delete

  • Enforce MFA for permanently deleting object versions and for changing the bucket's versioning state
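
A sketch of enabling MFA Delete; it must be set by the bucket owner using root credentials, and the MFA device serial and code below are placeholders:

    aws s3api put-bucket-versioning \
        --bucket DOC-EXAMPLE-BUCKET \
        --versioning-configuration Status=Enabled,MFADelete=Enabled \
        --mfa "arn:aws:iam::111122223333:mfa/root-mfa-device 123456"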

Versioning

  • Once object versioning is enabled, it can't be disabled - only suspended
    • With versioning enabled, each object has a key (its name) plus a unique version ID (an opaque string generated by S3, not an incrementing integer)
    • Deletes use a "delete marker" by default (a soft delete that hides all previous versions)
  • When object versioning is suspended, AWS still charges you for all the object versions previously generated
  • If an object is updated, the old version still persists within the bucket
  • If an object is deleted, the new version will be a delete marker, and the old version will still persist within the bucket
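
A quick sketch of enabling versioning and inspecting the versions it creates (bucket name and prefix are placeholders):

    # enable versioning (can later only be suspended, never disabled)
    aws s3api put-bucket-versioning \
        --bucket DOC-EXAMPLE-BUCKET \
        --versioning-configuration Status=Enabled

    # list all versions and delete markers under a prefix
    aws s3api list-object-versions \
        --bucket DOC-EXAMPLE-BUCKET --prefix reports/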

Commands

S3 Sync

  • Copies objects between buckets
  • Only copies the differences between source and target
  • Uses the last-modified timestamp to decide what to update
  • On versioned buckets, only the current version is copied
    aws s3 sync s3://DOC-EXAMPLE-BUCKET-SOURCE s3://DOC-EXAMPLE-BUCKET-TARGET
    

Storage classes

  • Objects can transition to the Infrequent Access tiers only after 30 days in S3 Standard
  • All classes are designed for 11 9's (99.999999999%) durability
  • Availability varies by class: 99.99% for Standard, 99.9% for Standard-IA, 99.5% for One Zone-IA

| Storage Class | Lifecycle Transitions | Use Cases | Minimum Capacity Charge | Minimum Storage Duration Charge | Retrieval Fee | First Byte Latency | Availability Zones Supported |
|---|---|---|---|---|---|---|---|
| S3 Standard | All other S3 storage tiers | Frequently accessed data that requires low latency | None | None | None | Milliseconds | At least 3 |
| S3 Standard-IA | Intelligent-Tiering and colder | Backups, disaster recovery, and infrequently accessed data | 128 KB | 30 days | Per GB | Milliseconds | At least 3 |
| S3 Intelligent-Tiering | One Zone-IA and colder | Data with changing access patterns or unpredictable usage | None | 30 days | None | Milliseconds | At least 3 |
| S3 One Zone-IA | Glacier Flexible Retrieval and colder | Storing secondary backups or replicas of easily reproduced data | 128 KB | 30 days | Per GB | Milliseconds | Single Availability Zone |
| S3 Glacier Instant Retrieval | Glacier Flexible Retrieval and colder | Rapid access to archived data at a slightly higher cost | 128 KB | 90 days | Per GB | Milliseconds | At least 3 |
| S3 Glacier Flexible Retrieval (formerly S3 Glacier) | Glacier Deep Archive | Archiving rarely accessed data with long-term retention; bulk retrieval of large amounts of data at lower cost | 40 KB | 90 days | Per GB | Minutes to hours | At least 3 |
| S3 Glacier Deep Archive | No transitions | Long-term data retention with extremely low cost requirements | 40 KB | 180 days | Per GB | 12 to 48 hours | At least 3 |
| S3 Outposts | No transitions | Hybrid cloud storage extending S3 to on-premises infrastructure | None | None | None | Milliseconds | Depends on the Outpost configuration |

| Storage Class | Use Cases | Ideal Use Case Scenario |
|---|---|---|
| S3 Standard | Frequently accessed, important, non-replaceable data | Dynamic website content, mobile applications, real-time analytics, content distribution |
| S3 Intelligent-Tiering | Long-lived data with changing or unknown access patterns | Data with varying access frequencies, cost optimization for unknown access patterns |
| S3 Standard-IA | Long-lived, infrequently accessed data | Long-term storage of infrequently accessed data, compliance and regulatory data |
| S3 One Zone-IA | Long-lived, infrequently accessed, non-critical, replaceable data | Secondary copies of on-premises data, transitional or short-term storage |
| S3 Glacier Instant Retrieval | Long-lived data, rarely accessed (e.g., once per quarter) | Rapid retrieval of archived data for urgent access or frequent retrieval |
| S3 Glacier Flexible Retrieval | Archival data where frequent or real-time access isn't needed (e.g., yearly) | Bulk retrieval of large datasets, data migration, data analysis |
| S3 Glacier Deep Archive | Archival data that rarely, if ever, needs to be accessed (e.g., legal or regulatory data) | Long-term data archiving, digital preservation, compliance data |
| S3 Outposts | Hybrid cloud storage extending S3 to on-premises infrastructure | Hybrid cloud environments, applications with data residency requirements |
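
To place an object directly into a cheaper class at upload time, the high-level CLI accepts a --storage-class flag; a small sketch (bucket and file names are placeholders):

    # upload straight into Standard-IA instead of the default S3 Standard
    aws s3 cp backup.tar.gz s3://DOC-EXAMPLE-BUCKET/backups/backup.tar.gz \
        --storage-class STANDARD_IA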

S3 Standard

  • The default storage class in S3; it should usually be your default as well.
  • S3 Standard is region-resilient and can tolerate the failure of an AZ.
  • Can transition to all other S3 tiers.
  • Objects are replicated across at least 3 AZs when they are uploaded.
  • 99.999999999% durability
  • 99.99% availability
  • Offers low latency and high throughput.
  • No minimums, delays, or penalties.
  • Billing is storage fee, data transfer fee, and request based charge.

All of the other storage classes trade away some of these properties, usually in exchange for lower cost.

S3 Standard-IA

Designed for data that isn't accessed often, long term storage, backups, disaster recovery files. The requirement for data to be safe is most important.

  • Designed for data that is accessed less frequently but needs rapid access when required.
  • can transition to:
    • S3 Intelligent Tiering
    • One Zone IA
    • Glacier Instant retrieval
    • Glacier Flexible retrieval
    • Glacier Deep Archive.
  • Cheaper rate to store data you will rarely need, but if you do need it, you need it quickly.
  • Costs roughly 54% of S3 Standard's storage price.
  • Minimum billable size of 128 KB per object.
    • Cost benefits might be negated for smaller objects.
  • 30 days minimum duration charge per object.
  • Retrieval fee for every GB of data retrieved from this class.
  • 99.9% availability, slightly lower than standard S3.

One Zone-IA

Great choice for secondary copies of primary data or backup copies.

If data is easily creatable from a primary data set, this would be a great place to store the output from another data set.

  • Designed for data that is accessed less frequently but needed quickly.
  • can transition to Glacier Flexible retrieval & Glacier Deep Archive.
  • 80% of the base cost of Standard-IA.
  • Same minimum size and duration fee as Standard-IA
  • Data is only stored in a single AZ, no 3+ AZ replication.
  • 99.5% availability, lower than Standard-IA

S3 Glacier Instant Retrieval

Suitable for applications that require infrequent access to archived data but need fast retrieval times when accessed.

  • Like S3 Standard-IA: cheaper storage, more expensive retrieval, longer minimums.
  • Can transition to Glacier Flexible Retrieval & Glacier Deep Archive.
  • Has a per-GB data retrieval fee; cost increases with frequent data access.
  • Should be used for long-lived data accessed roughly once per quarter that still needs millisecond access.
  • Minimum duration charge of 90 days; objects can be stored for less, but the minimum billing always applies.
  • Minimum capacity charge of 128 KB per object.

S3 Glacier Flexible Retrieval

Archival data where frequent or real-time access isn't needed (e.g., yearly). Retrieval takes minutes to hours.

  • can transition to Glacier Deep archive.
  • Faster data retrieval than Deep Archive, with retrieval times starting at minutes.
  • Minimum duration charge of 90 days; objects can be stored for less, but the minimum billing always applies.
  • Minimum capacity charge of 40 KB per object.
  • Objects cannot be made publicly accessible.
    • Any access to the data (beyond object metadata) requires a retrieval process.
  • The retrieval process first restores the data to S3 Standard-IA temporarily.
  • Retrieval Process offers three retrieval options which vary in speeds:
    • Expedited: Data retrieval (1-5 mins)
    • Standard: (3-5 hours)
    • Bulk: (5-12 hours)
  • Faster = More Expensive $ $ $

Provisioned Retrieval Capacity

  • Guarantees up to 150 MB/s of retrieval throughput for expedited retrievals

Glacier Deep Archive

Archival data that rarely if ever needs to be accessed - hours or days for retrieval.

  • The retrieval process first restores the data to S3 Standard-IA temporarily.
  • The retrieval process offers two retrieval options which vary in speed:
    • Standard: data retrieval in ~12 hours
    • Bulk: up to 48 hours
  • First byte latency = hours or days
  • Objects cannot be made publicly accessible
    • any access of data (beyond object metadata) requires a retrieval process.

Lifecycle Rules

Transition Actions

  • Move objects to a different storage tier after a set amount of time.
  • Can transition to the Infrequent Access tiers only after 30 days.
  • Automates the moving of objects between the different storage tiers.
  • Can be used in conjunction with versioning.
  • Lifecycle rules can be applied to both current and previous versions of an object.
  • These actions can be classified as follows:

    • Transition actions – In which you define when objects transition to another storage class. For example, you may choose to transition objects to the STANDARD_IA (IA, for infrequent access) storage class 30 days after creation, or archive objects to the GLACIER storage class one year after creation.

    • Expiration actions – In which you specify when the objects expire. Then Amazon S3 deletes the expired objects on your behalf.

Transition Limitations for Lifecycle Rules

(image: supported lifecycle transition paths between storage classes)

Expiration Actions

  • Delete access logs
  • Delete old versions of files
  • Delete incomplete multipart uploads
  • Rules can apply to a prefix (path) or the full bucket
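
A sketch of a lifecycle configuration combining transition and expiration actions (rule ID, prefix, and day counts are illustrative):

    aws s3api put-bucket-lifecycle-configuration \
        --bucket DOC-EXAMPLE-BUCKET \
        --lifecycle-configuration '{
          "Rules": [{
            "ID": "archive-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
              {"Days": 30,  "StorageClass": "STANDARD_IA"},
              {"Days": 365, "StorageClass": "GLACIER"}
            ],
            "Expiration": {"Days": 730},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
          }]
        }'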

Transfer Acceleration

Requirements:

  • Transfer Acceleration must be enabled on the bucket
  • Bucket name can't contain periods
  • Bucket name needs to be DNS-compatible

How it works:

  • Fast transfers over long distances
  • Uses CloudFront edge locations
  • Routes via an optimized network path
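
A minimal sketch of turning acceleration on and then targeting the accelerate endpoint for a transfer (bucket and file names are placeholders):

    # enable acceleration on the bucket
    aws s3api put-bucket-accelerate-configuration \
        --bucket DOC-EXAMPLE-BUCKET \
        --accelerate-configuration Status=Enabled

    # route this transfer through the accelerate endpoint
    aws s3 cp big-file.bin s3://DOC-EXAMPLE-BUCKET/ \
        --endpoint-url https://s3-accelerate.amazonaws.com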

Multipart Uploads

  • Transmit the object in chunks.
  • If one chunk fails, only that chunk needs to be retransmitted.
  • Should be considered for objects over 100 MB.
  • Maximum of 10,000 parts; each part can range from 5 MB to 5 GB.
  • The last part can be smaller than 5 MB.
  • parts can fail and be restarted.
  • improved throughput
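
The high-level aws s3 commands switch to multipart automatically once a file crosses a configurable threshold; a sketch of tuning that behaviour (sizes are illustrative):

    # raise the threshold at which multipart kicks in (default is 8 MB)
    aws configure set default.s3.multipart_threshold 100MB
    aws configure set default.s3.multipart_chunksize 64MB

    # this upload now uses multipart with 64 MB parts
    aws s3 cp big-file.bin s3://DOC-EXAMPLE-BUCKET/big-file.bin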

S3 Bucket Policies

  • Allow or deny permissions on a bucket and its objects
  • Attached to buckets (unlike IAM identity policies, which attach to users, groups, or roles)
  • Can grant access to other AWS accounts
  • Can restrict access based on multiple request conditions (source IP, time, SSL, etc.); see the sketch below
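
A sketch of a bucket policy enforcing one such condition, denying any request that does not use SSL (bucket name is a placeholder):

    aws s3api put-bucket-policy --bucket DOC-EXAMPLE-BUCKET --policy '{
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "DenyNonSSL",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [
          "arn:aws:s3:::DOC-EXAMPLE-BUCKET",
          "arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
        ],
        "Condition": {"Bool": {"aws:SecureTransport": "false"}}
      }]
    }'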

ACL

  • An ACL can also be used to grant another account access

S3 Replication

There are two types of S3 replication available.

  • Cross-Region Replication (CRR)
    • Allows the replication of objects from a source bucket to a destination bucket in different AWS regions.
  • Same-Region Replication (SRR)
    • Allows the replication of objects from a source bucket to a destination bucket in the same AWS region.

Why use replication

  • SRR: log aggregation
  • SRR: sync production and test accounts
  • CRR: resilience with strict sovereignty requirements
  • CRR: global resilience improvements
  • CRR: latency reduction
  • Versioning must be enabled on both buckets
  • Can be cross-region or same-region
  • The destination bucket can be in another account
  • Replication is asynchronous
  • Use Batch Replication to replicate objects that already existed before replication was enabled
  • Replication can't be chained (a replica is not replicated onward)
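
A sketch of a replication configuration; the IAM role (which must be able to read the source and write the destination) and both bucket names are placeholders:

    aws s3api put-bucket-replication \
        --bucket DOC-EXAMPLE-BUCKET-SOURCE \
        --replication-configuration '{
          "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
          "Rules": [{
            "ID": "replicate-everything",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::DOC-EXAMPLE-BUCKET-TARGET"}
          }]
        }'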

Encryption

Buckets aren't encrypted; objects are. Objects in the same bucket can each use a different encryption method.

Client-Side encryption

  • Objects are encrypted by the client before they leave the client side.
  • Data is ciphertext the entire time it is in transit.
  • AWS has no way to see into the data.
  • The encryption burden is on the customer and not AWS.

Server-Side encryption

  • Data is encrypted in transit using HTTPS.
  • Inside the tunnel, the data is still in its original unencrypted form.
  • Data reaches the S3 servers in plaintext.
  • After S3 receives the data, it is encrypted at rest.
  • AWS handles some or all of these processes.
| Encryption Option | Use Case | Customer Responsibilities | AWS Responsibilities | Tradeoffs |
|---|---|---|---|---|
| SSE-S3 (default) | Encryption at rest with S3-managed keys; no admin overhead or additional key management tasks | None | Encryption, decryption, key management (generation & rotation) | Limited control over the encryption keys; keys are managed by AWS |
| SSE-KMS | Encryption at rest with AWS KMS managed keys | Configure and manage the KMS keys | Encryption, decryption, key management (generation & rotation) | More control and additional security features through KMS; customer manages and controls the KMS keys |
| SSE-C (customer-provided keys) | Flexible; customer generates & rotates keys themselves but doesn't want to handle encryption/decryption client-side | Generate, manage, and rotate the encryption keys | Encryption, decryption | Offloads CPU usage from the client; complete control and ownership of key management; good for heavily regulated environments |


Server-Side Encryption with Amazon S3 Managed Keys - SSE-S3 (AES-256) - default

  • AWS-managed keys (AES-256)
  • Server-side encryption of objects
  • Maintained by default when objects are replicated

SSE-S3 Caveats

  • Not good for regulatory environment where keys and access must be controlled.
  • No control over key material rotation.
  • No role separation.
    • A full S3 admin can decrypt data and open objects.

SSE-C

  • Customer is responsible for supplying the keys.
  • Customer still needs to generate and manage the keys.
  • Can use the same key for all objects or an individual key for every single object.
  • S3 manages the actual encryption and decryption.
    • Offloads CPU requirements for encryption from the client.
  • S3 will see the unencrypted object during this process.
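
A sketch of an SSE-C round trip; the 256-bit key file is generated locally and must accompany every request (names are placeholders, and SSE-C only works over HTTPS):

    # generate a 256-bit key locally
    openssl rand 32 > sse-c.key

    # S3 encrypts with the supplied key, then discards it
    aws s3 cp secret.txt s3://DOC-EXAMPLE-BUCKET/secret.txt \
        --sse-c AES256 --sse-c-key fileb://sse-c.key

    # the same key must be supplied again to read the object back
    aws s3 cp s3://DOC-EXAMPLE-BUCKET/secret.txt secret.txt \
        --sse-c AES256 --sse-c-key fileb://sse-c.key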

SSE KMS

  • When copying or replicating objects into a new bucket, specify the new KMS key for the objects
  • Need an IAM role that can decrypt with the source key and encrypt with the new key
  • Might hit KMS throttling errors; request a Service Quotas increase if so
  • Can use multi-Region KMS keys
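
A sketch of uploading with SSE-KMS under a specific customer-managed key (the key ARN is a placeholder):

    aws s3 cp report.pdf s3://DOC-EXAMPLE-BUCKET/report.pdf \
        --sse aws:kms \
        --sse-kms-key-id arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab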

Bucket keys

  • Offload some of the per-object KMS work when used in conjunction with KMS (a bucket-level key means fewer KMS requests)
  • CloudTrail KMS events then show the bucket rather than each object
  • Works with replication: the object's encryption is maintained
  • If replicating plaintext into a bucket using bucket keys, the object is encrypted at the destination (the ETag changes)

Bucket Policies

  • Common use cases:
    • grant access to other AWS accounts
    • anonymous access to a bucket
  • Identity policies are limited to identities in the single owner account
  • Resource policies can allow/deny principals from the same or different accounts
    • can allow/deny anonymous principals

Security Hints

Hints to choose one over the other, based on the scenario:

  • Identity: controlling different resources
  • Identity: you have a preference for IAM
  • Identity: same account
  • Bucket: when you need to control only the security of S3
  • Bucket: anonymous or cross-account access
  • ACLs: NEVER - unless you must

Use cases

  • compliance
  • latency
  • replication across accounts
  • log aggregation
  • sync data to test account from prod

Performance

  • First-byte latency of 100-200 ms
  • 3,500 PUT/COPY/POST/DELETE requests per second per prefix
  • 5,500 GET/HEAD requests per second per prefix
  • Use multipart upload for parallel uploads
    • recommended for files > 100 MB
    • MUST be used for files > 5 GB
  • S3 Transfer Acceleration uploads to an edge location instead of directly to S3; also use it for downloads when files are bigger than 1 GB
  • byte range fetches
    • parallelize GETs by requesting specific byte ranges.
    • better resilience in case of failures - only retry small byte range
    • allows you to speed up downloads
    • can be used to retrieve only partial data (e.g., the head of a file)
    • enables multipart-style downloads
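
A sketch of a byte-range fetch pulling only the first kilobyte of an object (bucket, key, and file names are placeholders):

    aws s3api get-object \
        --bucket DOC-EXAMPLE-BUCKET \
        --key data/big-file.csv \
        --range bytes=0-1023 \
        head-of-file.csv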

Select & Glacier Select

  • Retrieve less data using SQL by performing server-side filtering
  • can filter by rows & columns (simple SQL statements)
  • less network transfer
  • less CPU cost client-side.
  • Filter data server-side to reduce network cost (e.g., 100 rows out of a 1-million-row CSV)
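
A sketch of an S3 Select query against a CSV object (bucket, key, and column names are hypothetical):

    aws s3api select-object-content \
        --bucket DOC-EXAMPLE-BUCKET \
        --key data/records.csv \
        --expression "SELECT s.name, s.city FROM S3Object s WHERE s.country = 'DE'" \
        --expression-type SQL \
        --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
        --output-serialization '{"CSV": {}}' \
        filtered.csv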


Batch operations

  • modify all metadata and properties
  • copy objects between buckets
  • encrypt all unencrypted objects
  • modify ACL and tags
  • restore objects from glacier
  • invoke lambda for each object
  • Use S3 Select to filter the object list, then run the batch operation on the result
  • S3 Batch operations manages:
    • retries
    • tracks progress
    • sends completion notifications
    • generate reports

Vault Lock

  • Objects cannot be modified or deleted
  • Used for compliance
  • The vault lock policy can itself be locked against future edits

Server Access Logging

  • Use this when CloudTrail does not provide enough information

Features

  • Includes the Referer header
  • Includes turnaround time
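
A sketch of enabling server access logging (the target bucket is a placeholder and must separately permit log delivery):

    aws s3api put-bucket-logging \
        --bucket DOC-EXAMPLE-BUCKET \
        --bucket-logging-status '{
          "LoggingEnabled": {
            "TargetBucket": "DOC-EXAMPLE-LOG-BUCKET",
            "TargetPrefix": "access-logs/"
          }
        }'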

Pre-Signed URLs

  • URL Expiration
    • S3 Console: 1 minute to 720 minutes (12 hours)
    • AWS CLI: up to 604,800 seconds (168 hours), default 3,600
    • Users given a pre-signed URL inherit the permissions of the user that generated the URL, for GET / PUT
  • use cases
    • Allow only logged-in users to download a premium video from your S3 bucket.
    • allow an ever-changing list of users to download files by generating URLs dynamically.
    • temporarily allow a user to upload a file to a precise location in your S3 bucket
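
A sketch of generating a download URL with the CLI; note the CLI's presign command signs GET requests (pre-signed uploads are generated via the SDKs), and the object key is a placeholder:

    # URL valid for one hour; maximum is 604800 seconds (7 days)
    aws s3 presign s3://DOC-EXAMPLE-BUCKET/premium/video.mp4 \
        --expires-in 3600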

Access Points (AP)

  • Simplify security management for S3 buckets
  • Each AP has:
    • its own DNS name (internet origin or VPC origin)
    • an access point policy (similar to a bucket policy) to manage security at scale
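
A sketch of creating an access point and addressing an object through it (account ID, names, and region are placeholders):

    aws s3control create-access-point \
        --account-id 111122223333 \
        --name analytics-ap \
        --bucket DOC-EXAMPLE-BUCKET

    # objects are then addressed via the access point ARN
    aws s3api get-object \
        --bucket arn:aws:s3:us-east-1:111122223333:accesspoint/analytics-ap \
        --key data/file.csv file.csv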


Best practices

  1. Use the appropriate AWS Region:
    • Choose the AWS Region closest to your users or application to minimize latency and improve performance.
  2. Select the right S3 storage class:
    • Choose the appropriate S3 storage class based on your data access patterns and requirements.
    • For frequently accessed data, use S3 Standard or S3 Intelligent-Tiering.
    • For infrequently accessed or long-term archival data, consider S3 Standard-IA, S3 One Zone-IA, or S3 Glacier.
  3. Enable S3 Transfer Acceleration:
    • S3 Transfer Acceleration uses optimized edge locations to speed up data uploads and downloads.
    • Enable it for faster transfer speeds, especially for large files or over long distances.
  4. Optimize object key names:
    • Avoid using sequential or timestamp-based object key names as it can lead to performance limitations.
    • Use a random or hashed naming pattern to distribute objects across multiple partitions for better performance.
  5. Leverage S3 multipart upload:
    • For large file uploads (typically over 100 MB), use multipart upload to improve performance and resiliency.
    • Multipart upload allows parallelization of upload parts and supports resumable uploads.
  6. Enable S3 byte-range fetches:
    • For applications that need to retrieve partial objects, use byte-range fetches to fetch only the required portions of an object.
    • This reduces the amount of data transferred, improving performance and reducing costs.
  7. Enable S3 Transfer Manager:
    • Use the AWS SDK's S3 Transfer Manager to optimize file transfers by parallelizing the upload or download of multiple parts.
  8. Utilize S3 Select and Glacier Select:
    • S3 Select allows you to retrieve only the required data from objects using SQL-like queries, reducing data transfer and processing time.
    • Glacier Select enables data retrieval from S3 Glacier archives based on specific query criteria, reducing retrieval times.
  9. Enable S3 Cross-Region Replication (CRR):
    • If you have users or applications in different regions, enable S3 CRR to replicate data across regions for faster access.
  10. Monitor and optimize bucket performance:
    • Monitor S3 metrics, such as request latency, request rates, and data transfer rates, to identify performance bottlenecks.
    • Consider adjusting bucket configurations, such as increasing provisioned throughput, to optimize performance.