Amazon S3¶
- object-based storage
- maximum object size is 5 TB
- a single PUT is limited to 5 GB; use multipart upload for anything larger
- objects also carry metadata, tags, and an optional version ID
- private by default
- strong read-after-write consistency
- 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
S3 vs EBS¶
Requester Pays¶
- the requester pays the request and data transfer costs; the bucket owner still pays for storage
- use it to share large data sets with other accounts
- requesters must be authenticated AWS identities (no anonymous access)
S3 event notifications¶
- trigger an action (e.g. a Lambda function) on API requests to the bucket (see the sketch after the targets list below)
- can also use EventBridge for more targets and filtering options
- EventBridge capabilities - archive and replay events, reliable delivery
- multiple destinations - Step Functions, Kinesis Streams / Firehose
Targets¶
- Lambda
- SNS
- SQS
- EventBridge
- delivery can occasionally take a minute or longer
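A minimal sketch of wiring the Lambda target above with the CLI; the bucket name and function ARN are hypothetical, and the function's resource policy must separately allow S3 to invoke it:

```bash
# Invoke a Lambda function for every .jpg object created in the bucket
aws s3api put-bucket-notification-configuration \
  --bucket my-example-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:process-upload",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".jpg"}]}}
    }]
  }'
```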
S3 Access Logs¶
- monitor traffic requests to an S3 bucket
- can be used for access and security audits
- understand your S3 bill
MFA-protected API Access¶
- enforce MFA for access to S3 resources
MFA delete¶
- requires MFA to permanently delete an object version
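A sketch of turning on MFA Delete alongside versioning, assuming a hypothetical bucket and MFA device; note this call must be made with the bucket owner's root credentials:

```bash
# "--mfa" is the device serial/ARN followed by the current 6-digit code
aws s3api put-bucket-versioning \
  --bucket my-example-bucket \
  --versioning-configuration Status=Enabled,MFADelete=Enabled \
  --mfa "arn:aws:iam::111122223333:mfa/root-account-mfa-device 123456"
```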
Versioning¶
- once object versioning is enabled, it can't be disabled - only suspended
- when versioning is enabled, each object has a key (its name) and a unique, S3-generated version ID
- by default a delete issues a "delete marker" (a soft delete that hides all previous versions)
- when object versioning is suspended, AWS still charges you for all the object versions previously generated
- if an object is updated, the old version still persists within the bucket
- if an object is deleted, the new current version is a delete marker, and the old versions still persist within the bucket
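Enabling versioning and inspecting versions from the CLI might look like this; the bucket name and prefix are placeholders:

```bash
# Turn versioning on (from then on it can only be suspended, not disabled)
aws s3api put-bucket-versioning \
  --bucket my-example-bucket \
  --versioning-configuration Status=Enabled

# List every version and delete marker under a prefix
aws s3api list-object-versions --bucket my-example-bucket --prefix reports/
```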
Commands¶
S3 Sync¶
- copies objects between buckets
- performs a diff and copies only new or changed objects
- uses size and last-modified time to decide what to update
- on versioned buckets it copies only the current version
aws s3 sync s3://DOC-EXAMPLE-BUCKET-SOURCE s3://DOC-EXAMPLE-BUCKET-TARGET
Storage classes¶
- objects can transition to the Infrequent Access tiers only after 30 days in S3 Standard
- all classes offer 11 9's durability; availability varies by class (S3 Standard is 99.99%)
- One Zone-IA is the notable outlier at 99.5% availability (still 11 9's durability, but in a single AZ)
| Storage Class | Lifecycle Transitions | Use Cases | Minimum Capacity Charge | Minimum Storage Duration Charge | Retrieval Fee | First Byte Latency | Availability Zones Supported |
|---|---|---|---|---|---|---|---|
| S3 Standard | all other S3 storage tiers | Frequently accessed data that requires low latency | None | None | None | Milliseconds | At least 3 |
| S3 Standard-IA | Intelligent-Tiering and colder tiers | Backups, disaster recovery, and infrequently accessed data | 128 KB | 30 days | Per GB | Milliseconds | At least 3 |
| S3 Intelligent-Tiering | One Zone-IA and colder tiers | Data with changing access patterns or unpredictable usage | None | 30 days | None | Milliseconds | At least 3 |
| S3 One Zone-IA | Glacier Flexible Retrieval and colder tiers | Storing secondary backups or replicas of easily reproduced data | 128 KB | 30 days | Per GB | Milliseconds | Single Availability Zone |
| S3 Glacier (legacy name for Flexible Retrieval) | Glacier Flexible Retrieval and colder tiers | Archiving rarely accessed data with long-term retention | 40 KB | 90 days | Per GB | Minutes to hours | At least 3 |
| S3 Glacier Instant Retrieval | Glacier Flexible Retrieval and colder tiers | Rapid access to archived data at a slightly higher cost | 128 KB | 90 days | Per GB | Milliseconds | At least 3 |
| S3 Glacier Flexible Retrieval | Glacier Deep Archive | Bulk retrieval of large amounts of data at lower cost | 40 KB | 90 days | Per GB | Minutes to hours | At least 3 |
| S3 Glacier Deep Archive | No transitions | Long-term data retention with extremely low-cost requirements | 40 KB | 180 days | Per GB | 12 to 48 hours | At least 3 |
| S3 Outposts | No transitions | Hybrid cloud storage extending S3 to on-premises infrastructure | None | None | None | Milliseconds | Depends on the Outpost configuration |
Storage Class | Use Cases | Ideal Use Case Scenario |
---|---|---|
S3 Standard | Frequently accessed, important, non-replaceable data | Dynamic website content, mobile applications, real-time analytics, content distribution |
S3 Intelligent-Tiering | Long-lived data with changing or unknown access patterns | Data with varying access frequencies, cost optimization for unknown access patterns |
S3 Standard-IA | Long-lived, infrequently accessed data | Long-term storage of infrequently accessed data, compliance and regulatory data |
S3 One Zone-IA | Long-lived, infrequently accessed, non-critical, replaceable data | Secondary copies of on-premises data, transitional data or short-term storage |
S3 Glacier Instant Retrieval | Long-lived data, rarely accessed (e.g. once per quarter) | Rapid retrieval of archived data for urgent access or frequent retrieval |
S3 Glacier Flexible Retrieval | Archival data where frequent or real-time access isn't needed (e.g. yearly) | Bulk retrieval of large datasets, data migration, data analysis |
S3 Glacier Deep Archive | Archival data that rarely if ever needs to be accessed (e.g. legal or regulatory data storage) | Long-term data archiving, digital preservation, compliance data |
S3 Outposts | Hybrid cloud storage extending S3 to on-premises infrastructure | Hybrid cloud environments, applications with data residency requirements |
S3 Standard¶
- The default storage class in S3, and a sensible default for most workloads.
- S3 Standard is region resilient, and can tolerate the failure of an AZ.
- can transition to all s3 tiers.
- Objects are replicated to at least 3 AZs when they are uploaded.
- 99.999999999% durability
- 99.99% availability
- Offers low latency and high throughput.
- No minimums, delays, or penalties.
- Billing is storage fee, data transfer fee, and request based charge.
All of the other storage classes trade away some of these properties in exchange for lower cost.
S3 Standard-IA¶
Designed for data that isn't accessed often: long-term storage, backups, and disaster recovery files, where keeping the data safe matters most.
- Designed for less frequent rapid access when it is needed.
- can transition to:
- S3 Intelligent Tiering
- One Zone IA
- Glacier Instant retrieval
- Glacier Flexible retrieval
- Glacier Deep Archive.
- Cheaper rate to store data you will rarely need, but if you do need it, you need it quickly.
- ~54% cheaper than S3 standard.
- Minimum billable object size of 128 KB.
- Cost benefits might be negated for smaller objects.
- 30 days minimum duration charge per object.
- Retrieval fee for every GB of data retrieved from this class.
- 99.9% availability, slightly lower than standard S3.
One Zone-IA¶
Great choice for secondary copies of primary data or backup copies.
If data is easily creatable from a primary data set, this would be a great place to store the output from another data set.
- Designed for data that is accessed less frequently but needed quickly.
- can transition to Glacier Flexible retrieval & Glacier Deep Archive.
- 80% of the base cost of Standard-IA.
- Same minimum size and duration fee as Standard-IA
- Data is only stored in a single AZ, no 3+ AZ replication.
- 99.5% availability, lower than Standard-IA
S3 Glacier Instant Retrieval¶
Suitable for applications that require infrequent access to archived data but need fast retrieval times when accessed.
- Like S3 Standard-IA: cheaper storage, more expensive retrieval, longer minimum duration.
- can transition to Glacier Flexible retrieval & Glacier Deep Archive.
- Has per GB data retrieval fee, cost increases with frequent data access.
- should be used for long-lived data accessed about once per quarter that still needs millisecond access.
- Minimum duration charge of 90 days; objects can be stored for less, but the minimum billing always applies.
- Minimum capacity charge of 128 KB per object.
S3 Glacier Flexible Retrieval¶
Archival data where frequent or real-time access isn't needed (e.g. yearly). Minutes-to-hours retrieval.
- can transition to Glacier Deep archive.
- faster data retrieval than Deep Archive, with retrieval times starting at minutes.
- Minimum duration charge of 90 days; objects can be stored for less, but the minimum billing always applies.
- Minimum capacity charge of 40 KB per object.
- objects cannot be made publicly accessible
- any access of data (beyond object metadata) requires a retrieval process.
- The retrieval process first restores the data to S3 Standard-IA temporarily.
- Retrieval Process offers three retrieval options which vary in speeds:
- Expedited: Data retrieval (1-5 mins)
- Standard: (3-5 hours)
- Bulk: (5-12 hours)
- Faster = more expensive $$$
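A sketch of kicking off one of those retrievals with the CLI, using a hypothetical bucket and key; the restored copy is kept for the requested number of days:

```bash
# Start a temporary restore; Tier is Expedited, Standard, or Bulk
aws s3api restore-object \
  --bucket my-example-bucket \
  --key archive/2020-logs.tar.gz \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Expedited"}}'

# The Restore header on head-object shows whether the restore has finished
aws s3api head-object --bucket my-example-bucket --key archive/2020-logs.tar.gz
```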
Provisioned Retrieval Capacity¶
- guarantees Expedited retrieval capacity, with up to 150 MB/s of retrieval throughput
Glacier Deep Archive¶
Archival data that rarely if ever needs to be accessed - hours or days for retrieval.
- The retrieval process first restores the data to S3 Standard-IA temporarily.
- The retrieval process offers two retrieval options which vary in speed:
- Standard: data retrieval within 12 hours
- Bulk: up to 48 hours
- First byte latency = hours or days
- Objects cannot be made publicly accessible
- any access of data (beyond object metadata) requires a retrieval process.
Lifecycle Rules¶
Transition Actions¶
- move objects to a different storage tier after a set time
- can transition to Infrequent Access only after 30 days
- Automates the moving of objects between the different storage tiers.
- Can be used in conjunction with versioning.
- Lifecycle rules can be applied to both current and previous versions of an object.
These actions can be classified as follows:
- Transition actions – define when objects transition to another storage class. For example, you may choose to transition objects to the STANDARD_IA storage class 30 days after creation, or archive objects to the GLACIER storage class one year after creation.
- Expiration actions – specify when objects expire; Amazon S3 then deletes the expired objects on your behalf.
Transfer Limitations for Lifecycle Rules¶
Expiration Actions¶
- delete access logs
- delete old versions of files
- delete incomplete multipart uploads
- can apply to specific prefixes (paths) or the full bucket
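A sketch combining a transition and an expiration action in one lifecycle rule; the bucket name, prefix, and day counts are illustrative only:

```bash
cat > lifecycle.json <<'EOF'
{
  "Rules": [{
    "ID": "archive-then-expire-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 365, "StorageClass": "GLACIER"}
    ],
    "Expiration": {"Days": 730},
    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7}
  }]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-example-bucket \
  --lifecycle-configuration file://lifecycle.json
```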
Transfer Acceleration¶
- reqs: acceleration must be enabled on the bucket; the bucket name can't contain periods and needs to be DNS compatible
- fast transfer over long distances
- uses CloudFront edge locations
- routes via an optimized network path
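Enabling acceleration and using it for a copy might look like this (bucket name hypothetical; the accelerate endpoint is the documented global one):

```bash
# One-time: enable acceleration on the bucket
aws s3api put-bucket-accelerate-configuration \
  --bucket my-example-bucket \
  --accelerate-configuration Status=Enabled

# Route a transfer through the accelerate endpoint
aws s3 cp ./big-file.bin s3://my-example-bucket/ \
  --endpoint-url https://s3-accelerate.amazonaws.com
```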
Multipart Uploads¶
- transmit an object in chunks
- if one chunk fails, only that chunk needs to be retransmitted
- should be considered for objects over 100 MB
- maximum of 10,000 parts; each part can range from 5 MB to 5 GB
- the last part can be smaller than 5 MB
- parts can fail and be restarted.
- improved throughput
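The high-level `aws s3` commands already perform multipart uploads automatically; a sketch of tuning when and how they split (threshold and chunk size below are illustrative):

```bash
# Split any upload bigger than 100 MB into 100 MB parts
aws configure set default.s3.multipart_threshold 100MB
aws configure set default.s3.multipart_chunksize 100MB

# This cp now runs as a parallel multipart upload under the hood
aws s3 cp ./video.mp4 s3://my-example-bucket/uploads/video.mp4
```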
S3 Bucket Policies¶
- allow or deny permissions on a bucket and its objects
- bucket policies attach to buckets (identity policies attach to users, groups, and roles)
- can grant access to other AWS accounts
- can restrict based on multiple request conditions (source IP, time of day, SSL)
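A sketch of a bucket policy combining cross-account access with an SSL condition; the account ID and bucket name are placeholders:

```bash
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowCrossAccountReadOverTLS",
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::444455556666:root"},
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-example-bucket/*",
    "Condition": {"Bool": {"aws:SecureTransport": "true"}}
  }]
}
EOF
aws s3api put-bucket-policy --bucket my-example-bucket --policy file://policy.json
```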
ACL¶
- can also use ACL to grant another account access
S3 Replication¶
There are two types of S3 replication available.
- Cross-Region Replication (CRR)
- Allows the replication of objects from a source bucket to a destination bucket in different AWS regions.
- Same-Region Replication (SRR)
- Allows the replication of objects from a source bucket to a destination bucket in the same AWS region.
Why use replication¶
- SRR use cases: log aggregation; syncing production and test accounts
- CRR use cases: resilience with strict sovereignty requirements; global resilience improvements; latency reduction
- versioning must be enabled on both buckets
- can be cross-region or same-region
- the destination can be a bucket in another account
- replication is asynchronous
- use Batch Replication to replicate objects that already existed before the rule was created
- replication can't be chained (a replica is not replicated onward)
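A minimal replication rule as a sketch; the IAM role, source bucket, and destination bucket ARN are hypothetical, and versioning must already be enabled on both sides:

```bash
cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-everything",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": {"Status": "Disabled"},
    "Destination": {"Bucket": "arn:aws:s3:::my-example-bucket-replica"}
  }]
}
EOF
aws s3api put-bucket-replication \
  --bucket my-example-bucket \
  --replication-configuration file://replication.json
```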
Encryption¶
Buckets aren't encrypted, objects are. Objects in the same bucket can each use a different encryption method.
Client-Side encryption¶
- Objects are encrypted by the client before they leave the client.
- Data is ciphertext the entire time it is in transit.
- AWS has no way to see into the data.
- The encryption burden is on the customer, not AWS.
Server-Side encryption¶
- Data is encrypted in transit using HTTPS
- Data inside the tunnel is still in its original unencrypted form.
- Data reaches S3 server in plain text form.
- After S3 sees the data, it is then encrypted.
- AWS will handle some or all of these processes.
Encryption Option | Use Case | Customer Responsibilities | AWS Responsibilities | Tradeoffs |
---|---|---|---|---|
SSE-S3 (default) | Encryption at rest using S3-managed keys; no admin overhead | No additional key management tasks required | Encryption, decryption, key management (generation and rotation) | Limited control over the encryption keys; keys are managed by AWS |
SSE-KMS | Encryption at rest using AWS KMS managed keys | Configure and manage the KMS keys | Encryption, decryption | More control and additional security features through KMS; customer manages and controls the KMS keys |
SSE-C (customer-provided keys) | Flexible; customer generates and rotates keys themselves but doesn't want to handle encryption and decryption client-side | Generate, manage, and rotate the encryption keys | Encryption, decryption | Offloads CPU usage from the client; complete control and ownership of key management; good for heavily regulated environments |
Server-Side Encryption with Amazon S3 Managed Keys - SSE-S3 (AES-256) - default¶
- AWS managed keys
- server-side encryption of objects
- fully managed by AWS
- applied by default; encryption is preserved on replication
SSE-S3 Caveats¶
- Not good for regulatory environment where keys and access must be controlled.
- No control over key material or rotation.
- No role separation.
- A full S3 admin can decrypt data and open objects.
SSE-C¶
- Customer is responsible for the keys themselves.
- Customer still needs to generate and manage the key.
- can use same key for all objects or individual keys for every single object.
- The S3 service manages the actual encryption and decryption.
- Offloads CPU requirements for encryption.
- S3 will see the unencrypted object throughout this process.
SSE-KMS¶
- you must specify the KMS key to use when writing objects (e.g. when copying into a new bucket)
- an IAM role is needed that can decrypt with the source key and encrypt with the new key
- heavy use can hit KMS throttling limits; request a Service Quotas increase if needed
- can use multi region kms keys
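Uploading a single object under a specific KMS key is one flag pair on `aws s3 cp`; the key alias is a placeholder:

```bash
aws s3 cp ./report.pdf s3://my-example-bucket/reports/report.pdf \
  --sse aws:kms --sse-kms-key-id alias/my-app-key
```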
Bucket keys¶
- offloads some of the KMS work (fewer KMS calls, lower cost) when used in conjunction with KMS
- CloudTrail KMS events now show the bucket, not the object
- works with replication - the object's encryption is maintained
- if plaintext is replicated to a bucket using bucket keys, the object is encrypted at the destination side (the ETag changes)
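A sketch of enabling a bucket key as part of default SSE-KMS encryption (bucket name and key alias hypothetical):

```bash
aws s3api put-bucket-encryption \
  --bucket my-example-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "alias/my-app-key"
      },
      "BucketKeyEnabled": true
    }]
  }'
```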
Bucket Policies¶
- common use cases:
- grant access to other AWS accounts
- anonymous access to a bucket.
- identity policies are limited to identities in a single account (the bucket owner's)
- resource policies can allow/deny the same or different accounts
- resource policies can allow/deny anonymous principals
Security Hints¶
Hints for choosing one over the other, based on the scenario:
- Identity: controlling many different resources
- Identity: you have a preference for IAM
- Identity: same-account access
- Bucket: when you need to control security only for S3
- Bucket: anonymous or cross-account access
- ACLs: NEVER - unless you must
Use cases¶
- compliance
- latency
- replication across accounts
- log aggregation
- sync data to test account from prod
Performance¶
- 100-200 ms first-byte latency
- 3,500 PUT/COPY/POST/DELETE requests per second per prefix
- 5,500 GET/HEAD requests per second per prefix
- use multipart upload for parallel uploads
- recommended for files > 100 MB
- MUST be used for files > 5 GB
- use S3 Transfer Acceleration to upload to an edge location instead of directly to S3; also useful for downloads when files are bigger than 1 GB
- byte range fetches
- parallelize GETs by requesting specific byte ranges.
- better resilience in case of failures - only the failed byte range needs to be retried
- speeds up downloads
- can be used to retrieve only partial data (e.g. the head of a file)
- effectively a multipart download
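A byte-range fetch from the CLI as a sketch; bucket, key, and range are illustrative:

```bash
# Fetch only the first 1 KiB of an object (e.g. a file header)
aws s3api get-object \
  --bucket my-example-bucket \
  --key data/large.csv \
  --range "bytes=0-1023" \
  ./head.csv
```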
Select & Glacier Select¶
- Retrieve less data using SQL by performing server-side filtering
- can filter by rows & columns (simple SQL statements)
- less network transfer
- less CPU cost client-side.
- filter data server-side to reduce network cost (e.g. return 100 rows out of a 1-million-row CSV)
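A sketch of a server-side filter with S3 Select over a CSV object; the bucket, key, and SQL are illustrative:

```bash
aws s3api select-object-content \
  --bucket my-example-bucket \
  --key data/large.csv \
  --expression "SELECT s.id, s.amount FROM S3Object s WHERE s.region = 'eu-west-1'" \
  --expression-type SQL \
  --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
  --output-serialization '{"CSV": {}}' \
  ./filtered.csv
```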
Batch operations¶
- modify object metadata and properties in bulk
- copy objects between buckets
- encrypt all unencrypted objects
- modify ACL and tags
- restore objects from glacier
- invoke lambda for each object
- use S3 Select to filter the object list, then run the batch operation
- S3 Batch operations manages:
- retries
- tracks progress
- sends completion notifications
- generate reports
Vault¶
- objects cannot be modified or deleted
- used for compliance
- Can lock the vault policy from future edits
Server Access Logging¶
- use this when CloudTrail does not provide enough detail
Features¶
- logs include the HTTP Referer header
- logs include turn-around time
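Turning access logging on is a single call; the bucket names are placeholders, and the target bucket must allow the S3 logging service to write to it:

```bash
aws s3api put-bucket-logging \
  --bucket my-example-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-example-log-bucket",
      "TargetPrefix": "access-logs/my-example-bucket/"
    }
  }'
```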
Pre-Signed URLs¶
- URL Expiration
- S3 Console - 1 min to 720 mins (12 hours)
- AWS CLI - up to 604,800 secs (168 hours); default 3,600
- users given a pre-signed URL inherit the permissions of the user that generated the URL for GET / PUT
- use cases
- Allow only logged-in users to download a premium video from your S3 bucket.
- allow an ever-changing list of users to download files by generating URLs dynamically
- temporarily allow a user to upload a file to a precise location in your S3 bucket
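Generating a pre-signed GET URL from the CLI (bucket and key hypothetical):

```bash
# URL valid for one hour; default is 3600 seconds, max 604800 (7 days)
aws s3 presign s3://my-example-bucket/premium/video.mp4 --expires-in 3600
```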
Access Points (AP)¶
- simplify security management for s3 buckets
- each AP has
- its own DNS name (Internet origin or VPC origin)
- an access point policy (similar to a bucket policy)
- manage security at scale
Best practices¶
- Use the appropriate AWS Region:
- Choose the AWS Region closest to your users or application to minimize latency and improve performance.
- Select the right S3 storage class:
- Choose the appropriate S3 storage class based on your data access patterns and requirements.
- For frequently accessed data, use S3 Standard or S3 Intelligent-Tiering.
- For infrequently accessed or long-term archival data, consider S3 Standard-IA, S3 One Zone-IA, or S3 Glacier.
- Enable S3 Transfer Acceleration:
- S3 Transfer Acceleration uses optimized edge locations to speed up data uploads and downloads.
- Enable it for faster transfer speeds, especially for large files or over long distances.
- Optimize object key names:
- Avoid using sequential or timestamp-based object key names as it can lead to performance limitations.
- Use a random or hashed naming pattern to distribute objects across multiple partitions for better performance.
- Leverage S3 multipart upload:
- For large file uploads (typically over 100 MB), use multipart upload to improve performance and resiliency.
- Multipart upload allows parallelization of upload parts and supports resumable uploads.
- Enable S3 byte-range fetches:
- For applications that need to retrieve partial objects, use byte-range fetches to fetch only the required portions of an object.
- This reduces the amount of data transferred, improving performance and reducing costs.
- Enable S3 Transfer Manager:
- Use the AWS SDK's S3 Transfer Manager to optimize file transfers by parallelizing the upload or download of multiple parts.
- Utilize S3 Select and Glacier Select:
- S3 Select allows you to retrieve only the required data from objects using SQL-like queries, reducing data transfer and processing time.
- Glacier Select enables data retrieval from S3 Glacier archives based on specific query criteria, reducing retrieval times.
- Enable S3 Cross-Region Replication (CRR):
- If you have users or applications in different regions, enable S3 CRR to replicate data across regions for faster access.
- Monitor and optimize bucket performance:
- Monitor S3 metrics, such as request latency, request rates, and data transfer rates, to identify performance bottlenecks.
- Consider adjusting your configuration, such as spreading requests across more key prefixes, to optimize performance.