An organization currently runs a large Hadoop environment in its data center and is building an
alternative Hadoop environment on AWS using Amazon EMR. The organization generates around 20 TB of data
per month. Each month, the files must be grouped and copied to Amazon S3 for use by the Amazon EMR
environment, and the data must be copied to multiple S3 buckets spread across several AWS accounts. A
10 Gbps AWS Direct Connect connection is in place between the data center and AWS, and the network team has
agreed to allocate 50% of the Direct Connect bandwidth to the data transfer. The transfer cannot take more
than two days. What would be the MOST efficient approach to transfer the data to AWS on a monthly basis?
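Before weighing answer options, it helps to verify that the allocated bandwidth can move the data inside the window at all. Below is a minimal back-of-the-envelope sketch, assuming 20 TB means 20 x 10^12 bytes and that the full 5 Gbps allocation (50% of 10 Gbps) is achievable as sustained throughput:

```python
# Feasibility check for the monthly 20 TB transfer over Direct Connect.
DATA_BYTES = 20 * 10**12          # 20 TB per month (decimal TB assumed)
LINK_BPS = 10 * 10**9             # 10 Gbps Direct Connect
ALLOCATED_BPS = LINK_BPS * 0.5    # 50% of the link allocated to the transfer

transfer_seconds = (DATA_BYTES * 8) / ALLOCATED_BPS   # bytes -> bits
transfer_hours = transfer_seconds / 3600

print(f"Transfer time at 5 Gbps: {transfer_hours:.1f} hours")
# -> about 8.9 hours, comfortably inside the two-day window
```

Since bandwidth is not the constraint, the efficiency question comes down to tooling for grouping files and fanning out to multiple buckets; S3DistCp, for example, can group small files (via its --groupBy option) and copy them from HDFS to S3 in a distributed fashion.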
An organization is developing a mobile social application and needs to collect logs from every device on
which it is installed. The organization is evaluating Amazon Kinesis Data Streams to push the logs and
Amazon EMR to process the data. They want to store the data on HDFS, using the default replication factor
to replicate data across the cluster, but they are concerned about the durability of the data. The
application currently produces 300 GB of raw data daily, with additional spikes during special events, and
the Amazon EMR cluster will need to scale out to match increases in streamed data. Which solution prevents
data loss and matches compute demand?
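The scaling half of this scenario can be made concrete. The sketch below is one illustrative approach rather than the question's required answer: it attaches an EMR managed scaling policy via boto3 so the cluster resizes with demand. The cluster ID, region, and capacity limits are hypothetical placeholders; the durability concern is typically addressed separately by persisting the raw stream to Amazon S3 (e.g., via EMRFS) rather than relying on HDFS replication alone.

```python
# Minimal sketch: enable EMR managed scaling on an existing cluster so
# compute tracks the streamed data volume. Values below are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,   # steady state for ~300 GB/day
            "MaximumCapacityUnits": 20,  # headroom for event-driven spikes
        }
    },
)
```

With a policy like this in place, EMR adds and removes instances between the configured limits based on cluster load, which addresses the "matches compute demand" requirement; preventing data loss still hinges on where the data durably lands before and after processing.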