I want to run an HDFS cluster on AWS to store data that will be processed by my custom application running on EC2 instances. AWS EMR is the only way I could find to create an HDFS cluster on AWS. There are tutorials on the web for creating an HDFS cluster from plain EC2 instances, but if I use EC2 instances, I run the risk of losing the data when I shut down the instances.
What I need is:
1. An HDFS cluster that can be shut down when not in use.
2. When shut down, data should remain persisted.
One suggested solution is to keep my data in an S3 bucket and load it into HDFS every time I start the EMR cluster. However, this is repetitive and adds a large overhead, especially when the dataset is huge.
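To make the reload described above concrete, here is a minimal sketch of the kind of step I would have to submit on every cluster start. It assumes the copy is done with EMR's `s3-dist-cp` tool run through `command-runner.jar`; the function name, bucket name, and paths are hypothetical placeholders, not my real setup.

```python
def build_s3_to_hdfs_step(src_s3_uri, dest_hdfs_path):
    """Build an EMR step definition that copies data from S3 into HDFS.

    Assumption: the cluster provides s3-dist-cp via command-runner.jar.
    """
    return {
        "Name": "Load data from S3 into HDFS",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src_s3_uri, "--dest", dest_hdfs_path],
        },
    }


# Hypothetical bucket and HDFS path, for illustration only.
step = build_s3_to_hdfs_step("s3://my-example-bucket/input/", "hdfs:///data/input/")
print(step["HadoopJarStep"]["Args"])
```

A dict like this could presumably be passed to boto3's `add_job_flow_steps(JobFlowId=..., Steps=[step])` on an EMR client, and the full copy would then run again on each start, which is exactly the overhead I want to avoid.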
In GCP, I used a Dataproc cluster, which satisfied both of the above criteria. Shutting down the cluster at least saved the cost of the VMs, and I only paid for storage while the HDFS cluster was not in use. I am wondering if there is a similar way to do this in AWS.