I am creating an EMR cluster (a single EC2 instance) like so:
aws2 emr create-cluster \
--name "Spark cluster with step" \
--release-label emr-5.24.1 \
--applications Name=Spark \
--log-uri s3://boris-log-bucket/logs/ \
--ec2-attributes KeyName=boris-aws,EmrManagedMasterSecurityGroup=sg-...,EmrManagedSlaveSecurityGroup=sg-... \
--instance-type m5.xlarge \
--instance-count 1 \
--bootstrap-actions Path=s3://boris-set-up/bootstrap_file2.sh \
--steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn] \
--use-default-roles \
--no-auto-terminate
The contents of bootstrap_file2.sh include:
aws s3 sync s3://boris-scripts/ /home/hadoop/
This copies a Python script onto the EC2 instance.
Now I want something along the lines of "once the bootstrap finishes, run the Python script".
I've tried two things:
1) I added the line sudo python3 my_script.py to the bootstrap bash script, and it fails.
2) I modified the steps to be --steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://boris-scripts/my_script.py] and that also fails.
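To make the second attempt easier to read, here is a minimal sketch that builds the same --steps value from named parts and prints it. The variable names (script_uri, step_args, steps) are only illustrative and are not AWS CLI options. The sketch prints the string and submits nothing to EMR.

```shell
#!/bin/bash
# Sketch only: assembles the --steps value from attempt 2 so each field is visible.
# script_uri, step_args and steps are illustrative names, not AWS CLI options.

# S3 location of the script, appended as the last spark-submit argument.
script_uri="s3://boris-scripts/my_script.py"

# Comma-separated arguments that end up inside Args=[...].
step_args="--deploy-mode,cluster,--master,yarn,${script_uri}"

# Full shorthand value; the shell strips the quotes around "Spark job" in the
# inline form, so the CLI receives the name with a plain space.
steps="Type=Spark,Name=Spark job,ActionOnFailure=CONTINUE,Args=[${step_args}]"

# Print the single argument that would follow --steps.
printf '%s\n' "$steps"
```

Passing it as --steps "$steps" should hand the CLI the same single argument as typing the value inline, assuming no file in the current directory matches the unquoted brackets in the inline form.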
My script, my_script.py, tries to grab data from Wikipedia and write it to /home/hadoop/ as a CSV file.