I'm trying to deploy a Flask/Python app using AWS Elastic Beanstalk and getting a '500 internal server' error resulting from a missing resource. The app works locally, but one of the backend components can't find a resource it needs when running on the EC2 instance that Elastic Beanstalk is managing.
I am using the Natural Language Toolkit, which I include in my requirements.txt file so that it is installed as a pip package. The nltk package install seems to have been successful, as I'm not getting an error on the line:
import nltk
The line I am getting the error on in my application code is:
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
The error I get at the end of my log is:
[Wed Feb 14 22:17:10.731016 2018] [:error] [pid 13894] [remote 172.31.0.22:252] Resource \x1b[93mpunkt\x1b[0m not found.
[Wed Feb 14 22:17:10.731018 2018] [:error] [pid 13894] [remote 172.31.0.22:252] Please use the NLTK Downloader to obtain the resource:
[Wed Feb 14 22:17:10.731020 2018] [:error] [pid 13894] [remote 172.31.0.22:252]
[Wed Feb 14 22:17:10.731023 2018] [:error] [pid 13894] [remote 172.31.0.22:252] \x1b[31m>>> import nltk
[Wed Feb 14 22:17:10.731025 2018] [:error] [pid 13894] [remote 172.31.0.22:252] >>> nltk.download('punkt')
[Wed Feb 14 22:17:10.731027 2018] [:error] [pid 13894] [remote 172.31.0.22:252] \x1b[0m
[Wed Feb 14 22:17:10.731029 2018] [:error] [pid 13894] [remote 172.31.0.22:252] Searched in:
[Wed Feb 14 22:17:10.731031 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/home/wsgi/nltk_data'
[Wed Feb 14 22:17:10.731034 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/usr/share/nltk_data'
[Wed Feb 14 22:17:10.731036 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/usr/local/share/nltk_data'
[Wed Feb 14 22:17:10.731038 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/usr/lib/nltk_data'
[Wed Feb 14 22:17:10.731040 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/usr/local/lib/nltk_data'
[Wed Feb 14 22:17:10.731043 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/opt/python/run/venv/nltk_data'
[Wed Feb 14 22:17:10.731045 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - '/opt/python/run/venv/lib/nltk_data'
[Wed Feb 14 22:17:10.731047 2018] [:error] [pid 13894] [remote 172.31.0.22:252] - ''
[Wed Feb 14 22:17:10.731049 2018] [:error] [pid 13894] [remote 172.31.0.22:252]
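To see how the Flask process itself views the directories in the "Searched in" list, a small diagnostic along these lines could be dropped into the app. This is only a sketch using the standard library: `inspect_search_paths` and `current_user` are names I made up for illustration, not part of NLTK or Elastic Beanstalk.

```python
import getpass
import os


def current_user():
    """Best-effort name of the user the current process runs as."""
    try:
        return getpass.getuser()
    except Exception:
        # Some minimal environments have no resolvable user name.
        return "unknown"


def inspect_search_paths(paths):
    """Return (path, exists, readable) for each candidate nltk_data directory."""
    report = []
    for path in paths:
        exists = os.path.isdir(path)
        # A directory needs read and execute permission to be searched.
        readable = exists and os.access(path, os.R_OK | os.X_OK)
        report.append((path, exists, readable))
    return report
```

Logging `current_user()` together with `inspect_search_paths(nltk.data.path)` from inside the application would show whether the web process runs as a different user than my SSH session (the `/home/wsgi/nltk_data` entry suggests it might) and which of the searched directories actually exist for it.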
When I added the line
nltk.download('punkt')
to my application to ensure that the resource I need would be downloaded, I got this message in the error log:
[Wed Feb 14 22:30:07.861273 2018] [:error] [pid 28765] [nltk_data] Downloading package punkt to /home/wsgi/nltk_data...
which is then followed by a series of errors that comes down to:
[Wed Feb 14 22:30:07.864521 2018] [:error] [pid 28765] [remote 172.31.0.22:55448] FileNotFoundError: [Errno 2] No such file or directory: '/home/wsgi/nltk_data'
So I SSH'd into my EC2 instance and, from the /opt/python/run directory, activated the virtual environment that my app appears to be running in, using
$ source venv/bin/activate
and opened up the python interpreter. When I ran
>>> import nltk
>>> nltk.download('punkt')
I got back
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data] Package punkt is already up-to-date!
True
So I also tried
>>> nltk.data.load('tokenizers/punkt/english.pickle')
and got back:
<nltk.tokenize.punkt.PunktSentenceTokenizer object at 0x7fb8afd34080>
So, it seems like the nltk package on my EC2 instance knows where the nltk_data resource is as long as it's not being asked by my Flask application. I also tried entering
>>> nltk.data.path.append('home/ec2-user/nltk_data')
and still got the same error as I posted above, with no indication that my attempt to append to the list of paths searched for nltk_data had taken effect.
I am not sure what I need to do to get nltk to find the nltk_data resource it is looking for.
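For concreteness, here is a sketch of the kind of thing I would expect to work inside the application, though I have not confirmed it on the instance. It assumes the app can write to whatever directory is passed in, that the path should be absolute (the one I typed above has no leading slash), and that `nltk.download` is given an explicit `download_dir` instead of the default home directory. `register_data_dir` and `load_punkt` are hypothetical helper names of my own.

```python
import os


def register_data_dir(search_path, data_dir):
    """Append an absolute data_dir to search_path once; return the absolute path."""
    absolute = os.path.abspath(data_dir)
    if absolute not in search_path:
        search_path.append(absolute)
    return absolute


def load_punkt(data_dir):
    """Fetch punkt into data_dir if it is missing, then load the English tokenizer."""
    import nltk  # imported lazily so register_data_dir works without NLTK

    os.makedirs(data_dir, exist_ok=True)
    target = register_data_dir(nltk.data.path, data_dir)
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        # An explicit download_dir avoids the default ~/nltk_data location,
        # which did not exist for the web process in the log above.
        nltk.download("punkt", download_dir=target)
    return nltk.data.load("tokenizers/punkt/english.pickle")
```

The idea would be to call `load_punkt` once at startup with a directory the web process owns, rather than appending to `nltk.data.path` from a separate interpreter session, since a change made in my SSH session would not carry over to the running application.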
I have seen .ebextensions mentioned in reference to dependency issues and tried to read the AWS page about it, but am not sure exactly how it fits into the issue occurring with my application. Probably a learning-curve web dev literacy issue on my end.
Thanks for any clarity that can be provided regarding this situation!