AWS Lambda for extracting PDF content to S3
- Download the PDF to the /tmp folder, since /tmp is the only writable location in the AWS Lambda file system
- Use the file name as the folder name and rename the file to index.pdf as a convention
- Run poppler to:
  - extract embedded images with the pattern: index-#pagenumber_#imagenumber.jpg
  - extract text and coordinates to: index.xml
  - convert the PDF to 1600px JPG with the pattern: jpeg-1600-page-#pagenumber.jpg
  - convert the PDF to PPM with the pattern: ppm-page-#pagenumber.ppm
- Upload to the predefined S3 bucket - see Configuration
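The steps above could be sketched roughly as follows. This is only an illustration: the source does not name the poppler utilities it runs, so `pdftohtml`, `pdftocairo`, and `pdftoppm` are assumptions chosen because their output naming matches the patterns listed. The helper names, and dropping the `.pdf` extension from the folder name, are also assumptions.

```python
import os
from urllib.parse import urlparse


def work_dir_for(url, tmp_root="/tmp"):
    """Return a working directory under /tmp named after the PDF's file name.

    Assumption: the extension is dropped from the folder name.
    """
    file_name = os.path.basename(urlparse(url).path)
    folder = os.path.splitext(file_name)[0]
    return os.path.join(tmp_root, folder)


def poppler_commands(work_dir):
    """Build one argv list per extraction step (nothing is executed here)."""
    pdf = os.path.join(work_dir, "index.pdf")
    return [
        # Text with coordinates to index.xml, plus embedded images
        ["pdftohtml", "-xml", pdf, os.path.join(work_dir, "index")],
        # One JPEG per page, long side scaled to 1600 px
        ["pdftocairo", "-jpeg", "-scale-to", "1600", pdf,
         os.path.join(work_dir, "jpeg-1600-page")],
        # One PPM per page
        ["pdftoppm", pdf, os.path.join(work_dir, "ppm-page")],
    ]
```

Each list can then be passed to `subprocess.run(cmd, check=True)` once index.pdf has been downloaded into the working directory.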
The parameters below can be passed as a query string through API Gateway.
Required. The URL of the PDF to download.
Optional. The destination path to store the output in your bucket. Defaults to the URL path.
Optional. The DPI of the output JPEG. Default is 150 for a smaller web size.
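A minimal sketch of how these three parameters and their defaults might be read. The source does not give the parameter names, so `url`, `dest`, and `dpi` are hypothetical, as is the `read_params` helper. Stripping the leading slash from the URL path is also an assumption.

```python
from urllib.parse import urlparse

DEFAULT_DPI = 150  # default stated in the parameter description


def read_params(query):
    """Validate a plain dict of query-string values and apply defaults.

    The keys "url", "dest" and "dpi" are hypothetical names.
    """
    query = query or {}
    url = query.get("url")
    if not url:
        raise ValueError("url is required")
    # Assumption: the default destination is the URL path without its leading slash
    dest = query.get("dest") or urlparse(url).path.lstrip("/")
    dpi = int(query.get("dpi") or DEFAULT_DPI)
    return {"url": url, "dest": dest, "dpi": dpi}
```

How the query string reaches the function depends on the API Gateway integration, so the sketch takes an already extracted dict rather than a raw event.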
- Error message as string.
- JSON result: { path: the path for directory listing, files: array of all files }
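As an illustration of the JSON result shape, the sketch below lists the files produced in the working directory. The `build_result` helper and its arguments are hypothetical. Only the `path` and `files` keys come from the description above.

```python
import json
import os


def build_result(work_dir, dest):
    """Return the JSON result: the destination path and all files produced."""
    files = sorted(
        name for name in os.listdir(work_dir)
        if os.path.isfile(os.path.join(work_dir, name))
    )
    return json.dumps({"path": dest, "files": files})
```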
The Lambda function must have access to the destination S3 bucket.
The destination bucket is set through an environment variable, which can be configured in AWS API Gateway.
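A hedged sketch of the upload step under this configuration. The variable name `DEST_BUCKET` and the `upload_dir` helper are hypothetical, since the source does not name the variable. The client is passed in and is expected to behave like a boto3 S3 client, whose `upload_file(Filename, Bucket, Key)` method is used here.

```python
import os


def upload_dir(s3_client, work_dir, dest, bucket=None):
    """Upload every file in work_dir under the dest prefix; return the keys."""
    # DEST_BUCKET is a hypothetical environment variable name
    bucket = bucket or os.environ["DEST_BUCKET"]
    keys = []
    for name in sorted(os.listdir(work_dir)):
        local = os.path.join(work_dir, name)
        if not os.path.isfile(local):
            continue
        key = f"{dest.strip('/')}/{name}"
        s3_client.upload_file(local, bucket, key)
        keys.append(key)
    return keys
```

With boto3 available, a call might look like `upload_dir(boto3.client("s3"), "/tmp/report", "docs/report")`. The function's execution role would need permission to put objects into that bucket.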
- Start an AWS EC2 micro instance
- Follow the instructions at: https://github.com/pjfoley/poppler-for-lambda
- Edit build.sh and update the poppler version, for example from poppler-0.37.0.tar.xz to poppler-0.45.0.tar.xz or the latest release.
- Run the yum install command below to get the cairo, jpeg, and png libs before building.
sudo yum install cairo cairo-devel cairomm-devel libjpeg-turbo-devel pango pango-devel pangomm pangomm-devel libpng-devel
application/json -> Select -> Method Request passthrough