aws-samples / amazon-textract-enhancer

This workshop demonstrates how to build a document parser and query engine with Amazon Textract and other services, such as Elasticsearch and DynamoDB.

License: MIT No Attribution

Python 90.13% HTML 2.67% JavaScript 7.20%

amazon-textract-enhancer's Issues

Cloudformation deployment error? (Unable to access the S3 bucket for the original python code)

It has been a while since I last deployed this solution. I tried again recently and found that it can no longer be deployed: the Python code is not accessible. Could you please fix this? Thanks.

The stack fails when we change the region.

I get the error below: Error occurred while GetObject. S3 Error Code: PermanentRedirect. S3 Error Message: The bucket is in this region: us-east-1. Please use this region to retry the request (Service: AWSLambdaInternal; Status Code: 400; Error Code: InvalidParameterValueException; Request ID: 46f2bd08-7640-4156-bde3-32727516a887)

I need to create this stack in the us-east-2 (Ohio) region.

Any help is truly appreciated.
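The PermanentRedirect message above says that the bucket holding the Lambda code is in us-east-1 while the stack is being created elsewhere; Lambda expects the deployment package bucket to be in the same region as the function. A minimal sketch of a pre-deployment check for a bucket you own follows. The helper names `bucket_region` and `can_host_lambda_code` are hypothetical, and `s3_client` is assumed to be a boto3 S3 client.

```python
def bucket_region(s3_client, bucket):
    """Return the region a bucket lives in, using the S3 GetBucketLocation call."""
    location = s3_client.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    # S3 reports us-east-1 as an empty (null) LocationConstraint.
    return location or "us-east-1"


def can_host_lambda_code(s3_client, bucket, stack_region):
    """True when the code bucket is in the same region as the stack."""
    return bucket_region(s3_client, bucket) == stack_region
```

With boto3 installed, this could be called as `can_host_lambda_code(boto3.client("s3"), "my-code-bucket", "us-east-2")`, where the bucket name is a placeholder.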

KeyError: 'Blocks' bash: KeyError:: command not found

I am trying to extract tables from a PDF file and export them to a CSV file, but I run into this error:

blocks=response['Blocks']
[root@ip-172-31-72-49 centos]# KeyError: 'Blocks'

```python
import webbrowser, os
import json
import boto3
import io
from io import BytesIO
import sys
from pprint import pprint


def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}

                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows


def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text


def get_table_csv_results(file_name):

    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    # get the results
    client = boto3.client('textract')

    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        })

    # this is the line that raises KeyError: 'Blocks'
    blocks = response['Blocks']
    pprint(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index + 1)
        csv += '\n\n'

    return csv


def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)

    table_id = 'Table_' + str(table_index)

    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)

    for row_index, cols in rows.items():

        for col_index, text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'

    csv += '\n\n\n'
    return csv


def main(file_name):
    table_csv = get_table_csv_results(file_name)

    output_file = 'output.csv'

    with open(output_file, "wt") as fout:
        fout.write(table_csv)

    print('CSV OUTPUT FILE: ', output_file)


s3BucketName = "chrisyou.sagemi.com"
documentName = "DETAIL.pdf"

if __name__ == "__main__":
    file_name = sys.argv[1]
    main(file_name)
```
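A likely cause of the KeyError: `start_document_text_detection` is an asynchronous call whose response carries a `JobId` rather than `Blocks`, and text detection does not produce TABLE blocks in any case. The sketch below shows one way the results might be fetched instead, using `start_document_analysis` with the TABLES feature and polling `get_document_analysis`. The function name `fetch_analysis_blocks` and its `poll_seconds` and `sleep` parameters are hypothetical; `client` is assumed to be a boto3 Textract client.

```python
import time


def fetch_analysis_blocks(client, bucket, document, poll_seconds=5, sleep=time.sleep):
    """Start an asynchronous table analysis and return all Blocks once it finishes."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": document}},
        FeatureTypes=["TABLES"],
    )
    job_id = job["JobId"]  # the start call returns a JobId, not Blocks

    # Poll until the job leaves the IN_PROGRESS state.
    while True:
        result = client.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        sleep(poll_seconds)

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(
            "Textract job {} ended with status {}".format(job_id, result["JobStatus"])
        )

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result["Blocks"])
    while "NextToken" in result:
        result = client.get_document_analysis(JobId=job_id, NextToken=result["NextToken"])
        blocks.extend(result["Blocks"])
    return blocks
```

In the script above, `blocks = fetch_analysis_blocks(client, s3BucketName, documentName)` could stand in for the `start_document_text_detection` call and the `response['Blocks']` lookup. Polling is the simplest option for a one-off script; a deployed pipeline would more commonly use the job's completion notification.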

Please make the solution available on other region

This is a very good sample solution. Please make it available in other regions (via the CloudFormation launch option).

README.md error about synchronous calls

README.md states:

"Making a synchronous call to query Textract API is not possible for multi-page PDF documents"

This phrase can be improved, since synchronous calls are not possible for any PDF file, regardless of whether it is a single-page or multi-page document.

For reference: https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
Documents for synchronous operations can be in PNG or JPEG format. Documents for asynchronous operations can also be in PDF format.
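To make the distinction concrete, here is a small sketch that chooses a call mode from the file extension. It relies only on the documentation sentence quoted above; the supported formats may differ in other Textract releases. The function name `textract_call_mode` and the two format sets are hypothetical.

```python
import os

# Formats taken from the documentation sentence quoted above.
SYNC_FORMATS = {".png", ".jpg", ".jpeg"}
ASYNC_ONLY_FORMATS = {".pdf"}


def textract_call_mode(document_name):
    """Return 'sync' when a synchronous call is possible, 'async' when it is not."""
    extension = os.path.splitext(document_name)[1].lower()
    if extension in SYNC_FORMATS:
        return "sync"
    if extension in ASYNC_ONLY_FORMATS:
        return "async"
    raise ValueError("Unsupported document format: {}".format(extension))
```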

Non-provisionable CFN stack

Problem #1

Change the bucket name for the Lambda function code, since bucket names are global.
Zip the Lambda functions first (textract-lambda-code.zip) and upload the archive to the bucket above.
Set the CloudWatch Logs role ARN for TextractDemoAPIDeployment.

Feature request - Add API authentication or make API private

Hi,

Currently, the /retrievedocumentanalysisresult and /retrievetextdetectionresult endpoints are public and have no authentication. I recommend adding API authentication (e.g. IAM, Cognito, etc.) or making the API private, to avoid the risk that someone tests this project with sensitive production documents and inadvertently leaves them publicly exposed. Yes, the chances are slim, since the requester needs to know the bucket and key name for the API to return results, but I still wanted to suggest this change.

Overly permissive IAM permission

amazon-textract-enhancer/templates/textract-api-stack.json

Line 304 in dd175cf

"Action": "iam:*",

This permission should be restricted: the function has an iam:* policy with Resource: *.

This level of permission is a potential security risk.
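Statements like this one can be located mechanically. The sketch below scans a parsed CloudFormation template for Allow statements that combine a wildcard action (such as `iam:*` or `*`) with `Resource: *`. It assumes policies appear either inline under a resource's `Policies` property or as a top-level `PolicyDocument` property; the layout of textract-api-stack.json has not been verified here, and the function names are hypothetical.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(template):
    """Return (logical_id, action) pairs for wildcard actions granted on all resources."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        props = resource.get("Properties", {})
        # Inline role policies sit under "Policies"; standalone policies
        # carry a "PolicyDocument" property directly.
        documents = [p.get("PolicyDocument", {}) for p in props.get("Policies", [])]
        if "PolicyDocument" in props:
            documents.append(props["PolicyDocument"])
        for document in documents:
            for statement in _as_list(document.get("Statement", [])):
                if statement.get("Effect") != "Allow":
                    continue
                if "*" not in _as_list(statement.get("Resource", [])):
                    continue
                for action in _as_list(statement.get("Action", [])):
                    if action == "*" or (isinstance(action, str) and action.endswith(":*")):
                        findings.append((logical_id, action))
    return findings
```

The template can be loaded with `json.load` and passed in as a dictionary. Each finding is a candidate for narrowing to the specific actions and resource ARNs the function actually needs.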

S3 Event sends URL encoded key names, causing Lambda handler to fail on Textract API calls

Hi,

Per S3 docs:

The s3 key provides information about the bucket and object involved in the event. The object key name value is URL encoded. For example, "red flower.jpg" becomes "red+flower.jpg" (Amazon S3 returns "application/x-www-form-urlencoded" as the content type in the response).

The current extract-Enhancer-TextractAsyncJobSubmitFunction Lambda does not URL decode the S3 key received in the event JSON, so I'm receiving errors when trying to parse objects that contain spaces or other URL-encoding relevant characters.

For example, if I upload a document named my test.pdf, the S3 event sent to the extract-Enhancer-TextractAsyncJobSubmitFunction function contains the key Records[0].s3.object.key = my+test.pdf.

The Textract API calls textract.start_document_analysis() and textract.start_document_text_detection() then fail because the DocumentLocation parameter has a value of my+test.pdf when it should instead be my test.pdf.

Can you add URL decoding to the S3 key name in the received events?
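A minimal sketch of the requested decoding, using `urllib.parse.unquote_plus` from the Python standard library, which converts `+` to a space and resolves percent-escapes. The helper name `document_location_from_record` is hypothetical; the record layout follows the `Records[0].s3.object.key` path mentioned above.

```python
from urllib.parse import unquote_plus


def document_location_from_record(record):
    """Build a Textract DocumentLocation from one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL encoded in S3 events, so decode before use.
    key = unquote_plus(record["s3"]["object"]["key"])
    return {"S3Object": {"Bucket": bucket, "Name": key}}
```

The returned dictionary could then be passed as the `DocumentLocation` argument of `textract.start_document_analysis()` or `textract.start_document_text_detection()`.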

Process fails when input file contains spaces

The TextractAsyncJobSubmitFunction Lambda reports success but actually fails to process the file. The following errors can be seen in CloudWatch:

An error occurred (InvalidParameterException) when calling the StartDocumentAnalysis operation: Request has invalid parameters
An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters

Two issues: 1. the problem specified above, and 2. the Lambda itself should probably fail under these circumstances (can't perform operation after retries).
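For the second point, one possible shape is a retry wrapper that raises once the attempts are exhausted, so that the exception propagates out of the handler and the invocation is recorded as failed. This is a sketch, not the project's code: `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Run operation(); retry on any exception and raise if every attempt fails."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    # Surface the failure instead of returning normally.
    raise RuntimeError("operation failed after {} attempts".format(attempts)) from last_error
```

Note that retrying would not help with the InvalidParameterException shown above, since the request itself is invalid; the point of the wrapper is only that the failure is no longer swallowed.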

Thanks for this code.

Issue with Policy Names

Hello, I am currently trying to deploy the enhancer in my AWS environment. Unfortunately, I keep running into a problem with the policy names. The Lambda functions assume the name "LambdaTextractRole", but the policy is created with the name "us-east-1-LambdaTextractRole". Also, the CloudFormation stack does not finish executing (as you can see in the video).

Does anyone know what is going wrong here?

KeyError: 'Relationships' thrown when parsing Textract page results in textract_util.py

Hi,

I'm running the demo project with a PDF and receiving the error below. I'm working on investigating, but wanted to open the issue for tracking.

START RequestId: 0b976833-8688-42b4-9865-fe3e2955f984 Version: $LATEST
1 messages recieved
JobId = d1069774cc8edcbf66b12ff7497476fab0f653e80368d24de569e5e135a4e56b
Status = SUCCEEDED
Timestamp = 1567367135
API = StartDocumentTextDetection
JobTag = TextractTextDetectionJob-e925e04aeb58efc79afb74f5c6953cc2
S3ObjectName = My Test.pdf
S3Bucket = 544941453660-scanned-documents
upload_prefix = d1069774cc8edcbf66b12ff7497476fab0f653e80368d24de569e5e135a4e56b
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 1000 Blocks from Textract Text Detection response
Retrieved 890 Blocks from Textract Text Detection response
5890 Blocks retrieved
Extracted Block Types:
PAGE = 10
LINE = 2353
WORD = 3527
Page-1 contains 184 Lines
Page-2 contains 388 Lines
Page-3 contains 487 Lines
Page-4 contains 251 Lines
Page-5 contains 262 Lines
Page-6 contains 352 Lines
Page-7 contains 371 Lines
Page-8 contains 23 Lines
Page-9 contains 35 Lines
'Relationships': KeyError
Traceback (most recent call last):
File "/var/task/detect-text-postprocess-page.py", line 74, in lambda_handler
document_text, num_lines = extractTextBody(blocks)
File "/var/task/textract_util.py", line 423, in extractTextBody
print("Page-{} contains {} Lines".format(page['Page'], len(page['Relationships'][0]['Ids'])))
KeyError: 'Relationships'

END RequestId: 0b976833-8688-42b4-9865-fe3e2955f984
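The log shows 10 PAGE blocks but stops after Page-9, so the tenth page presumably has no child lines (for example a blank page), in which case the PAGE block carries no `Relationships` key. A defensive way to count lines might look like the sketch below; `count_page_lines` is a hypothetical helper, not a function in textract_util.py.

```python
def count_page_lines(page_block):
    """Return the number of child blocks of a PAGE block, or 0 when it has none."""
    relationships = page_block.get("Relationships") or []
    return sum(
        len(relationship["Ids"])
        for relationship in relationships
        if relationship["Type"] == "CHILD"
    )
```

The failing print statement could then use `count_page_lines(page)` in place of `len(page['Relationships'][0]['Ids'])`.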

Stack fails to launch when NewDocumentBucketNeeded = false

If you change the NewDocumentBucketNeeded parameter to false then the stack fails to create with an error message. I didn't grab the error message but it should be easy to reproduce.
