microsoft / pubsec-info-assistant

Information Assistant, built with Azure OpenAI Service, Industry Accelerator

License: MIT License


pubsec-info-assistant's Introduction

Information Assistant Accelerator

Important

As of November 15, 2023, Azure Cognitive Search has been renamed to Azure AI Search. Azure Cognitive Services have also been renamed to Azure AI Services.


This industry accelerator showcases integration between Azure and OpenAI's large language models. It leverages Azure AI Search for data retrieval and ChatGPT-style Q&A interactions. Using the Retrieval Augmented Generation (RAG) design pattern with Azure OpenAI's GPT models, it provides a natural language interaction to discover relevant responses to user queries. Azure AI Search simplifies data ingestion, transformation, indexing, and multilingual translation.

The accelerator adapts prompts based on the model type for enhanced performance. Users can customize settings like temperature and persona for personalized AI interactions. It offers features like explainable thought processes, referenceable citations, and direct content for verification.

Please see this video for use cases that may be achievable with this accelerator.

Response Generation Approaches

Work (Grounded)

It utilizes a retrieval-augmented generation (RAG) pattern to generate responses grounded in specific data sourced from your own dataset. By combining retrieval of relevant information with generative capabilities, it can produce responses that are not only contextually relevant but also grounded in verified data. The RAG pipeline accesses your dataset to retrieve relevant information before generating responses, ensuring accuracy and reliability. Additionally, each response includes a citation to the document chunk from which the answer is derived, providing transparency and allowing users to verify the source. This approach is particularly advantageous in domains where precision and factuality are paramount. Users can trust that the responses generated are based on reliable data sources, enhancing the credibility and usefulness of the application. Specific information on our Grounded (RAG) approach can be found in RAG.
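
The flow can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the accelerator's actual middleware in app/backend/; the endpoint, key, index, and field names are hypothetical, and it assumes the azure-search-documents and openai 0.x Python packages:

# Minimal RAG sketch: retrieve relevant chunks from Azure AI Search, then ask
# the model to answer using only those chunks, citing them as [n].
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

openai.api_type = "azure"
openai.api_base = "https://<your-aoai>.openai.azure.com"   # hypothetical
openai.api_version = "2023-06-01-preview"
openai.api_key = "<aoai-key>"                              # hypothetical

search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",   # hypothetical
    index_name="all-files-index",
    credential=AzureKeyCredential("<search-key>"),
)

def grounded_answer(question: str) -> str:
    # Retrieve the top chunks relevant to the question ("content" is an assumed field name).
    chunks = [doc["content"] for doc in search_client.search(search_text=question, top=3)]
    sources = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    # Ground the completion in the retrieved context and request citations.
    messages = [
        {"role": "system",
         "content": "Answer ONLY from the sources below and cite them as [n].\n" + sources},
        {"role": "user", "content": question},
    ]
    result = openai.ChatCompletion.create(engine="gpt-35-turbo-16k", messages=messages)
    return result.choices[0].message.content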

Ungrounded

It leverages the capabilities of a large language model (LLM) to generate responses in an ungrounded manner, without relying on external data sources or retrieval-augmented generation techniques. The LLM has been trained on a vast corpus of text data, enabling it to generate coherent and contextually relevant responses solely based on the input provided. This approach allows for open-ended and creative generation, making it suitable for tasks such as ideation, brainstorming, and exploring hypothetical scenarios. It's important to note that the generated responses are not grounded in specific factual data and should be evaluated critically, especially in domains where accuracy and verifiability are paramount.

Work and Web

It offers two grounded response options: one generated through our retrieval-augmented generation (RAG) pipeline, and the other grounded in content directly from the web. When users opt for the RAG response, they receive a grounded answer sourced from your data, complete with citations to document chunks for transparency and verification. Conversely, selecting the web response provides access to a broader range of sources, potentially offering more diverse perspectives. Each web response is grounded in content from the web and accompanied by citations of web links, allowing users to explore the original sources for further context and validation. Upon request, it can also generate a third, comparative response that compares and contrasts the two. This comparative analysis allows users to make informed decisions based on the reliability, relevance, and context of the information provided. Specific information about our Work and Web approach can be found in Web.

Assistants

It generates responses by using an LLM as a reasoning engine. The key strength lies in the agent's ability to autonomously reason about tasks, decompose them into steps, and determine the appropriate tools and data sources to leverage, all without the need for predefined task definitions or rigid workflows. This approach allows for a dynamic and adaptive response generation process without a predefined set of tasks. It harnesses the capabilities of the LLM to understand natural language queries and generate responses tailored to specific tasks. These agents are being released in preview mode as we continue to evaluate and mitigate the potential risks associated with autonomous reasoning, such as misuse of external tools, lack of transparency, biased outputs, privacy concerns, and remote code execution vulnerabilities. In future releases, we plan to enhance the safety and robustness of these autonomous reasoning capabilities. Specific information on our preview agents can be found in Assistants.
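
Conceptually, the autonomous loop looks something like the sketch below. This is a pure illustration of the reason/act pattern, not the accelerator's Assistants implementation; the llm() helper and the tool names are hypothetical stand-ins:

# Illustrative agent loop: the LLM picks a tool, the host runs it, and the
# observation is fed back until the LLM emits a final answer.
def llm(transcript: str) -> str:
    """Hypothetical stand-in for a chat-completion call that returns either
    'TOOL <name> <input>' or 'FINAL <answer>'."""
    raise NotImplementedError

TOOLS = {
    "search": lambda q: f"search results for {q!r}",  # e.g. backed by Azure AI Search
}

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        decision = llm(transcript)
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()
        _, name, arg = decision.split(" ", 2)   # autonomous tool selection
        observation = TOOLS[name](arg)
        transcript += f"\n{decision}\nObservation: {observation}"
    return "Step limit reached without a final answer."

Bounding the number of steps and constraining the tool set are two of the simpler mitigations for the risks listed above.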

Features

The IA Accelerator contains several features, many of which have their own documentation.

  • Examples of custom Retrieval Augmented Generation (RAG), Prompt Engineering, and Document Pre-Processing
  • Azure AI Search Integration to include text search of both text documents and images
  • Customization and Personalization to enable enhanced AI interaction
  • Preview into autonomous agents

For a detailed review see our Features page.

Process Flow for Work (Grounded), Ungrounded, and Work and Web

Process Flow for Chat

Process Flow for Assistants

Azure account requirements

IMPORTANT: In order to deploy and run this example, you'll need:

  • Azure account. If you're new to Azure, get an Azure account for free and you'll get some free Azure credits to get started.
  • Azure subscription with access enabled for the Azure OpenAI service. You can request access with this form.
    • Access to one of the following Azure OpenAI models:

      Model Name Supported Versions
      gpt-35-turbo current version
      gpt-35-turbo-16k current version
      gpt-4 current version
      gpt-4-32k current version

Important: gpt-35-turbo-16k (0613) is recommended. GPT-4 models may achieve better results with the IA Accelerator.

• (Optional) Access to the following Azure OpenAI model for embeddings. Some open source embedding models may perform better for your specific data or use case. For the use case and data that Information Assistant was tested against, we recommend using the following Azure OpenAI embedding model.

      Model Name Supported Versions
      text-embedding-ada-002 current version
  • Azure account permissions:
    • Your Azure account must have Microsoft.Authorization/roleAssignments/write permissions, such as Role Based Access Control Administrator, User Access Administrator, or Owner on the subscription.
    • Your Azure account also needs Microsoft.Resources/deployments/write permissions on the subscription level.
    • Your Azure account also needs microsoft.directory/applications/create and microsoft.directory/servicePrincipals/create permissions, such as the Application Administrator Entra built-in role.
  • To have accepted the Azure AI Services Responsible AI Notice for your subscription. If you have not manually accepted this notice, please follow our guide at Accepting Azure AI Service Responsible AI Notice.
  • (Optional) Have Visual Studio Code installed on your development machine. If your Azure tenant and subscription have conditional access policies or device policies, you may need to open your GitHub Codespaces in VS Code to satisfy the required policies.

Deployment

Please follow the instructions in the deployment guide to install the IA Accelerator in your Azure subscription.

Once completed, follow the instructions for using IA Accelerator for the first time.

You may choose to view the deployment and usage click-through guides to see the steps in action. These videos may be useful to help clarify specific steps or actions in the instructions.

Responsible AI

The Information Assistant (IA) Accelerator and Microsoft are committed to the advancement of AI driven by ethical principles that put people first.

Transparency Note

Read our Transparency Note

Find out more with Microsoft's Responsible AI resources

Content Safety

Content safety is provided through the Azure OpenAI service. The Azure OpenAI Service includes a content filtering system that runs alongside the core AI models. This system uses an ensemble of classification models to detect four categories of potentially harmful content (violence, hate, sexual, and self-harm) at four severity levels (safe, low, medium, high). These four categories may not be sufficient for all use cases, especially for minors. Please read our Transparency Note.

By default, the content filters are set to filter out prompts and completions that are detected as medium or high severity for those four harm categories. Content labeled as low or safe severity is not filtered.
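
As a plain-Python illustration of that default policy (a sketch of the described behavior, not the service's implementation):

# The default decision described above: medium/high severity content in any
# of the four harm categories is filtered; safe/low passes through.
CATEGORIES = ("violence", "hate", "sexual", "self-harm")
FILTERED_SEVERITIES = {"medium", "high"}

def is_filtered(severities: dict[str, str]) -> bool:
    """severities maps each category to 'safe' | 'low' | 'medium' | 'high'."""
    return any(severities.get(category) in FILTERED_SEVERITIES for category in CATEGORIES)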

There are optional binary classifiers/filters that can detect jailbreak risk (trying to bypass filters) as well as existing text or code pulled from public repositories. These are turned off by default, but some scenarios may require enabling the public content detection models to retain coverage under the customer copyright commitment.

The filtering configuration can be customized at the resource level, allowing customers to adjust the severity thresholds for filtering each harm category separately for prompts and completions.

This provides controls for Azure customers to tailor the content filtering behavior to their needs while aiming to prevent potentially harmful generated content and any copyright violations from public content.

Instructions on how to configure content filters via Azure OpenAI Studio can be found at https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/content-filters#configuring-content-filters-via-azure-openai-studio-preview

Data Collection Notice

The software may collect information about you and your use of the software and send it to Microsoft. Microsoft may use this information to provide services and improve our products and services. You may turn off the telemetry as described in the repository. There are also some features in the software that may enable you and Microsoft to collect data from users of your applications. If you use these features, you must comply with applicable law, including providing appropriate notices to users of your applications together with a copy of Microsoft’s privacy statement. Our privacy statement is located at https://go.microsoft.com/fwlink/?LinkID=824704. You can learn more about data collection and use in the help documentation and our privacy statement. Your use of the software operates as your consent to these practices.

About Data Collection

Data collection by the software in this repository is used by Microsoft solely to help justify the efforts of the teams who build and maintain this accelerator for our customers. It is your choice to leave this enabled, or to disable data collection.

Data collection is implemented by the presence of a tracking GUID in the environment variables at deployment time. The GUID is associated with each Azure resource deployed by the installation scripts. This GUID is used by Microsoft to track the Azure consumption this open source solution generates.

How to Disable Data Collection

To disable data collection, follow the instructions in the Configure ENV files section for ENABLE_CUSTOMER_USAGE_ATTRIBUTION variable before deploying.
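
For example (assuming the variable lives in scripts/environments/local.env, the configuration file referenced elsewhere in this repo, and that false is the accepted opt-out value):

# scripts/environments/local.env
ENABLE_CUSTOMER_USAGE_ATTRIBUTION=false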

Resources

Navigating the Source Code

This project has the following structure:

File/Folder Description
.devcontainer/ Dockerfile, devcontainer configuration, and supporting script to enable both GitHub Codespaces and local DevContainers.
app/backend/ The middleware part of the IA website that contains the prompt engineering and provides an API layer for the client code to pass through when communicating with the various Azure services. This code is Python-based and hosted as a Flask app.
app/enrichment/ The text-based file enrichment process that handles language translation, embedding the text chunks, and inserting text chunks into the Azure AI Search hybrid index. This code is Python-based and is hosted as a Flask app that subscribes to an Azure Storage Queue.
app/frontend/ The User Experience layer of the IA website. This code is TypeScript-based, hosted as a Vite app, and compiled using npm.
azure_search/ The configuration of the Azure Search hybrid index that is applied in the deployment scripts.
docs/adoption_workshop/ PPT files that match what is covered in the Adoption Workshop videos in Discussions.
docs/deployment/ Detailed documentation on how to deploy and start using Information Assistant.
docs/features/ Detailed documentation of specific features and development level configuration for Information Assistant.
docs/ Other supporting documentation that is primarily linked to from the other markdown files.
functions/ The pipeline of Azure Functions that handle the document extraction and chunking as well as the custom CosmosDB logging.
infra/ The Terraform scripts that deploy the entire IA Accelerator. The overall accelerator is orchestrated via the main.tf file but most of the resource deployments are modularized under the core folder.
pipelines/ Azure DevOps pipelines that can be used to enable CI/CD deployments of the accelerator.
scripts/environments/ Deployment configuration files. This is where all external configuration values will be set.
scripts/ Supporting scripts that perform the various deployment tasks such as infrastructure deployment, Azure WebApp and Function deployments, building of the webapp and functions source code, etc. These scripts align to the available commands in the Makefile.
tests/ Functional Test scripts that are used to validate a deployed Information Assistant's document processing pipelines are working as expected.
Makefile Deployment command definitions and configurations. You can use make help to get more details on available commands.
README.md Starting point for this repo. It covers overviews of the Accelerator, Responsible AI, Environment, Deployment, and Usage of the Accelerator.

References

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Reporting Security Issues

For security concerns, please see the Security Guidelines.

pubsec-info-assistant's People

Contributors

arpitaisan0maly, asbanger, avidunixuser, dayland, dayland-ms, dependabot[bot], georearl, github-merge-queue[bot], hyoshioka0128, juliansoh, kronemeyerjoshua, kylewerts, lmwilki, lon-tierney, mausolfj, microsoft-github-operations[bot], microsoftopensource, nhwkuhns, paullizer, rohrerb, ryonsteele, wotey


pubsec-info-assistant's Issues

Action required: self-attest your goal for this repository

It's time to review and renew the intent of this repository

An owner or administrator of this repository has previously indicated that this repository can not be migrated to GitHub inside Microsoft because it is going public, open source, or it is used to collaborate with external parties (customers, partners, suppliers, etc.).

Action

👀 ✍️ In order to keep Microsoft secure, we require repository owners and administrators to review this repository and regularly renew the intent to either opt in or opt out of migration to GitHub inside Microsoft, which is specifically intended for private or internal projects.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived. 🔒

Instructions

❌ Opt-out of migration

If this repository can not be migrated to GitHub inside Microsoft, you can opt out of migration by replying with a comment on this issue containing one of the optout command options below.

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : My project will ship as Open Source
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

✅ Opt-in to migrate

If the circumstances of this repository have changed and you decide that you need to migrate, then you can specify the optin command below. For example, the repository is no longer going public or open source, or no longer requires external collaboration.

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

Click here for more information about optin and optout command options and examples

Opt-in

@gimsvc optin --date <target_migration_date>

When opting in to migrate your repository, the --date option is required, followed by your specified migration date in the format mm-dd-yyyy.

@gimsvc optin --date 03-15-2023

Opt-out

@gimsvc optout --reason <staging|collaboration|delete|other>

When opting out of migration, you need to specify the --reason.

  • staging
    • My project will ship as Open Source
  • collaboration
    • Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete
    • This repository will be deleted because it is no longer needed.
  • other
    • Other reasons not specified

Examples:

@gimsvc optout --reason staging

@gimsvc optout --reason collaboration

@gimsvc optout --reason delete

@gimsvc optout --reason other


Sizing Estimator - Missing costestimator.md in the docs directory

Describe the bug
The main README.md has a section on the Sizing Estimator, which points to a nonexistent document, costestimator.md, in the docs folder.

Alpha version details

  • GitHub branch: main
  • Latest commit: [obtained by running git log -n 1 <branchname>]
$ git log -n 1 main
commit b5dc8b368758b479b4f94b061b43c4b8ac94cb00 (HEAD -> main, origin/main, origin/HEAD)
Merge: ca99857 797ff84
Author: dayland <[email protected]>
Date:   Fri Jun 23 08:11:16 2023 +0000

    Merge pull request #101 from microsoft/dayland/5707-missing-search-indexer-property
    
    add "allowSkillsetToReadFileData" property to search indexer

Support multiple front ends (Desktop, PVA, Power Platform, D365, Copilots) using APIs or an SDK

Is your feature request related to a problem? Please describe.
Currently, IA is a website. If it is needed on other platforms (Desktop, Power Platform, SharePoint), the only way is to use an IFrame, which is not preferred.

Possible solution
Abstract all functionalities into an API layer (or possibly an SDK) with API security. That way, IA becomes a hosting-platform-agnostic solution and can be called by anybody.
By doing so, I can leverage only IA's ingestion process and mix and match.

Non-PDF text-based documents should not queue to text_enrichment queue in 0.3-Gamma

Describe the bug
Currently, non-PDF text-based files get queued to the "text_enrichment" queue after chunking is complete, which is not needed.

Expected behavior
Non-PDF text-based files should report a status of "Complete" after chunking is finished.

Desktop (please complete the following information):

  • OS: Windows 11
  • Browser: Edge

Version details

  • GitHub branch: main
  • Latest commit: commit a2a9f4e (HEAD -> main, origin/main, origin/HEAD)

Additional context
None

Deploying vNext-Dev error

When attempting to deploy the latest version of the vNext-Dev branch an error is occurring that prevents the deployment from completing.

Failed to parse main.parameters.json with exception:
Failed to parse 'main.parameters.json', please check whether it is a valid JSON format

The error appears to be caused by a missing value in the main.parameters.json when it is written out during the build process

"chatGptDeploymentCapacity": {
"value":
},
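
A quick pre-deploy sanity check for the generated file, using only the Python standard library (a suggested diagnostic, not part of the build):

# Fail fast if the generated main.parameters.json is not valid JSON,
# e.g. because a value such as chatGptDeploymentCapacity was left empty.
import json
import sys

try:
    with open("main.parameters.json") as f:
        json.load(f)
except json.JSONDecodeError as e:
    sys.exit(f"main.parameters.json is invalid: {e}")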

This may be related to issue #282.

Container didn't respond to HTTP pings on port: 8000

Discussed in #285

Originally posted by andresravinet October 18, 2023
Can anyone help me? I'm getting an Application Error when browsing to the web app. And the below is what is showing up in the log.
Container infoasst-web-6cq0e_0_17db8b46 didn't respond to HTTP pings on port: 8000, failing site start. See container logs for debugging.

Citations not showing 100% of the time

Detailed Description
Citations are not showing 100% of the time in responses. This is more prevalent in GPT 3.5 with chat completion API (more likely to not get citation versus get citation). Citations are more consistent in GPT 4.

Expected behavior
Any response that is not a 'I do not have enough information' should have citations for the answer.

Version details

  • All

ChatGPT Model versions updated in Australia East

Describe the bug
Error when doing deployment to Australia East
InvalidResourceProperties - The specified SKU 'Standard' for model 'gpt-35-turbo 0301' is not supported in this region 'australiaeast'.
To fix this, I updated main.bicep line 147 to version: '0613', as this is the only version available in Australia East.

To Reproduce
Steps to reproduce the behavior:
In a bash terminal: $ make deploy
Expected behavior
Should complete deployment

Alpha version details

  • GitHub branch: main
  • Latest commit: [obtained by running git log -n 1 <branchname>]


The ability (or instructions) on how to remove documents from the system

In our testing of the accelerator we have noticed that some documents become dominant, and/or are uploaded in error and skew the results. Currently the way to resolve this appears to be to delete the whole accelerator instance and start again. While this is OK during development iterations, once we get to a "Production" instance the ability to manage the data will be critical. Could we get the ability to remove documents from the 'content library' in a future build?

Fail to process documents due to missing language pickle file

Describe the bug
Upon installation, uploaded documents fail to process, with an error message and a 200 response code in the logs.

Upon investigation, it appears that the download of the English language file has failed, though the ZIP file is present.

Per other deployments, this ZIP file should have been extracted. During runtime, the service attempts to download the file again, but then sees the ZIP file and stops. The likely root cause is the file not being extracted.

Note: the built-in gzip tool is unable to decompress it, as it does not recognize a ".zip" extension. It is unknown whether that is related to the file not being decompressed.

Version details

  • GitHub branch: main (Delta)

  • Latest commit:

    Merge pull request #351 from microsoft/ryonsteele/6258-deploymentname-documentation

    Update default bicep params to use default flavor model for deployment name


Token limit often exceeded with PDF files

We have some large PDF files, and during the chunking process it often creates chunks that well exceed the "target size". For example, one document (which can be downloaded here) has one chunk over 80,000 tokens in length.

There are several other chunks created from this same file that are smaller but still exceed the target size by a substantial amount.
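
A quick way to verify the oversized chunks, assuming the tiktoken package and the cl100k_base encoding used by the gpt-3.5/gpt-4 model families (the 750-token target below is a hypothetical placeholder):

# Count tokens per chunk to confirm which exceed the target size.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def oversized_chunks(chunks: list[str], target: int = 750) -> list[tuple[int, int]]:
    """Return (index, token_count) for every chunk above the target size."""
    sizes = ((i, len(enc.encode(chunk))) for i, chunk in enumerate(chunks))
    return [(i, n) for i, n in sizes if n > target]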

WebApp Deployment Failed: Diagnostic settings does not support retention for new diagnostic settings.

Describe the bug
Diagnostic settings does not support retention for new diagnostic settings. (Code: BadRequest, Target: /subscriptions//resourceGroups/infoasst-ws-peter/providers/Microsoft.Resources/deployments/web)

Complete log
ERROR: {"status":"Failed","error":{"code":"DeploymentFailed","target":"/subscriptions//providers/Microsoft.Resources/deployments/infoasst-ws-peter","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"ResourceDeploymentFailure","target":"/subscriptions/1/resourceGroups/infoasst-ws-peter/providers/Microsoft.Resources/deployments/web","message":"The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.","details":[{"code":"DeploymentFailed","target":"/subscriptions//resourceGroups/infoasst-ws-peter/providers/Microsoft.Resources/deployments/web","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"BadRequest","target":"/subscriptions/1/resourceGroups/infoasst-ws-peter/providers/Microsoft.Resources/deployments/web","message":"{\r\n "code": "BadRequest",\r\n "message": "Diagnostic settings does not support retention for new diagnostic settings."\r\n}"}]}]}]}}
make: *** [Makefile:18: infrastructure] Error 1

To Reproduce
make deploy

Expected behavior
No error message should appear.


Error when GPT3-5-16k is not available in your region


Errors with new GPT-4 and GPT-3.5 versions

When attempting to use GPT-4 (0613) or GPT-3.5, the following error comes up:
The completion operation does not work with the specified model, gpt-4. Please choose different model and try again. You can learn more about which models can be used with each operation here: https://go.microsoft.com/fwlink/?linkid=2197993.

This looks like a code base error, as the Completion operation is not supported by the gpt-35-turbo (0613) and gpt-35-turbo-16k (0613) models. These models only support the Chat Completions API. Only the older turbo model, GPT-3.5 Turbo (0301), supports both the Chat and Completions APIs. Please refer to the GPT-3.5 models for details.

On top of that, the accelerator is also using the old version of even GPT-3.5, 0301. We need to switch to the new version, 0613, which yields better results. This will require the step above as well.

This should be configurable and not hard-coded in the bicep.

Reference to the Bicep code https://github.com/microsoft/PubSec-Info-Assistant/blob/deb64086a5b1c5b720fdd6e7dbded05db7f8d2cc/infra/main.bicep#L180C9-L187C46

model: {
  format: 'OpenAI'
  name: chatGptModelName
  version: '0301'
}
sku: {
  name: 'Standard'
  capacity: chatGptDeploymentCapacity
}

vite build error: "type" is not exported by "__vite-browser-external"

Describe the bug
When running make deploy, at the vite build stage I get the following errors:

12 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

> [email protected] build
> tsc && vite build

vite v4.4.0 building for production...
node_modules/@azure/storage-blob/dist-esm/storage-blob/src/TelemetryPolicyFactory.js (32:70) "type" is not exported by "__vite-browser-external", imported by "node_modules/@azure/storage-blob/dist-esm/storage-blob/src/TelemetryPolicyFactory.js".
node_modules/@azure/storage-blob/dist-esm/storage-blob/src/TelemetryPolicyFactory.js (32:83) "release" is not exported by "__vite-browser-external", imported by "node_modules/@azure/storage-blob/dist-esm/storage-blob/src/TelemetryPolicyFactory.js".
Terminated
make: *** [Makefile:15: build] Error 143

To Reproduce
Steps to reproduce the behavior:

  1. az login
  2. edit scripts/environment/local.env
  3. make deploy
  4. See error

Expected behavior
The deploy process should complete without errors.


Desktop (please complete the following information):

  • OS: Codespace - Debian GNU/Linux 11 (bullseye), Host OS: Ubuntu 22.04.02 LTS
  • Browser: Firefox (same issue in vscode)
  • Version: Firefox114.0.1 (64-bit)

Alpha version details

  • GitHub branch: main
  • Latest commit: [obtained by running git log -n 1 <branchname>]

commit b5dc8b368758b479b4f94b061b43c4b8ac94cb00 (HEAD -> main, origin/main, origin/HEAD)
Merge: ca99857 797ff84
Author: dayland <[email protected]>
Date:   Fri Jun 23 08:11:16 2023 +0000

    Merge pull request #101 from microsoft/dayland/5707-missing-search-indexer-property
    
    add "allowSkillsetToReadFileData" property to search indexer


Answers do not appear to be created from chunks returned

Describe the bug
You may ask the system a question at the start of a chat (or mid-chat) and notice that the answer provided by the system is not based upon data contained within the response chunks provided to the OpenAI service. This may happen when the answer given is correct or when it is incorrect (unrelated to the accuracy of the answer).

To Reproduce
Steps to reproduce the behavior:

  1. Ask a question that the system does not have information in documents to answer, such as "I use my boat for personal purposes only. Can I deduct the costs of the boat from my taxes?"
  2. Note that the response may be factual, and may cite a document.
  3. Review the cited document and see that the data required is nowhere in the document.
    a. Specific document used: https://files.taxfoundation.org/20210823155834/TaxEDU-Primer-Common-Tax-Questions-Answered.pdf
  4. Review the Thought Process to confirm that no chunk provided contained the answer

Expected behavior
The system should say that it is unable to answer the question, and not cite any document or document chunk.


Desktop (please complete the following information):

  • OS: Windows 11
  • Browser: Edge

Alpha version details

  • GitHub branch: main
  • Tag

Additional context
Tracked in AzDO 5791

Web app doesn't start when a new AOAI resource is used

Describe the bug
With the new Delta version, the web app doesn't start when a new AOAI resource is used.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a brand new Delta version with new AOAI resource (USE_EXISTING_AOAI=false)
  2. The web app expects a model deployment named "chat", but the default name is "gpt-35-turbo-16k"

Expected behavior
Web app up and running :)

Delta version details

  • GitHub branch: main
  • Latest commit: 0cc8d71

Text Enrichment function not quoting blob paths correctly

We have some files with percent (%) signs in their names, which appear to cause an issue when reaching the Text Enrichment stage of the Function App due to the way the get_blob_and_sas function works. Example file name: Unemployment rate back up to 3.7% in October _ Australian Bureau of Statistics.pdf

I would suggest replacing the code that manually substitutes spaces (below) with a proper URL quoting function, like blob_path = urllib.parse.quote(blob_path)

source_blob_path = source_blob_path.replace(" ", "%20")
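
A sketch of the suggested change (urllib.parse.quote is in the standard library; the safe parameter keeps path separators intact):

# Percent-encode the whole blob path instead of replacing spaces only, so file
# names containing '%' (or other reserved characters) round-trip correctly.
import urllib.parse

def quote_blob_path(blob_path: str) -> str:
    return urllib.parse.quote(blob_path, safe="/")  # leave folder separators unencoded

# quote_blob_path("Unemployment rate back up to 3.7% in October.pdf")
# -> 'Unemployment%20rate%20back%20up%20to%203.7%25%20in%20October.pdf'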

Installer seems to ignore authentication setting

Describe the bug
Post-installation, the application requires the end user to add themselves as an authorized user.

To Reproduce
Steps to reproduce the behavior:

  1. Install from main (Delta) with authentication set to false.
  2. Find the website as instructed in the instructions.
  3. Note that an error is received stating that the application has been configured to block users unless they are specifically granted access.

Expected behavior

Authentication setting should be honored.

Version details

  • GitHub branch: main (Delta)

Unable to deploy environment.

Describe the bug
Unable to deploy the accelerator. I am following the instructions. I successfully created the env file.
I also created a workspace and resource group in the Azure subscription.

To Reproduce
Steps to reproduce the behavior:
When I try to build the environment, I get an error.

Expected behavior
Infrastructure should be deployed.


Desktop (please complete the following information):
Windows 11, Visual Studio Code

What am I doing wrong?

Tags can only contain ASCII, fails silently

Describe the bug
Adding tags containing non-ASCII characters when uploading files results in an error.
In the console:
deserializationPolicy.js:164 Uncaught (in promise) RestError: The metadata specified is invalid. It has characters that are not permitted.

The reason is that Azure blob storage metadata can only contain ASCII.

To Reproduce

  1. Go to Manage content, Upload files
  2. Add a tag: rødgrød
  3. Drop a file in the square.

The file is written to the log-database ("uploaded"), but it never actually reaches the blob storage.

Expected behavior

  • At least display an error in the UI.
  • The tag should be encoded in some way and decoded when in use.
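
One possible encoding scheme (an assumption, not the repo's implementation) is to URL-encode tag values before writing them as blob metadata and decode them on read, so non-ASCII tags like 'rødgrød' survive the ASCII-only restriction:

import urllib.parse

def encode_tag(tag: str) -> str:
    return urllib.parse.quote(tag, safe="")  # 'rødgrød' -> 'r%C3%B8dgr%C3%B8d'

def decode_tag(encoded: str) -> str:
    return urllib.parse.unquote(encoded)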

Alpha version details

  • GitHub branch: main

Azure Function not running for uploaded files making them unavailable to chat with

Describe the bug
I am receiving an error message when asking my Accelerator questions. Pictures were included for reference on this issue.


To Reproduce
The error message shows when asking a question as well as when chatting.

Expected behavior
I expected to receive an answer to my question referencing the PDF documents I had uploaded to my accelerator; however, each time I am met with an error message.

Desktop (please complete the following information):

  • OS: Windows 11 Enterprise, 22H2
  • Browser: Microsoft Edge

Model text-davinci-003 not supported

Describe the bug
Can't deploy because the model isn't supported. I tried to create the model in AOAI Studio and it doesn't exist.

This is the error I'm getting...

🎯 Target Resource Group: infoasst-myworkspct

InvalidTemplateDeployment - The template deployment 'infoasst-myworkspct' is not valid according to the validation procedure. The tracking id is '48674f37-c551-4d33-aa0f-017697b7dd2d'. See inner errors for details.
DeploymentModelNotSupported - Creating account deployment is not supported by the model 'text-davinci-003'. This is usually because there are better models available for the similar functionality.
make: *** [Makefile:18: infrastructure] Error 1

Is there a workaround?

First makedeploy error: Error while attempting to download Bicep CLI

When running makedeploy the first time, I got this error:
[Errno 1] Operation not permitted: '/home/vscode/.azure/bin/bicep'

Then I ran make deploy again and got another error:
InvalidTemplate - Deployment template validation failed: 'The template resource 'infoasst-aoai-pwrsg/' for type 'Microsoft.CognitiveServices/accounts/deployments' at line '1' and column '2133' has incorrect segment lengths. A nested resource type must have identical number of segments as its resource name. A root resource type must have segment length one greater than its resource name

Error: Diagnostic settings does not support retention for new diagnostic settings

After a couple of attempts to deploy the PubSec suite (deploy, delete, repeat, switching from Australia East to East US), I began to encounter the error 'Diagnostic settings does not support retention for new diagnostic settings.' and the deployment would fail. With each attempt I had deleted all of the services created by the previous attempt and changed WORKSPACE="" to be unique.

I happened to come across this article:
https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/migrate-to-azure-storage-lifecycle-policy

After changing lines 138, 146, and 156 in the main.bicep file, setting the days value to 0 (instead of the default value of 30), the deployment completed successfully.

Based on the information in the article, we'll need to update these settings after September when the deprecation comes into effect.

Problems with foreign characters

Describe the bug
Foreign characters (Danish "æøå" in this case) are incorrectly displayed in "Citation -> Document section -> Content". For instance, "å" is shown as "Ã¥". This looks like UTF-8 bytes being decoded as a single-byte encoding such as Latin-1.
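
That diagnosis is easy to demonstrate in two lines of Python:

# 'å' encoded as UTF-8 and then wrongly decoded as Latin-1 yields 'Ã¥',
# exactly the corruption reported above; the reverse recovers the text.
assert "å".encode("utf-8").decode("latin-1") == "Ã¥"
assert "Ã¥".encode("latin-1").decode("utf-8") == "å"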

To Reproduce
Steps to reproduce the behavior:

  • Upload an HTML document stored in UTF-8 with BOM. In that file, foreign characters are not escaped (and shouldn't be, because UTF-8).

Expected behavior
The user should be presented with the proper characters in the UI.

Alpha version details

  • GitHub branch: main (gamma)
  • Latest commit: 594c965

Additional context

Deployment without Administrative rights on the Azure subscription

I cannot run a make deploy due to limited rights on our public sector company's Azure subscription, i.e. the requirement to have administrative rights for the subscription. For security reasons it is not common to have those rights in most public sector companies, especially with experimental deployments like this Accelerator. What I have been given is ownership of a resource group, but I would have to create resources in order to deploy this stack, so I am at odds trying to get this application going for demonstration purposes.

A possible solution would be to deploy by reference to the owned resource group, or even a Kubernetes or other containerized build.

I have tried to deploy to an existing resource group under my ownership, to no avail. Alternatively I will build an application myself, but that is out of scope for this demonstrator, and I do believe that easier deployment is good for these Accelerators.


OpenAPI specification for backend

Is your feature request related to a problem? Please describe.
Some customers would like to use only the backend part of IA and bring their own frontend. There is also a requirement for APIM integration.

Describe the solution you'd like
An OpenAPI (Swagger) specification for the chat endpoint and other endpoints, in the same way as the enrichment web app.

Inconsistencies in File Upload -> Queued/Processing

We're having issues in our instance with uploading files ... Word & PDF docs. Some of our test docs are only 15-20 KB, some PDFs 3-5 MB.

File upload appears to work (upload bar goes grey + green tick), but the file doesn't turn up in the queue in Upload Status. This works sometimes (we have a mix of 15-20 files uploaded) but sometimes not (seemingly most of the time over the past week or so).

Various users, different times of the day.

We have adjusted the Cognitive Search indexer to a 5-minute interval, but I don't think it's even making it into the queue.

Is anyone having a similar issue ... or thoughts on things to check/fix, please?

++++

release: 0.3 Gamma

Instance: infoasst-aoai-3z1n0
Deployment Name: gpt-35-turbo-16k
Model Name: gpt-35-turbo-16k
Model Version: 0613

Azure Cognitive Search
Service Name: infoasst-search-3z1n0
Index Name: all-files-index

System Configuration
System Language: English


Desktop
Windows 11 23H2
Edge: 119.0.2151.12
... but issue is affecting various other users/desktop/browser combinations

Error when using danish texts

Describe the bug
When uploading Danish documents, they are not processed. If the documents are in English, they are successfully processed.

To Reproduce

  1. Upload a Danish text.
  2. In the upload status of the web app, the document appears as "Completed".
  3. The chat does not use the document to answer.
  4. Going to the Azure Portal and inspecting the Search Service, we see that the all-files indexer is returning an error.
  5. The error is not very informative. However, if we create a 'Debug session' in the Search Service on the Danish PDF, we see more information.
  6. It seems the PII skill does not support the Danish language, but the documentation describes that it does: https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/language-support?tabs=documents
  7. We have tried changing the PII language in the code to EN, but without luck (in the language specification file). It is still the same error.

Expected behavior
That Danish files work.

Desktop (please complete the following information):

  • Azure environment
  • Browser: chrome and safari

Alpha version details

  • GitHub branch: 0.3 Main
  • Latest commit: 0f9d4f6

ERROR: failed to export: exporting app layers: caching layer

Describe the bug
After running azd up, it creates the necessary docker images; everything seems fine until this last step, when I received this error:

"
[exporter] ERROR: failed to export: exporting app layers: caching layer (sha256:97e68c8f07ff95f2a9c3d994aa8fb9800cda648748531e3d9f595e53c29e9351): write /launch-cache/staging/sha256:97e68c8f07ff95f2a9c3d994aa8fb9800cda648748531e3d9f595e53c29e9351.tar: no space left on device
, stderr: [builder]
[builder] [notice] A new release of pip is available: 23.2.1 -> 23.3.1
[builder] [notice] To update, run: pip install --upgrade pip
ERROR: failed to build: executing lifecycle: failed with status code: 62
"

I have over 300 GB of free space left on my Mac.

Desktop (please complete the following information):

  • OS: MacOS Sonoma on Apple Silicon

GitHub Broken Links

Describe the bug
Users are reporting broken GitHub links while navigating to the documentation and other GitHub pages.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '[Readme.md] (https://github.com/docs/development_environment.md)'
  2. Click on 'URLs within the readme.md file'
  3. See error


Deployment Error: Insufficient quota - GPT-35-Turbo-16K

Describe the bug
Users are facing issues while deploying "gpt-35-turbo-16k", as the default capacity is set to 720.

To Reproduce
Steps to reproduce the behavior:

  1. Run "make deploy"


URL/Website content extraction

Feature request
Many federal customers have public documents (PDFs) and websites (including FAQs) that they would like to search using Info-Assistant.

Additional Details
Support crawling and extracting content from URL/website with recursion up to a certain configurable depth. Also provide support for filtering out certain URLs like forms, pages that call APIs (like office locator and such) and/or certain domains.

Support for M365 (SharePoint Online, OneDrive) content

Issue at hand
Almost all federal customers have a big M365 presence and lots of documentation in SharePoint Online (SPO) and OneDrive. A few of my customers (VA, BEA, and NIH) have docs in SPO.

A possible solution
Extract and index documents and pages (site contents) in SPO and OneDrive using the Graph SDK/REST.

Describe alternatives you've considered
Other ways are to use Logic Apps to get data from SPO, store it in blob storage, and then use Cognitive Search. That is very painful and adds complexity, like data sync issues, duplicate storage, and access and retention concerns.

MacOS / bash 5.2 compatibility updates

MacOS 13.5, M1 Pro.

deploy-enrichment-webapp.sh, deploy-webapp.sh:

# original
# end=`date -u -d "3 years" '+%Y-%m-%dT%H:%MZ'`

# Bash 5.2 - friendly
end=$(date -u -v+3y '+%Y-%m-%dT%H:%MZ')

inf-create.sh:

# original
#randomString=$(mktemp --dry-run XXXXX)
#randomString="${randomString,,}"

# Bash 5.2 - friendly
randomString=$(mktemp -u XXXXX)
randomString=$(echo "$randomString" | tr '[:upper:]' '[:lower:]')

Azure Sample Estimation: AI Document Intelligence is picking up large SKU

Describe the bug
When using the AI Document Intelligence service to estimate the Azure sample costs, the service selects the large SKU option by default. This results in an overestimation of the Azure costs for a sandbox environment.

Expected behavior
The expected behavior is that the service should either select the default or the smallest SKU option, or prompt the user to choose the SKU size if the document does not provide it.


Exception in utilities.py/build_chunks

There is an exception being raised here when processing a non-PDF document, as document_map['structure'][0]["subtitle"] is only set in build_document_map_pdf, not build_document_map_html.

I have fixed this in my code as below:

# Assumes `soup` is a BeautifulSoup parse of the HTML document and
# `document_map` has been initialised with a "structure" list, e.g.:
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(html, "html.parser")
#   document_map = {"structure": []}
section = ''
subtitle = ''
title = ''

for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'table']):
	if tag.name in ['h3', 'h4', 'h5', 'h6']:
		section = tag.get_text(strip=True)
	elif tag.name == 'h2':
		subtitle = tag.get_text(strip=True)
	elif tag.name == 'h1':
		title = tag.get_text(strip=True)
	elif tag.name == 'p' and tag.get_text(strip=True):
		document_map["structure"].append({
			"type": "text", 
			"text": tag.get_text(strip=True),
			"title": title,
			"subtitle": subtitle,
			"section": section,
			"page_number": 1                
			})
	elif tag.name == 'table' and tag.get_text(strip=True):
		document_map["structure"].append({
			"type": "table", 
			"text": str(tag),
			"title": title,
			"subtitle": subtitle,
			"section": section,
			"page_number": 1                
			})

Deployment fails with a media service error in northcentralus

FYI - when I tried to deploy to LOCATION="northcentralus", I got the following error for the media service:

{
    "status": "Failed",
    "error": {
        "code": "BadRequest",
        "message": "Creation of new Media Service accounts are not allowed as the resource has been deprecated.",
        "details": [
            {
                "code": "MediaServiceAccountCreationDisabled",
                "message": "Creation of new Media Service accounts are not allowed as the resource has been deprecated."
            }
        ]
    }
}

Changing to LOCATION="eastus" deployed without errors.

InvalidTemplate - Deployment template validation failed - During infrastructure deploy.

Describe the bug

The configuration value of bicep.use_binary_from_path has been set to 'false'.
Successfully installed Bicep CLI to "/home/vscode/.azure/bin/bicep".
InvalidTemplate - Deployment template validation failed: 'The template resource 'infoasst-aoai-9qt5q/' for type 'Microsoft.CognitiveServices/accounts/deployments' at line '1' and column '2146' has incorrect segment lengths. A nested resource type must have identical number of segments as its resource name. A root resource type must have segment length one greater than its resource name. Please see https://aka.ms/arm-syntax-resources for usage details.'.
make: *** [Makefile:18: infrastructure] Error 1

To Reproduce
Steps to reproduce the behavior:

  1. az login
  2. make build
  3. make deploy

Expected behavior
Expected to see a successful deploy result.


Desktop (please complete the following information):
OS: Codespace - Debian GNU/Linux 11 (bullseye),
Host OS: Ubuntu 22.04.02 LTS
Browser: Version 114.0.5735.198 (Official Build) (64-bit)

Alpha version details

  • GitHub branch: main
  • Latest commit: [obtained by running git log -n 1 <branchname>]
@fitchtravis ➜ /workspaces/PubSec-Info-Assistant (main) $ git log -n 1 main
commit b5dc8b368758b479b4f94b061b43c4b8ac94cb00 (HEAD -> main, origin/main, origin/HEAD)
Merge: ca99857 797ff84
Author: dayland <[email protected]>
Date:   Fri Jun 23 08:11:16 2023 +0000

    Merge pull request #101 from microsoft/dayland/5707-missing-search-indexer-property
    
    add "allowSkillsetToReadFileData" property to search indexer

Additional context
Happens during a make infrastructure run.

List of CI/CD pipeline parameters is missing values

Describe the bug
When trying to set up an Azure DevOps pipeline using the azdo.yaml pipeline, not all required parameters are documented at https://github.com/microsoft/PubSec-Info-Assistant/blob/main/pipelines/ci-cd%20pipelines.md.

Expected behavior
All pipeline parameters required should be documented.


Desktop (please complete the following information):

  • OS: Windows 11
  • Browser n/a
  • Version 0.3-Gamma

Alpha version details

  • GitHub branch: main
  • Latest commit: 0.3-Gamma

PDF Tables with RowSpan and ColSpan not interpreted correctly

Describe the bug
I have attached a PDF (publicly available) here. On page 3 there is a table for VA pension benefits. For the question "Can you tell me the full eligibility rules for receiving VA pension", the answer returned is incorrect. I have attached a screenshot of the issue.

To Reproduce
Steps to reproduce the behavior:

  1. Install the vNext-Dev version of the Information Assistant using bge embedding model
  2. Upload the attached pdf file
  3. Ask the question "Can you tell me the full eligibility rules for receiving VA pension" and see the error as indicated in the attachment below.

Expected behavior
A clear explanation of eligibility conditions that should match ALL the content on page 3 of the attached document.

Screenshots

Table Parsing

Alpha version details

  • GitHub branch: vNext-Dev

Additional context
summaryofvanationalguardandreserve.pdf

Running the App Locally - Local Development

The app is deployed and working fine. But when I tried to run it locally from the 'backend' folder, it shows an error like this:

Traceback (most recent call last):
  File "/workspaces/PubSec-Info-Assistant/app/backend/app.py", line 58, in <module>
    azure_search_key_credential = AzureKeyCredential(AZURE_SEARCH_SERVICE_KEY)
  File "/home/vscode/.local/lib/python3.10/site-packages/azure/core/credentials.py", line 67, in __init__
    raise TypeError("key must be a string.")
TypeError: key must be a string.

Please note that the infrastructure.env file is created with AZURE_SEARCH_SERVICE_KEY and other infrastructure credentials. Also, on a side note, 'AllMetrics', 'AppServicePlatformLogs' and 'AppServiceAppLogs' were disabled during deployment due to a conflict/error, potentially related to the subscription.

"Error: Access denied due to invalid subscription key or wrong API endpoint."

After provisioning the PubSec Accelerator as per the installation instructions, the chat function doesn't work and returns the error "Error: Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."

I receive the same error when testing a solution deployed into Australia East and also East US.

I am able to log in to the Azure OpenAI Studio using the same instance that was provisioned and use the Chat feature there without any issues.


In the configuration section of the WebApp, the service details are blank by default:
{
    "name": "AZURE_OPENAI_SERVICE",
    "value": "",
    "slotSetting": false
},
{
    "name": "AZURE_OPENAI_SERVICE_KEY",
    "value": "",
    "slotSetting": false
},
These appear to be correct, but when I set them to either the newly created endpoint or an existing endpoint, together with the key value, different errors are displayed when querying the chat:

Error: Error communicating with OpenAI: HTTPSConnectionPool(host='https', port=443): Max retries exceeded with url: //infoasst-aoai-w8ilp.openai.azure.com/.openai.azure.com//openai/deployments/gpt-35-turbo/completions?api-version=2023-06-01-preview (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x761c54bd59f0>: Failed to resolve 'https' ([Errno -2] Name or service not known)"))

Error: The completion operation does not work with the specified model, gpt-35-turbo. Please choose different model and try again. You can learn more about which models can be used with each operation here: https://go.microsoft.com/fwlink/?linkid=2197993.

In both instances above I had used 'What impact does China have on climate change' as the query.

FileFormRecSubmissionPDF - An error occurred - 'Response' object has no attribute 'code'

The FileFormRecSubmissionPDF stage fails to complete and accesses a nonexistent attribute on the HTTP response object.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the Beta branch
  2. Fill in local.env
  3. make deploy
  4. Upload files to upload container
  5. Open CosmosDB data explorer
  6. Wait until status comes through with error on FileFormRecSubmissionPDF stage

Expected behavior
The FileFormRecSubmissionPDF stage should complete successfully, or requeue after calling the Form Recognizer API.


Beta version details

  • GitHub branch: 0.2-Beta
  • Latest commit: commit e477742 (HEAD -> 0.2-Beta, origin/0.2-Beta)

Additional context
Suspected offending line of code:

statusLog.upsert_document(blob_path, f'{function_name} - Error on PDF submission to FR - {response.code} {response.message}', StatusClassification.ERROR, State.ERROR)
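
A possible fix (an assumption, since the exact response type depends on the HTTP client in use; requests and azure-core responses both expose status_code and reason):

# Use commonly available attributes, falling back gracefully if absent.
status = getattr(response, "status_code", "unknown")
reason = getattr(response, "reason", "")
statusLog.upsert_document(
    blob_path,
    f"{function_name} - Error on PDF submission to FR - {status} {reason}",
    StatusClassification.ERROR,
    State.ERROR,
)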

Action required: migrate or opt-out of migration to GitHub inside Microsoft

Migrate non-Open Source or non-External Collaboration repositories to GitHub inside Microsoft

In order to protect and secure Microsoft, private or internal repositories in GitHub for Open Source which are not related to open source projects or which do not require collaboration with 3rd parties (customers, partners, etc.) must be migrated to GitHub inside Microsoft, a.k.a. GitHub Enterprise Cloud with Enterprise Managed Users (GHEC EMU).

Action

✍️ Please RSVP to opt-in or opt-out of the migration to GitHub inside Microsoft.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result in your repository getting automatically archived. 🔒

Instructions

Reply with a comment on this issue containing one of the optin or optout command options below.

✅ Opt-in to migrate

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

OR

❌ Opt-out of migration

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : My project will ship as Open Source
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified


Adding html-files with inline images (base64-encoded) pollutes search index

Describe the bug
I'm trying to add HTML files with images by specifying the image data as base64-encoded in the src attributes of the image tags.

While this works for the document preview, it seems that the image data is included in the search index, where I don't think it will be of any use and will probably cause a lot of noise.

Example of document (cropped) from the search index:

 "translated_text": "{\n ...<img data-sp-prop-name=\\\"imageSource\\\" src=\\\"data:image/png;base64,  iVBORw0KGgoAAAANSUhEUgAAA68AAAIICAYAAACB9tgLAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMA...

I chose this approach as I believe it is what Mammoth does with docx files.
If this is not how images should be supplied, please advise.

To Reproduce
Steps to reproduce the behavior:
Upload HTML with embedded, inline images.

Expected behavior
Images will be shown in the document preview. The search index will be lean and will not contain the base64-encoded data.

Alpha version details

  • GitHub branch: Main / gamma
  • Latest commit: deb6408

