
azuredirectory's People

Contributors

blackbatsoftware, counsellorben, dhaneyim3, maartenba, paulirwin, paviad, piedone, pmgrove, richorama, yohsii


azuredirectory's Issues

StorageException with HttpStatusCode 412 should translate to LockObtainFailedException

At the moment, if an IndexWriter cannot obtain a lock on an AzureDirectory, this is typically signaled as a Microsoft.WindowsAzure.Storage.StorageException. The AzureLock class passes this through as-is; in other words, AzureLock.Obtain() propagates the StorageException to the caller.

However, the implementation of Obtain(long lockWaitTimeout) in the base class Lock does not expect an exception to be thrown by the abstract, parameterless method Lock.Obtain(). It expects Obtain() to return a boolean indicating whether the lock was successfully obtained. If the return value is false, Lock.Obtain(long lockWaitTimeout) throws a LockObtainFailedException.

The implementation of AzureLock should not propagate the StorageException when a lock cannot be obtained. Instead it should return false to the caller, i.e. Lock.Obtain(long lockWaitTimeout). The caller would then behave exactly the same as with any Directory other than AzureDirectory, e.g. SimpleFSDirectory.

This could potentially be implemented by checking for HttpStatusCode 412 in AzureLock._handleException(...) or, alternatively, in the catch block of AzureLock.Obtain().
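
A minimal sketch of the second option, assuming the existing lease-acquisition logic stays as-is (this is an illustration of returning false on 412, not the actual AzureLock code):

    public override bool Obtain()
    {
        try
        {
            // ... existing lease acquisition against the lock blob ...
            return true;
        }
        catch (StorageException ex)
        {
            // 412 Precondition Failed means another writer currently holds the lease.
            // Report "not obtained" instead of letting the exception escape, so that
            // Lock.Obtain(long lockWaitTimeout) can retry and ultimately throw a
            // LockObtainFailedException, just like other Directory implementations.
            if (ex.RequestInformation.HttpStatusCode == 412)
                return false;
            throw;
        }
    }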

This code could potentially be used as a test when run inside of Parallel.ForEach() with MaxDegreeOfParallelism set to at least 2.
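
For reference, the harness for such a test might look like the sketch below; the body that opens the competing IndexWriter is whatever code the issue refers to and is only indicated by a comment:

    Parallel.ForEach(
        Enumerable.Range(0, 4),
        new ParallelOptions { MaxDegreeOfParallelism = 2 },
        i =>
        {
            // open an IndexWriter against the same AzureDirectory here;
            // with two or more parallel attempts, only one can hold the write lock
        });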

Include the XML comments with the DLL

Currently the NuGet package only includes the DLL. Including the XML documentation file, which contains the inline comments, would be better. The license is also missing from the package.

Help for Mongo Implementation

Hi @richorama

It seems we share a lot of our tech stack. I have been working on an improvement to an indexer solution based on Orleans, and then I found this project. I took the idea of implementing a custom directory and "adapted" your code to MongoDB:

https://github.com/Squidex/squidex/tree/text-indexer/backend/src/Squidex.Domain.Apps.Entities.MongoDb/FullText

I have not done much around locking, as the code is only used in a grain. But if you have time it would be great if you could review it, perhaps as part of this PR: Squidex/squidex#454

Subfolders in directory structure break the search index

I found some strange behavior using subfolders.
My structure has a single container with many subfolders, one for each user.
The first time the subfolders are created and filled, everything works fine.
But once reloaded, initializing the AzureDirectory again, the index is empty.
This doesn't happen when I use the root folder instead.
I checked the paths in the lib but everything looks correct.

Any suggestion?

thank you.

Add the ability to set root folder

It's nice that you can set the container where the index is stored, but it would be better if you could also set the root folder of the index inside that container, i.e. add the ability to store multiple indices in the same container. This is needed for scenarios where you have many smaller indices and don't want a separate container for each of them.
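
For illustration, the requested usage might look roughly like this (the rootFolder parameter and the variable names are hypothetical; the point is simply that two indices share one container under different prefixes):

    var indexA = new AzureDirectory(cloudAccount, "search-indexes", cacheDirectoryA, rootFolder: "customers/a/");
    var indexB = new AzureDirectory(cloudAccount, "search-indexes", cacheDirectoryB, rootFolder: "customers/b/");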

There is currently a lease on the blob and no lease ID was specified in the request.

I've just updated my code with 270e041 to fix #20. Now I'm getting the following error:

The remote server returned an error: (412) There is currently a lease on the blob and no lease ID was specified in the request.

You can also get this error by running https://github.com/azure-contrib/AzureDirectory/blob/master/TestApp/Program.cs multiple times with the following app setting enabled: https://github.com/azure-contrib/AzureDirectory/blob/master/TestApp/app.config#L6

Move license back to MS-PL

The original version of this repo (before I forked and maintained it) was licensed under MS-PL.

This code is currently MIT.

I have been asked if I could move it back to MS-PL to enable a customer to use it, as it's ambiguous whether the MS-PL license permits a fork to switch to MIT.

Does anyone have any thoughts or objections?

Default Cache should not be file based

At present if the cacheDirectory parameter of the AzureDirectory constructor is null, AzureDirectory will default to creating a folder in the local file system.

While this may be fine for local development and for some deployment scenarios, it doesn't work if you deploy as an app service to Azure. You don't necessarily have write permissions.

I'd like to suggest that the default could be RAMDirectory() instead. Alternatively, AzureDirectory should not make an assumption that write permission exists for the local file system. Instead AzureDirectory should require the explicit specification of a cache directory object.

In our case we experienced a short, avoidable downtime in one of our production deployments. We resolved it by passing "new RAMDirectory()" as the value of the cacheDirectory parameter of the AzureDirectory constructor.
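
For anyone hitting the same problem, the workaround is simply to pass the cache directory explicitly; a minimal sketch (the connection string and container name are placeholders):

    var cloudAccount = CloudStorageAccount.Parse(connectionString);
    // Explicit in-memory cache; nothing is written to the local file system.
    var azureDirectory = new AzureDirectory(cloudAccount, "myindex", new RAMDirectory());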

The remote server returned an error: (404) Not Found. at blob.FetchAttributes();

Hello,

The error message is the same as in #20, but now the error happens at another line.

blob.FetchAttributes();

Here is the stack trace:

The remote server returned an error: (404) Not Found.

Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

Exception Details: System.Net.WebException: The remote server returned an error: (404) Not Found.

Source Error:

Line 250: {
Line 251: var blob = _blobContainer.GetBlockBlobReference(_rootFolder + name);
Line 252: blob.FetchAttributes();
Line 253: return new AzureIndexInput(this, blob);
Line 254: }

Source File: d:\data\inetpub\new-heroes\Sources\Project.Web.Core\ExamineAzure\AzureDirectory.cs Line: 252

Stack Trace:

[WebException: The remote server returned an error: (404) Not Found.]
System.Net.HttpWebRequest.GetResponse() +1740
Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteSync(RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext) +1123

[StorageException: The remote server returned an error: (404) Not Found.]
Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteSync(RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext) +2644
Examine.Directory.AzureDirectory.AzureDirectory.OpenInput(String name) in d:\data\inetpub\new-heroes\Sources\Project.Web.Core\ExamineAzure\AzureDirectory.cs:252

[FileNotFoundException: segments_3]
Lucene.Net.Index.FindSegmentsFile.Run(IndexCommit commit) +1494
Lucene.Net.Index.DirectoryReader.Open(Directory directory, IndexDeletionPolicy deletionPolicy, IndexCommit commit, Boolean readOnly, Int32 termInfosIndexDivisor) +87
UmbracoExamine.UmbracoExamineSearcher.OpenNewReader() +67
Examine.LuceneEngine.Providers.LuceneSearcher.ValidateSearcher(Boolean forceReopen) in X:\Projects\Examine\Examine\Projects\Examine\LuceneEngine\Providers\LuceneSearcher.cs:288

I hope it's just as easy to fix as the last time ;-).

I've also reported it here: https://our.umbraco.org/forum/extending-umbraco-and-using-the-api/78818-using-azuredirectory-with-examine#comment-252215

AzureLock Needs to Handle 409

This bit me at work today and I fixed it easily; however, you should update your code. Azure will sometimes return a 409 when Lucene tries to obtain the lock, which will time out the lock.

Current code:

    private bool _handleWebException(ICloudBlob blob, StorageException err)
    {
        if (err.RequestInformation.HttpStatusCode == 404)
        {
            _azureDirectory.CreateContainer();
            using (var stream = new MemoryStream())
            using (var writer = new StreamWriter(stream))
            {
                writer.Write(_lockFile);
                blob.UploadFromStream(stream);
            }
            return true;
        }
        return false;
    }

Fixed (and tested) code:

    private bool _handleWebException(ICloudBlob blob, StorageException err)
    {
        if (err.RequestInformation.HttpStatusCode == 404 || err.RequestInformation.HttpStatusCode == 409)
        {
            _azureDirectory.CreateContainer();
            using (var stream = new MemoryStream())
            using (var writer = new StreamWriter(stream))
            {
                writer.Write(_lockFile);
                blob.UploadFromStream(stream);
            }
            return true;
        }
        return false;
    }

_name is always null when calling GrabMutex in AzureIndexOutput

This seems a bit odd, and maybe it's by design, but in the ctor of AzureIndexOutput the first line is:

_fileMutex = BlobMutexManager.GrabMutex(_name); 

but _name will always be null at that point because it hasn't been initialized yet. Is this intended?
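
To illustrate the ordering issue (a rough sketch, not the actual constructor; only the GrabMutex line is quoted from the source):

    public AzureIndexOutput(/* ... */)
    {
        _fileMutex = BlobMutexManager.GrabMutex(_name); // _name is still null at this point
        // ... _name is only assigned further down the constructor,
        // so the mutex is effectively grabbed under a null name
    }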

DeleteFile usage is incorrect

Hi,

I have a port of this in my project, Examine (https://github.com/shazwazza/examine). I've discovered that the logic in the DeleteFile method is incorrect because of how Lucene deals with files (it's pretty odd).

DeleteFile is called by Lucene's IndexFileDeleter, which expects an IOException to be thrown when a file cannot be deleted. This will actually happen in AzureDirectory when it does _cacheDirectory.DeleteFile(name), because that file may still be in use by a reader/searcher. Lucene will then retry the deletion when it needs to refresh readers/searchers, BUT since the file will already have been successfully removed from blob storage, it will never retry removing it from local storage, because the FileExists method will tell Lucene the file no longer exists (it doesn't exist in blob storage).

This means that local storage gets fuller and fuller because old index files are never actually deleted. This is what the method should look like:

public override void DeleteFile(System.String name)
{
    //We're going to try to remove this from the cache directory first,
    // because the IndexFileDeleter will call this method to remove files,
    // but since some files will still be in use, it will retry when a reader/searcher
    // is refreshed until the file is no longer locked. So we need to try to remove
    // from local storage first and if it fails, let it keep throwing the IOException
    // since that is what Lucene is expecting in order for it to retry.
    //If we remove the main storage file first, then this will never retry to clean out
    // local storage because the FileExist method will always return false.
    try
    {
        if (_cacheDirectory.FileExists(name + ".blob"))
        {
            _cacheDirectory.DeleteFile(name + ".blob");
        }

        if (_cacheDirectory.FileExists(name))
        {
            _cacheDirectory.DeleteFile(name);
        }
    }
    catch (IOException)
    {
        //This will occur because this file is locked. When this is the case, we don't really want to delete it from the master either, because
        // if we do that then this file will never get removed from the cache folder either! This is based on the Deletion Policy which the
        // IndexFileDeleter uses. We could implement our own one of those to deal with this scenario too, but it seems the easiest way is to just
        // let this throw so Lucene will retry when it can, and when that is successful we'll also clear it from the master
        throw;
    }

    //if we've made it this far then the cache directory file has been successfully removed, so now we'll do the master
            
    var blob = _blobContainer.GetBlockBlobReference(_rootFolder + name);
    blob.DeleteIfExists();
    Debug.WriteLine(String.Format("DELETE {0}/{1}", _blobContainer.Uri.ToString(), name));
}

Speed of adding a document to the Azure directory

Hi. I have some code:
var cloudAccount = CloudStorageAccount.Parse("someAccount");
var cacheDirectory = new RAMDirectory();
var azureDirectory = new AzureDirectory(cloudAccount, "fulltextcloudindex", cacheDirectory);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

var indexWriter = new IndexWriter(azureDirectory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
var luceneDataProvider = new LuceneDataProvider(azureDirectory, Version.LUCENE_30, indexWriter);

using (var sessionArticle = luceneDataProvider.OpenSession())
{
    sessionArticle.Add(entity);
}

When I step out of the using block, it takes about 10 seconds. That's really slow. Or maybe I need to do it in another way?

The output:
COMPRESSED 201 -> 168 83.58209% to _2g.fdt
PUT 168 bytes to _2g.fdt in cloud
CLOSED WRITESTREAM _2g.fdt
COMPRESSED 12 -> 9 75% to _2g.fdx
PUT 9 bytes to _2g.fdx in cloud
CLOSED WRITESTREAM _2g.fdx
COMPRESSED 348 -> 230 66.09196% to _2g.tis
PUT 230 bytes to _2g.tis in cloud
CLOSED WRITESTREAM _2g.tis
COMPRESSED 35 -> 26 74.28571% to _2g.tii
PUT 26 bytes to _2g.tii in cloud
CLOSED WRITESTREAM _2g.tii
COMPRESSED 28 -> 12 42.85714% to _2g.frq
AzureLock:Renew(write.lock : fe28a1f6-2597-4383-85f8-5efe27c23534
PUT 12 bytes to _2g.frq in cloud
CLOSED WRITESTREAM _2g.frq
COMPRESSED 20 -> 17 85% to _2g.prx
PUT 17 bytes to _2g.prx in cloud
CLOSED WRITESTREAM _2g.prx
COMPRESSED 16 -> 12 75% to _2g.nrm
PUT 12 bytes to _2g.nrm in cloud
CLOSED WRITESTREAM _2g.nrm
PUT 155 bytes to _2g.fnm in cloud
CLOSED WRITESTREAM _2g.fnm
opening _2g.fdt
Using cached file for _2g.fdt
CLOSED READSTREAM local _2g.fdt
opening _2g.fdx
Using cached file for _2g.fdx
CLOSED READSTREAM local _2g.fdx
opening _2g.tis
Using cached file for _2g.tis
CLOSED READSTREAM local _2g.tis
opening _2g.tii
Using cached file for _2g.tii
CLOSED READSTREAM local _2g.tii
opening _2g.frq
Using cached file for _2g.frq
CLOSED READSTREAM local _2g.frq
opening _2g.prx
Using cached file for _2g.prx
CLOSED READSTREAM local _2g.prx
opening _2g.nrm
Using cached file for _2g.nrm
CLOSED READSTREAM local _2g.nrm
opening _2g.fnm
Using cached file for _2g.fnm
CLOSED READSTREAM local _2g.fnm
COMPRESSED 944 -> 528 55.93221% to _2g.cfs
PUT 528 bytes to _2g.cfs in cloud
CLOSED WRITESTREAM _2g.cfs
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.fnm
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.frq
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.prx
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.tis
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.tii
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.nrm
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.fdx
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2g.fdt
opening _2g.cfs
Using cached file for _2g.cfs
Creating clone for _2g.cfs
CLOSED READSTREAM local
Creating clone for _2g.cfs
Creating clone for _2g.cfs
CLOSED READSTREAM local
Creating clone for _2g.cfs
Creating clone for _2g.cfs
Creating clone for _2g.cfs
Creating clone for
CLOSED READSTREAM local
Creating clone for
Creating clone for
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local _2g.cfs
AzureLock:Renew(write.lock : fe28a1f6-2597-4383-85f8-5efe27c23534
PUT 212 bytes to segments_2x in cloud
CLOSED WRITESTREAM segments_2x
PUT 20 bytes to segments.gen in cloud
CLOSED WRITESTREAM segments.gen
DELETE https://company.blob.core.windows.net/fulltextcloudindex/segments_2w
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2d.cfs
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2d_1.del
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2e.cfs
DELETE https://company.blob.core.windows.net/fulltextcloudindex/_2f.cfs
opening segments.gen
Using cached file for segments.gen
CLOSED READSTREAM local segments.gen
opening segments_2x
Using cached file for segments_2x
CLOSED READSTREAM local segments_2x
opening _2g.cfs
Using cached file for _2g.cfs
Creating clone for _2g.cfs
CLOSED READSTREAM local
Creating clone for _2g.cfs
Creating clone for _2g.cfs
CLOSED READSTREAM local
Creating clone for _2g.cfs
Creating clone for _2g.cfs
Creating clone for _2g.cfs
Creating clone for _2g.cfs
Creating clone for
Creating clone for
Creating clone for _2g.cfs
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local _2d.cfs
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local _2e.cfs
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local
CLOSED READSTREAM local _2f.cfs

Spellchecker failure?

Hello All,

Has anybody successfully created a spellchecker index on Azure with the index stored in the Azure directory itself?

I am using this class to create the spell checker index but it hangs and never completes. On the other hand, the same code works with a local directory.

SpellChecker.Net.Search.Spell.SpellChecker

Best Regards and Thanks in Advance,

Support .NET 4.0

Hi,
Is it possible to add support for .NET 4.0?

In our case we are using Umbraco 6.* and we can't migrate our solution to 4.5.

I checked, and if Mutex.TryOpenExisting is replaced with a custom implementation, it should be possible to run under 4.0.
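
A minimal sketch of such a custom replacement, built on the Mutex.OpenExisting API that already exists in .NET 4.0 (the helper name is made up):

    // Hypothetical 4.0-compatible stand-in for Mutex.TryOpenExisting (which was added in .NET 4.5).
    static bool TryOpenExistingMutex(string name, out Mutex mutex)
    {
        try
        {
            mutex = Mutex.OpenExisting(name);
            return true;
        }
        catch (WaitHandleCannotBeOpenedException)
        {
            // No mutex with this name exists yet.
            mutex = null;
            return false;
        }
    }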

Best regards,
Alexey Badyl

Get rid of exceptions that are thrown during normal operation

First of all, thank you for taking care of the project!

I see that many exceptions are thrown during normal operation too. Many are related to blob handling when creating an index; more problematic, however, because it happens far more often, is that BlobMutexManager also operates via exceptions. It causes about 6 exceptions per search query.

Couldn't this class be modified to use Mutex.TryOpenExisting() instead?
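
For illustration, the suggested pattern might look roughly like this (the mutex name and creation logic are assumptions, not the actual BlobMutexManager code):

    Mutex mutex;
    if (!Mutex.TryOpenExisting(mutexName, out mutex))
    {
        // Unlike OpenExisting, no exception is thrown for the "not found" case;
        // just create the mutex when it does not exist yet.
        mutex = new Mutex(false, mutexName);
    }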

Thank you.

IsLocked doesn't work properly

The AzureLock.IsLocked method returns an incorrect result:
if the directory is locked it returns FALSE, and if it is not locked it returns TRUE.

Looks like this line has the wrong condition:
https://github.com/azure-contrib/AzureDirectory/blob/master/AzureDirectory/AzureLock.cs#L43

See the example code below, which produces the following console output:

//Console output
Writer created.
IsLocked: False
Writer disposed
IsLocked: True
Writer 2 created.
IsLocked: False
Directory unlocked
IsLocked: True

//Example code
var cloudAccount = GetStorageAccount();

var localCache = new DirectoryInfo("c:\\isLockedTest.txt");
var azureContainer = "azurelocktest";

var cacheDirectory = new SimpleFSDirectory(localCache);
var azureDirectory =
    new AzureDirectory(cloudAccount, azureContainer, cacheDirectory);

var analyzer = new StandardAnalyzer(Version.LUCENE_30);

var indexWriter = new IndexWriter(azureDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

Console.WriteLine("Writer 2 created.");
Console.WriteLine($"IsLocked: {IndexWriter.IsLocked(azureDirectory)}");
indexWriter.Dispose();
Console.WriteLine("Writer disposed");
Console.WriteLine($"IsLocked: {IndexWriter.IsLocked(azureDirectory)}");         

indexWriter = new IndexWriter(azureDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

Console.WriteLine("Writer 2 created.");
Console.WriteLine($"IsLocked: {IndexWriter.IsLocked(azureDirectory)}");
IndexWriter.Unlock(azureDirectory);
Console.WriteLine("Directory unlocked");
Console.WriteLine($"IsLocked: {IndexWriter.IsLocked(azureDirectory)}");

CreateIfNotExists fails in AzureDirectory

There are versions of WindowsAzure.Storage that do not implement the "CreateIfNotExists" function synchronously; they only provide the "CreateIfNotExistsAsync" method.

AzureDirectory.cs should have:

public void CreateContainer()
{
  _blobContainer = _blobClient.GetContainerReference(_containerName);
  _blobContainer.CreateIfNotExistsAsync().Wait();
}

Don't require a specific version of the Azure libraries

If you add AzureDirectory to a project already using the Azure libs (like WindowsAzure.Storage), then you can't use the AzureDirectory package and have to compile it yourself, because mismatching DLLs cause an error. Just a few days ago a new WindowsAzure.Storage version (3.0.3) came out, but AzureDirectory still uses the previous version.
