This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Samples and Docs for Azure Data Lake Store and Analytics
Home Page: http://aka.ms/AzureDataLake
License: MIT License
Hi all,
When trying to compile the solution, the following two namespaces cannot be resolved:
using Microsoft.Azure.Subscriptions;
using Microsoft.Azure.Subscriptions.Csm;
I see that some packages were added a few days ago. Unfortunately, the SDK ZIP file does not contain that package:
https://github.com/MicrosoftBigData/AzureDataLake/releases/download/PowerShellSDK_September10/Azure_SDK_DataLakeOnly.zip
Thanks
Damir
Consider the following U-SQL (reflecting the issue we've observed in our production development):
@data =
    SELECT *
    FROM (VALUES
        ("A", (decimal?)2,    "LabelA", "A:1"),
        ("A", (decimal?)null, "LabelA", "A:2"),
        ("A", (decimal?)1,    "LabelA", "A:3"),
        ("B", (decimal?)4,    "LabelB", "B:1")) AS T(Name, Value, Type, Id);

@result =
    SELECT Name,
           SUM(Value) AS Sum,
           Type,
           AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;

OUTPUT @result TO "/res.csv" USING Outputters.Csv(outputHeader:true);
The aggregator in this case is the sample custom aggregator from the U-SQL reference doc (our production code is different, but the problem is demonstrable with the sample UDAGG code):
using Microsoft.Analytics.Interfaces;

namespace AggTest
{
    public class genericAggregator : IAggregate<string, string, string>
    {
        private string AggregatedValue;

        public override void Init()
        {
            AggregatedValue = "";
        }

        public override void Accumulate(string ValueToAgg, string GroupByValue)
        {
            AggregatedValue += ValueToAgg + ",";
        }

        public override string Terminate()
        {
            // remove last comma
            return AggregatedValue.Substring(0, AggregatedValue.Length - 1);
        }
    }
}
When executed, either within an ADLA instance in Azure or using the USQL local run environment within Visual Studio 2017, the result is:
"Name","Sum","Type","RowId"
"A",,"LabelA","A,A,A"
"B",4,"LabelB","B"
The built-in USQL SUM aggregator has returned NULL rather than the expected output of 3 for the row with Name A. Removing the call to the user-defined aggregator returns a rowset with the expected SUM aggregation value of 3:
"Name","Sum","Type"
"A",3,"LabelA"
"B",4,"LabelB"
This is clearly inconsistent behaviour for the SUM aggregate, which shouldn't care whether a UDAGG is included in the processing of the same group.
Interestingly, if the @result query is modified to:
@result =
    SELECT Name,
           AVG(Value) AS Avg,
           SUM(Value) AS Sum,
           Type,
           AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;
Then the output is:
"Name","Avg","Sum","Type","RowId"
"A",1.5,3,"LabelA","A,A,A"
"B",4,4,"LabelB","B"
In this case the introduction of the AVG aggregator both produces the expected average and coaxes the SUM aggregator into producing the correct answer!
At present we've implemented two workarounds: coalescing the NULLs away, e.g.
SUM(Value ?? 0.0m) AS Sum
or adding a redundant AVG over the same column, as shown above. Obviously neither workaround is ideal, as we may care whether the result of a SUM aggregate is actually NULL when there were no non-NULL values to sum in the group.
We have an ADLS Gen2 file system with several folders. Down to the bottom leaf folder, the owner and mask have r | w | e permissions.
However, when a file is landed in that folder, the execute permission is lost, causing a warning that access permissions for other entities are beyond the bounds of the mask. It is unclear why the owner and mask are not inheriting execute permissions on the file.
Why are the mask and owner not inheriting execute permissions? In many cases (such as when a folder is generated), this locks down files.
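One thing worth checking is whether the parent directory carries default ACL entries, since only the "default:" entries are applied to children created afterwards. A sketch using the Azure CLI (the account name, filesystem, and path below are placeholders):

```shell
# 'myaccount', 'myfs' and 'raw/landing' are placeholders.
# The base entries apply to the directory itself; the 'default:' entries
# are what newly created children pick up at creation time.
az storage fs access set \
  --account-name myaccount -f myfs --path raw/landing \
  --acl "user::rwx,group::rwx,other::---,default:user::rwx,default:group::rwx,default:other::---"
```

This is only a sketch of the mechanism, not a confirmed fix for the warning you are seeing.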
Any chance that you will be providing NuGet packages for the core Microsoft.Analytics.xxx assemblies, rather than having to install tooling to get at them?
I would very much like a super easy way to get the first 10 lines or the last 10 lines of a file in data lake from PowerShell, similar to the unix head and tail commands.
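For ADLS Gen1, something close to this already exists (a sketch assuming the Az.DataLakeStore module is installed; the account name and file path are placeholders):

```powershell
# First 10 rows of a file, similar to unix head:
Get-AzDataLakeStoreItemContent -Account "myadls" -Path "/Samples/Data/SearchLog.tsv" -Head 10

# Last 10 rows of a file, similar to unix tail:
Get-AzDataLakeStoreItemContent -Account "myadls" -Path "/Samples/Data/SearchLog.tsv" -Tail 10
```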
Hi,
Today I've downloaded the latest release from the GitHub repository, only to notice that some cmdlets from the getting-started guide are failing on my machine and my colleagues' machines; i.e.
Register-AzureProvider -ProviderNamespace "Microsoft.DataLake"
and Get-AzureResourceGroup
are no longer recognized.
I've removed the official Azure PowerShell cmdlets and have only the ADL release installed.
Is there something I'm doing wrong?
I recently refreshed my ADL tools on my local environment, but I'm having issues with the built-in outputter and headers. No headers are printed with the data. I ran the same script in the cloud environment and headers were successfully published.
I've tried removing and reinstalling the ADL tools several times, but still no luck. Is this a known issue with the Local Run SDK?
I have hundreds of devices sending messages to IoT Hub, and I am trying to use Data Lake to process all these messages.
All the articles out there on the internet show uploading CSV files for processing. Is converting the JSON messages to CSV files a must before they can be processed by the Data Lake engine? Can't I process all the incoming JSON telemetry directly in Azure Data Lake?
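Converting to CSV is not required. As a sketch (assuming the U-SQL sample format assemblies Microsoft.Analytics.Samples.Formats and Newtonsoft.Json have been registered in the database; the input path and column names are placeholders for whatever the telemetry contains), JSON files can be extracted directly:

```usql
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

// Placeholder columns; adjust to the shape of the device messages.
@telemetry =
    EXTRACT deviceId string,
            temperature double,
            eventTime string
    FROM "/iothub/{*}.json"
    USING new JsonExtractor();

OUTPUT @telemetry
TO "/output/telemetry.csv"
USING Outputters.Csv();
```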
The following input stream does not exist: c:\Users\mwstevens\AppData\Local\USQLDataRoot\Samples\Data\Searchlog.tsv
How do I load the Manifest Package?
Output from Deployment:
Deploying local : DataRoot = C:\Users\mwstevens\source\repos\USQLSampleApplication1\bin\Debug\DataRoot
Validating package
Initialize : Full path to package 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Initialize : 'C:\Users\mwstevens\AppData\Local\USQLDataRoot' is a directory
Initialize : Loading package manifest 'C:\Users\mwstevens\AppData\Local\USQLDataRoot\manifest.xml'
Initialize : no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Initialize : Microsoft.Analytics.CICD.Deploy.PackageInitException: no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
at Microsoft.Analytics.CICD.Deploy.Package.Initialize(String pathToPackage, CancellationToken cancelToken, String workingFolder)
Deployment failed with no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Microsoft.Analytics.CICD.Deploy.PackageInitException: no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
at Microsoft.Analytics.CICD.Deploy.Package.Initialize(String pathToPackage, CancellationToken cancelToken, String workingFolder)
at Microsoft.Analytics.CICD.Deploy.Program.DoLocalDeployment()
at Microsoft.Analytics.CICD.Deploy.Program.Main(String[] args)
I have a requirement to crawl all ADLS (gen1) folders and ingest into a metadata DB for building a search solution. I am looking for the recommended and efficient way for crawling the ADLS folders. Can you please provide me some recommendations?
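For the crawl itself, a breadth-first enumeration of the folder tree is usually the efficient pattern, batching the results into the metadata DB as you go. A minimal sketch of the traversal logic, with the listing call abstracted behind a callback (with the azure-datalake-store package you would pass a thin wrapper around AzureDLFileSystem.ls(); the local stand-in below is purely for illustration):

```python
from collections import deque

def crawl(list_fn, root="/"):
    """Breadth-first crawl of a hierarchical store.

    list_fn(path) must return a list of (path, is_directory) tuples,
    e.g. a wrapper around the ADLS listing API (names here are placeholders).
    Returns every path found under root.
    """
    queue = deque([root])
    entries = []
    while queue:
        current = queue.popleft()
        for path, is_dir in list_fn(current):
            entries.append(path)       # in production, batch-insert into the metadata DB
            if is_dir:
                queue.append(path)     # descend into subfolders
    return entries

# Local stand-in for the ADLS listing call, used for illustration only:
tree = {
    "/": [("/raw", True), ("/readme.txt", False)],
    "/raw": [("/raw/2019", True)],
    "/raw/2019": [("/raw/2019/a.csv", False)],
}
print(crawl(lambda p: tree.get(p, [])))
# → ['/raw', '/readme.txt', '/raw/2019', '/raw/2019/a.csv']
```

Batching and parallelising the listing calls per directory is the main lever for speed at scale; the traversal order above is otherwise a standard BFS.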
Hello,
I have an Azure Function that is triggered when a blob is uploaded to a nested directory within a Data Lake Gen2 storage container.
How is this done?
Tried so far:
It appears SAS tokens can only be generated at the first level of the DataLake (container level).
It appears Access Control Lists do not handle this scenario either.
How do we scope permissions to a DataLake namespace in a way compatible with Azure Functions?
Thank you
Hi! I see that there's a Utils.java class that has several useful upload methods in it. Do you think it'd be possible to make the constructor of that class public so that we can use Java code to upload/download files? Thank you!
When reviewing the doc here: https://github.com/MicrosoftBigData/AzureDataLake/blob/master/docs/PowerShell/UserManual.md
it says that Import/Export-AzureDataLakeItem are used for upload and download. In Azure Storage, I use Set/Get-AzureStorageBlobContent and do not use import/export. Is there a reason not to be consistent there?
Suppose I am processing all the CSV files in Data Lake and storing the results back in Data Lake in a structured or unstructured way; how do I then expose these data to end users?
I mean, do I have to write a Web API and link it to Azure Data Lake databases?
I just created a new analytics service and attached it to my store.
As a result I noticed an immediate jump in space consumption (up to 3GB).
Iterating over the store, I see things like this,
That's like 1GB duplicated.
What would happen if I had more databases, further duplication?
Seems like a bug?
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
I have a question related to the GDPR requirement to delete user data from the data lake when a user requests account deletion. Currently we are storing user data for analytics in Azure Data Lake with the following configuration:
We are using a de-identified data lake approach to address data privacy challenges, de-identifying and protecting sensitive information before it even enters the data lake and minimizing the storage and use of personally identifiable information. So before storing data in the data lake, we mask it with a random ID. Is it still required to delete the non-personally-identifiable information from the data lake to be compliant with GDPR? If so, is there an efficient way to delete user-specific data from the data lake, given that Azure Data Lake Store is an append-only file system where data, once committed, cannot be erased or updated?
Please let me know if you need any further information.
Thanks a lot for your help in advance.