azuredatalake's Issues

USQL SUM aggregator returns null when used with user-defined aggregators

Consider the following USQL (reflecting the issue we've observed in our production development):

@data = 
    SELECT
        *
    FROM (VALUES
        ("A",(decimal?)2,   "LabelA","A:1"),
        ("A",(decimal?)null,"LabelA","A:2"),
        ("A",(decimal?)1,   "LabelA","A:3"),
        ("B",(decimal?)4,   "LabelB","B:1")) AS T(Name, Value, Type, Id);

@result = 
    SELECT 
        Name,
        SUM(Value) AS Sum,
        Type,
        AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;

OUTPUT @result TO "/res.csv" USING Outputters.Csv(outputHeader:true);

The aggregator in this case is the sample custom aggregator from the USQL reference doc (our production code is different, but the problem is demonstrable with the sample UDAGG code):

using Microsoft.Analytics.Interfaces;

namespace AggTest
{
    public class genericAggregator : IAggregate<string, string, string>
    {
        string AggregatedValue;

        public override void Init()
        {
            AggregatedValue = "";
        }

        public override void Accumulate(string ValueToAgg, string GroupByValue)
        {
            AggregatedValue += ValueToAgg + ",";
        }

        public override string Terminate()
        {
            // remove last comma
            return AggregatedValue.Substring(0, AggregatedValue.Length - 1);
        }
    }
}

When executed, either within an ADLA instance in Azure or using the USQL local run environment within Visual Studio 2017, the result is:

"Name","Sum","Type","RowId"
"A",,"LabelA","A,A,A"
"B",4,"LabelB","B"

The built-in USQL SUM aggregator has returned NULL rather than the expected output of 3 for the row with Name A. Removing the call to the user-defined aggregator returns a rowset with the expected SUM aggregation value of 3:

"Name","Sum","Type"
"A",3,"LabelA"
"B",4,"LabelB"

This is clearly inconsistent behaviour for the SUM aggregate, which shouldn't care whether a UDAGG is included in the processing of the same group.
Interestingly, if the @result query is modified to:

@result = 
    SELECT 
        Name,
        AVG(Value) AS Avg,
        SUM(Value) AS Sum,
        Type,
        AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;

Then the output is:

"Name","Avg","Sum","Type","RowId"
"A",1.5,3,"LabelA","A,A,A"
"B",4,4,"LabelB","B"

In this case, introducing the AVG aggregator both produces the expected average and coaxes the SUM aggregator into producing the correct answer!
At present we've implemented two workarounds:

  1. Use the null coalescing operator within the SUM, i.e.:
SUM(Value ?? 0.0m) AS Sum
  2. Filter the rowset so that NULLs in the field to sum aren't included in the group. The disadvantage is that if more than one SUM aggregator is used in a single query and each of the fields may contain NULLs, this adds complication.

Obviously neither workaround is ideal, as we may care that the result of a SUM aggregate is actually NULL if there were no non-NULL values to sum in the group.
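The trade-off behind the first workaround can be illustrated outside U-SQL. The following is a minimal Python sketch, not U-SQL code. It assumes the SQL-style behaviour this issue expects, where SUM skips NULLs and yields NULL only when a group has no non-NULL values. The function names and the all-NULL group are hypothetical.

```python
from decimal import Decimal
from typing import Iterable, Optional


def sum_skipping_nulls(values: Iterable[Optional[Decimal]]) -> Optional[Decimal]:
    """Expected SUM semantics: ignore NULLs, return None if nothing is left."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return sum(non_null, Decimal(0))


def sum_with_coalesce(values: Iterable[Optional[Decimal]]) -> Decimal:
    """Workaround 1, SUM(Value ?? 0.0m): an all-NULL group becomes 0."""
    return sum((v if v is not None else Decimal(0) for v in values), Decimal(0))


group_a = [Decimal(2), None, Decimal(1)]  # the "A" rows from @data
all_null = [None, None]                   # hypothetical group with no values

print(sum_skipping_nulls(group_a), sum_with_coalesce(group_a))    # 3 3
print(sum_skipping_nulls(all_null), sum_with_coalesce(all_null))  # None 0
```

The two functions agree on the "A" group but differ on the all-NULL group, which is the case where the coalescing workaround hides a NULL result.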

File not inheriting execute permissions

We have an ADLS Gen2 file system with several folders. All the way down to the bottom leaf folder, the owner and mask have read, write, and execute permissions.

image

However, when a file lands in that folder, the execute permission is lost, which causes a warning that access permissions for other entities are beyond the bounds of the mask. It is unclear why the owner and mask are not inheriting execute permissions on the file.

image

Why are the mask and owner not inheriting execute permissions? In many cases (such as when a folder is generated) this locks down files.
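The "beyond the bounds of the mask" warning can be reasoned about with a small sketch. It assumes POSIX-style ACL semantics, in which the effective permission of a named user or group entry is the bitwise AND of that entry and the mask. It does not explain why execute is not inherited on the file. The helper names and the `r-x` entry are hypothetical.

```python
def effective_permissions(entry: str, mask: str) -> str:
    """Return the rwx bits of `entry` that survive the mask (bitwise AND)."""
    return "".join(e if e == m and e != "-" else "-" for e, m in zip(entry, mask))


def exceeds_mask(entry: str, mask: str) -> bool:
    """True when the entry grants a bit that the mask does not allow."""
    return effective_permissions(entry, mask) != entry


folder_mask = "rwx"  # mask on the folder, as described in the issue
file_mask = "rw-"    # mask on the new file after execute is lost
named_user = "r-x"   # hypothetical ACL entry for another user or group

print(effective_permissions(named_user, folder_mask))  # r-x
print(effective_permissions(named_user, file_mask))    # r--
print(exceeds_mask(named_user, file_mask))             # True
```

Under these assumptions, an entry that is fully effective on the folder is trimmed on the file once the mask loses its execute bit, which matches the kind of warning described above.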

Issue with PoSH cmdlets in 1st of July release?

Hi,

Today I downloaded the latest release from the GitHub repository and noticed that some cmdlets are failing on my machine and on my colleagues' machines. For example, Register-AzureProvider -ProviderNamespace "Microsoft.DataLake" and Get-AzureResourceGroup from the getting-started guide are no longer recognized.

PowerShell errors

I've removed the official Azure PowerShell cmdlets and have only the ADL release installed.

Is there something I'm doing wrong?

outputHeader only works in cloud

I recently refreshed my ADL tools on my local environment, but I'm having issues with the built-in outputter and headers. No headers are printed with the data. I ran the same script in the cloud environment and headers were successfully published.

I've tried removing and reinstalling the ADL tools several times, but still no luck. Is this a known issue with the Local Run SDK?

How do I process the telemetry JSON messages in Azure Data Lake?

I have hundreds of devices sending messages to IoT Hub, and I am trying to use Data Lake to process all these messages.

All the articles on the internet show CSV files being uploaded for processing. Is converting the JSON messages to CSV files a must before they can be processed by the Data Lake engine? Can't I process all the incoming JSON telemetry directly in Azure Data Lake?
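Whether conversion is required is not answered in this thread. If a pre-processing step were chosen anyway, flattening newline-delimited JSON into CSV could look like the sketch below. It uses only the Python standard library. The message fields (`deviceId`, `temperature`, `ts`) and the function name are hypothetical and do not come from the issue.

```python
import csv
import io
import json


def json_lines_to_csv(lines, fieldnames):
    """Convert newline-delimited JSON messages to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        writer.writerow(json.loads(line))
    return buffer.getvalue()


# Hypothetical telemetry messages; real IoT Hub payloads will differ.
messages = [
    '{"deviceId": "dev-1", "temperature": 21.5, "ts": "2020-01-01T00:00:00Z"}',
    '{"deviceId": "dev-2", "temperature": 19.0, "ts": "2020-01-01T00:00:05Z"}',
]
print(json_lines_to_csv(messages, ["deviceId", "temperature", "ts"]))
```

Fields not listed in `fieldnames` are dropped and missing fields are written as empty cells, so nested or irregular payloads would need extra handling.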

[email protected]

The following input stream does not exist: c:\Users\mwstevens\AppData\Local\USQLDataRoot\Samples\Data\Searchlog.tsv
How do I load the manifest package?
Output from deployment:

Deploying local : DataRoot = C:\Users\mwstevens\source\repos\USQLSampleApplication1\bin\Debug\DataRoot
Validating package
Initialize : Full path to package 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Initialize : 'C:\Users\mwstevens\AppData\Local\USQLDataRoot' is a directory
Initialize : Loading package manifest 'C:\Users\mwstevens\AppData\Local\USQLDataRoot\manifest.xml'
Initialize : no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Initialize : Microsoft.Analytics.CICD.Deploy.PackageInitException: no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
at Microsoft.Analytics.CICD.Deploy.Package.Initialize(String pathToPackage, CancellationToken cancelToken, String workingFolder)
Deployment failed with no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Microsoft.Analytics.CICD.Deploy.PackageInitException: no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
at Microsoft.Analytics.CICD.Deploy.Package.Initialize(String pathToPackage, CancellationToken cancelToken, String workingFolder)
at Microsoft.Analytics.CICD.Deploy.Program.DoLocalDeployment()
at Microsoft.Analytics.CICD.Deploy.Program.Main(String[] args)

Help with crawling Azure Data Lake folders

I have a requirement to crawl all ADLS (gen1) folders and ingest into a metadata DB for building a search solution. I am looking for the recommended and efficient way for crawling the ADLS folders. Can you please provide me some recommendations?
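One generic way to structure such a crawl is a breadth-first walk that issues one listing call per folder. The sketch below is an illustration and not an ADLS client: `list_dir` is a hypothetical callable standing in for whatever listing API is used, and the in-memory `fake_store` exists only to make the example runnable.

```python
from collections import deque
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# (path, is_directory) pairs returned for one folder.
Entry = Tuple[str, bool]


def crawl(root: str, list_dir: Callable[[str], Iterable[Entry]]) -> Iterator[Entry]:
    """Breadth-first walk: yield every entry under `root`, one listing per folder."""
    queue = deque([root])
    while queue:
        folder = queue.popleft()
        for path, is_dir in list_dir(folder):
            yield path, is_dir
            if is_dir:
                queue.append(path)  # visit sub-folders later


# In-memory fake store used only to make the sketch runnable.
fake_store: Dict[str, List[Entry]] = {
    "/": [("/raw", True), ("/readme.txt", False)],
    "/raw": [("/raw/2020", True)],
    "/raw/2020": [("/raw/2020/a.csv", False)],
}

for path, is_dir in crawl("/", lambda p: fake_store.get(p, [])):
    print("DIR " if is_dir else "FILE", path)
```

Because the walk is a generator, entries can be written to the metadata DB as they are discovered rather than after the whole tree has been listed.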

SAS Token or ACL for DataLake directory (namespace) permissions?

Hello,

I have an Azure Function that is triggered when a blob is uploaded to a nested directory within a DataLake Gen2 storage container.

  • I do not want to give the Function permissions on the entire DataLake (via connection string in Function app settings).
  • Instead, I need to scope the Function's credentials down to a single, nested namespace.

How is this done?

Tried so far:

  • It appears SAS tokens can only be generated at the first level of the DataLake (container level).

    • This does not work for my use-case as there is one top-level container for the entire DataLake (so in essence, the Function still has root access to the DataLake)
    • I want to scope permissions deeper, to the nested directory (namespace) level
  • It appears Access Control Lists do not handle this scenario either.

    • They don't grant a specific permission key/token/etc that can be used by a Function app setting
    • I'd still have to grant the Function full root access to the DataLake, then hope the ACL perm works
    • In addition, there is no GUI for managing ACLs (outside of Azure Storage Explorer for granting permissions), so they will inevitably be lost, forgotten, etc.

How do we scope permissions to a DataLake namespace in a way compatible with Azure Functions?

Thank you

Uploading and downloading files from/to local

Hi! I see that there's a Utils.java class that has several useful upload methods in it. Do you think it'd be possible to make the constructor of that class public so that we can use Java code to upload/download files? Thank you!

How do I expose data out from Azure Data Lake?

Suppose I process all the CSV files in the data lake and store the results inside the data lake in a structured or unstructured way. How do I then expose this data to end users?

I mean, do I have to write a Web API and link it to the Azure Data Lake databases?

Data Lake Analytics service wastes space

I just created a new analytics service and attached it to my store.
As a result I noticed an immediate jump in space consumption (up to 3GB).

Iterating over the store, I see things like this,
image

That's like 1GB duplicated.
What would happen if I had more databases, further duplication?

Seems like a bug?

For GDPR compliance, how can user data be deleted from the data lake efficiently?

I have a question about GDPR compliance: we need to delete user data from the data lake when a user requests deletion of their account. Currently we store user data for analytics in Azure Data Lake with the following configuration:

  • Type: Data Lake Storage Gen1
  • Data format in Data lake: Avro
  • Using default partitioning based on time

We are using a de-identified data lake approach to address data privacy challenges: sensitive information is de-identified and protected before it even enters the data lake, which minimizes the storage and use of personally identifiable information. So before storing data in the data lake, we mask it with a random ID. Is it still necessary to delete the non-personally identifiable information from the data lake to comply with GDPR? If so, is there an efficient way to delete user-specific data, given that Azure Data Lake Store is an append-only file system in which committed data cannot be erased or updated?

Please let me know if you need any further information.

Thanks a lot for your help in advance.
