This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
Samples and Docs for Azure Data Lake Store and Analytics
Home Page: http://aka.ms/AzureDataLake
License: MIT License
Hi all,
When trying to compile the solution, the following two namespaces cannot be resolved:
using Microsoft.Azure.Subscriptions;
using Microsoft.Azure.Subscriptions.Csm;
I see that some packages were added a few days ago. Unfortunately, the SDK ZIP file does not contain that package:
https://github.com/MicrosoftBigData/AzureDataLake/releases/download/PowerShellSDK_September10/Azure_SDK_DataLakeOnly.zip
Thanks
Damir
Consider the following U-SQL (reflecting the issue we've observed in our production development):
@data =
    SELECT *
    FROM (VALUES
        ("A", (decimal?)2,    "LabelA", "A:1"),
        ("A", (decimal?)null, "LabelA", "A:2"),
        ("A", (decimal?)1,    "LabelA", "A:3"),
        ("B", (decimal?)4,    "LabelB", "B:1")) AS T(Name, Value, Type, Id);

@result =
    SELECT Name,
           SUM(Value) AS Sum,
           Type,
           AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;

OUTPUT @result TO "/res.csv" USING Outputters.Csv(outputHeader:true);
The aggregator in this case is the sample custom aggregator from the U-SQL reference doc (our production code is different, but the problem is demonstrable with the sample UDAGG code):
using Microsoft.Analytics.Interfaces;

namespace AggTest
{
    public class genericAggregator : IAggregate<string, string, string>
    {
        private string AggregatedValue;

        public override void Init()
        {
            AggregatedValue = "";
        }

        public override void Accumulate(string ValueToAgg, string GroupByValue)
        {
            AggregatedValue += ValueToAgg + ",";
        }

        public override string Terminate()
        {
            // remove last comma
            return AggregatedValue.Substring(0, AggregatedValue.Length - 1);
        }
    }
}
When executed, either within an ADLA instance in Azure or using the USQL local run environment within Visual Studio 2017, the result is:
"Name","Sum","Type","RowId"
"A",,"LabelA","A,A,A"
"B",4,"LabelB","B"
The built-in USQL SUM aggregator has returned NULL rather than the expected output of 3 for the row with Name A. Removing the call to the user-defined aggregator returns a rowset with the expected SUM aggregation value of 3:
"Name","Sum","Type"
"A",3,"LabelA"
"B",4,"LabelB"
This is clearly inconsistent behaviour for the SUM aggregate, which shouldn't care whether a UDAGG is included in the processing of the same group.
Interestingly, if the @result query is modified to:
@result =
    SELECT Name,
           AVG(Value) AS Avg,
           SUM(Value) AS Sum,
           Type,
           AGG<AggTest.genericAggregator>(Name, string.Empty) AS RowId
    FROM @data
    GROUP BY Name, Type;
Then the output is:
"Name","Avg","Sum","Type","RowId"
"A",1.5,3,"LabelA","A,A,A"
"B",4,4,"LabelB","B"
In this case the introduction of the AVG aggregator both produces the expected average and coaxes the SUM aggregator into producing the correct answer!
At present we've implemented two workarounds: coalescing the NULLs away, e.g.
SUM(Value ?? 0.0m) AS Sum
or adding a redundant AVG over the same column, as shown above. Obviously neither workaround is ideal, as we may care whether the result of a SUM aggregate is actually NULL when there were no non-NULL values to sum in the group.
We have an ADLS Gen2 file system with several folders. Down to the bottom leaf folder, the owner and mask have r | w | e permissions.
However, when a file is landed in that folder, the execute permission is lost, causing a warning that access permissions for other entities are beyond the bounds of the mask. It is unclear why the owner and mask are not inheriting execute permissions on the file.
Why are the mask and owner not inheriting execute permissions? In many cases (such as when a folder is generated), this locks down files.
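One thing worth checking is whether the parent directory carries default ACL entries, since only the "default:" entries are applied to children created afterwards. A sketch using the Azure CLI (the account name, filesystem, and path below are placeholders):

```shell
# 'myaccount', 'myfs' and 'raw/landing' are placeholders.
# The base entries apply to the directory itself; the 'default:' entries
# are what newly created children pick up at creation time.
az storage fs access set \
  --account-name myaccount -f myfs --path raw/landing \
  --acl "user::rwx,group::rwx,other::---,default:user::rwx,default:group::rwx,default:other::---"
```

This is only a sketch of the mechanism, not a confirmed fix for the warning you are seeing.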
Any chance that you will be providing NuGet packages for the core Microsoft.Analytics.xxx assemblies, rather than having to install tooling to get at them?
I would very much like a super easy way to get the first 10 lines or the last 10 lines of a file in data lake from PowerShell, similar to the unix head and tail commands.
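For ADLS Gen1, something close to this already exists (a sketch assuming the Az.DataLakeStore module is installed; the account name and file path are placeholders):

```powershell
# First 10 rows of a file, similar to unix head:
Get-AzDataLakeStoreItemContent -Account "myadls" -Path "/Samples/Data/SearchLog.tsv" -Head 10

# Last 10 rows of a file, similar to unix tail:
Get-AzDataLakeStoreItemContent -Account "myadls" -Path "/Samples/Data/SearchLog.tsv" -Tail 10
```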
Hi,
Today I've downloaded the latest release from the GitHub repository, only to notice that some cmdlets from the getting-started guide are failing on my machine and my colleagues' machines; i.e.
Register-AzureProvider -ProviderNamespace "Microsoft.DataLake"
and Get-AzureResourceGroup
are no longer recognized.
I've removed the official Azure PowerShell cmdlets and have only the ADL release installed.
Is there something I'm doing wrong?
I recently refreshed my ADL tools on my local environment, but I'm having issues with the built-in outputter and headers. No headers are printed with the data. I ran the same script in the cloud environment and headers were successfully published.
I've tried removing and reinstalling the ADL tools several times, but still no luck. Is this a known issue with the Local Run SDK?
I have hundreds of devices sending messages to IoT Hub, and I am trying to use Data Lake to process all these messages.
All the articles out there on the internet show uploading CSV files for processing. Is converting the JSON messages to CSV files a must before they can be processed by the Data Lake engine? Can't I process all the incoming JSON telemetry directly in Azure Data Lake?
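Converting to CSV is not required. As a sketch (assuming the U-SQL sample format assemblies Microsoft.Analytics.Samples.Formats and Newtonsoft.Json have been registered in the database; the input path and column names are placeholders for whatever the telemetry contains), JSON files can be extracted directly:

```usql
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

USING Microsoft.Analytics.Samples.Formats.Json;

// Placeholder columns; adjust to the shape of the device messages.
@telemetry =
    EXTRACT deviceId string,
            temperature double,
            eventTime string
    FROM "/iothub/{*}.json"
    USING new JsonExtractor();

OUTPUT @telemetry
TO "/output/telemetry.csv"
USING Outputters.Csv();
```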
The following input stream does not exist: c:\Users\mwstevens\AppData\Local\USQLDataRoot\Samples\Data\Searchlog.tsv
How do I load the Manifest Package?
Output from Deployment:
Deploying local : DataRoot = C:\Users\mwstevens\source\repos\USQLSampleApplication1\bin\Debug\DataRoot
Validating package
Initialize : Full path to package 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Initialize : 'C:\Users\mwstevens\AppData\Local\USQLDataRoot' is a directory
Initialize : Loading package manifest 'C:\Users\mwstevens\AppData\Local\USQLDataRoot\manifest.xml'
Initialize : no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Initialize : Microsoft.Analytics.CICD.Deploy.PackageInitException: no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
at Microsoft.Analytics.CICD.Deploy.Package.Initialize(String pathToPackage, CancellationToken cancelToken, String workingFolder)
Deployment failed with no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
Microsoft.Analytics.CICD.Deploy.PackageInitException: no manifest.xml in 'C:\Users\mwstevens\AppData\Local\USQLDataRoot'
at Microsoft.Analytics.CICD.Deploy.Package.Initialize(String pathToPackage, CancellationToken cancelToken, String workingFolder)
at Microsoft.Analytics.CICD.Deploy.Program.DoLocalDeployment()
at Microsoft.Analytics.CICD.Deploy.Program.Main(String[] args)
I have a requirement to crawl all ADLS (gen1) folders and ingest into a metadata DB for building a search solution. I am looking for the recommended and efficient way for crawling the ADLS folders. Can you please provide me some recommendations?
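For the crawl itself, a breadth-first enumeration of the folder tree is usually the efficient pattern, batching the results into the metadata DB as you go. A minimal sketch of the traversal logic, with the listing call abstracted behind a callback (with the azure-datalake-store package you would pass a thin wrapper around AzureDLFileSystem.ls(); the local stand-in below is purely for illustration):

```python
from collections import deque

def crawl(list_fn, root="/"):
    """Breadth-first crawl of a hierarchical store.

    list_fn(path) must return a list of (path, is_directory) tuples,
    e.g. a wrapper around the ADLS listing API (names here are placeholders).
    Returns every path found under root.
    """
    queue = deque([root])
    entries = []
    while queue:
        current = queue.popleft()
        for path, is_dir in list_fn(current):
            entries.append(path)       # in production, batch-insert into the metadata DB
            if is_dir:
                queue.append(path)     # descend into subfolders
    return entries

# Local stand-in for the ADLS listing call, used for illustration only:
tree = {
    "/": [("/raw", True), ("/readme.txt", False)],
    "/raw": [("/raw/2019", True)],
    "/raw/2019": [("/raw/2019/a.csv", False)],
}
print(crawl(lambda p: tree.get(p, [])))
# → ['/raw', '/readme.txt', '/raw/2019', '/raw/2019/a.csv']
```

Batching and parallelising the listing calls per directory is the main lever for speed at scale; the traversal order above is otherwise a standard BFS.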
Hello,
I have an Azure Function that is triggered when a blob is uploaded to a nested directory within a Data Lake Gen2 storage container.
How is this done?
Tried so far:
It appears SAS tokens can only be generated at the first level of the DataLake (container level).
It appears Access Control Lists do not handle this scenario either.
How do we scope permissions to a DataLake namespace in a way compatible with Azure Functions?
Thank you
Hi! I see that there's a Utils.java class that has several useful upload methods in it. Do you think it'd be possible to make the constructor of that class public so that we can use Java code to upload/download files? Thank you!
When reviewing the doc here: https://github.com/MicrosoftBigData/AzureDataLake/blob/master/docs/PowerShell/UserManual.md
it says that Import/Export-AzureDataLakeItem are used for upload and download. In Azure Storage, I use Set/Get-AzureStorageBlobContent and do not use import/export. Is there a reason not to be consistent there?
Suppose I am processing all the CSV files in Data Lake and storing the results back in Data Lake in a structured or unstructured way; how do I then expose these data to end users?
I mean, do I have to write a Web API and link it to Azure Data Lake databases?
I just created a new analytics service and attached it to my store.
As a result I noticed an immediate jump in space consumption (up to 3GB).
Iterating over the store, I see things like this,
That's like 1GB duplicated.
What would happen if I had more databases, further duplication?
Seems like a bug?
There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.
Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.
I have a question related to the GDPR requirement to delete user data from the data lake when a user requests account deletion. Currently we are storing user data for analytics in Azure Data Lake with the following configuration:
We are using a de-identified data lake approach to address data privacy challenges, de-identifying and protecting sensitive information before it even enters the data lake and minimizing the storage and use of personally identifiable information. So before storing data in the data lake, we mask it with a random ID. Is it still required to delete the non-personally-identifiable information from the data lake to be compliant with GDPR? If so, is there an efficient way to delete user-specific data from the data lake, given that Azure Data Lake Store is an append-only file system where data, once committed, cannot be erased or updated?
Please let me know if you need any further information.
Thanks a lot for your help in advance.