matomo-org / matomo-log-analytics

Import any kind of server logs in Matomo for powerful log analytics. Universal log file parsing and reporting.

Home Page: https://matomo.org/log-analytics/

License: GNU General Public License v3.0

Languages: Shell 0.13%, Python 99.87%


matomo-log-analytics's Issues

Importing w3c extended logs

Hi,

Is there any documentation as to how we can import custom formats using the import_logs.py script?

We have some IIS logs that fail using any of the log-format-name options.

I'm a little confused as to how to use the log regex. Our IIS logs are currently set up as follows:

#Fields: date time s-sitename s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status sc-bytes 

This produces log entries like the following:

2014-06-03 05:14:44 W3SVC726003028 10.0.1.3 GET /index.html - 80 - 10.62.32.123 HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/35.0.1916.114+Safari/537.36 - 304 0 0 344

Can you point me in the right direction with this?

Thanks,

Dan
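
A hedged starting point for a custom regex matching that field order (untested; the named groups are the ones import_logs.py's built-in formats use, and depending on the script version the date parsing may need adjusting too):

./import_logs.py --url=http://your-piwik/ --idsite=1 --log-format-regex="(?P<date>\d+-\d+-\d+ [\d:]+) \S+ \S+ \S+ (?P<path>/\S*) (?P<query_string>\S*) \S+ \S+ (?P<ip>[\d.]+) \S+ (?P<user_agent>\S+) (?P<referrer>\S+) (?P<status>\d+) \S+ \S+ (?P<length>\S+)" access.log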

Migrated from matomo-org/matomo#5418

import_logs.py fails to detect log type with multiple IPs in first line

When using the X-forwarded-for header for load-balanced sites or proxied traffic, it is possible for the webserver to record multiple IPs on a line. This appears to break the log detection of import_logs.py.

Broken example log:

218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"
108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "http://www.referringsite.com/news/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36"

Stack trace:

Traceback (most recent call last):
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1575, in <module>
    main()
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1539, in main
    parser.parse(filename)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1390, in parse
    format = self.detect_format(file)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1349, in detect_format
    logging.debug('Format %s is the best match', format.name)
AttributeError: 'NoneType' object has no attribute 'name'

While a quick fix is to move any offending lines beyond a "good" line, this is not easily automated.

Modifying log above so script works:

108.128.162.178 - - [17/Oct/2013:00:33:47 -0400] "GET / HTTP/1.1" 200 8040 "http://www.referringsite.com/news/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.26 Safari/537.36"
218.108.232.188, 10.183.250.139 - - [17/Oct/2013:00:33:34 -0400] "GET /blog/ HTTP/1.0" 200 11714 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)"

I admit, my python-foo is not excellent but I may look over the weekend and try to patch the code. I believe the best option is to catch the error in detection and try the next line.
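
In the meantime, a minimal pre-processing sketch (a standalone workaround, not a patch to import_logs.py; the script name is made up) that keeps only the first address of an X-Forwarded-For pair before the line reaches the detector:

# strip_xff.py: reduce "ip1, ip2 - - [...]" prefixes to "ip1 - - [...]"
import sys

for line in sys.stdin:
    head, sep, rest = line.partition(' - - ')
    if sep and ',' in head:
        # keep only the first (client) address of the comma-separated list
        head = head.split(',')[0]
    sys.stdout.write(head + sep + rest)

Usage: cat access.log | python strip_xff.py | python import_logs.py --url=http://your-piwik/ --idsite=1 -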

Migrated from matomo-org/matomo#4230

Log Analytics: add support for Netscaler w3c logs

It was requested that we add support to parse Netscaler w3c logs.

Some Load Balancers (Citrix Netscaler) use this w3c format.

Here is a sample log format:

#Version: 1.0
#Software: Netscaler Web Logging(NSWL)
#Date: 2014-02-18 11:55:13
#Fields: date time c-ip cs-username sc-servicename s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status cs-bytes sc-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
2014-02-18 11:55:13 172.20.1.21 - HTTP 192.168.6.254 8080 GET /Citrix/XenApp/Wan/auth/login.jsp - 302 247 355 0 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+Trident/4.0;+.NET+CLR+1.1.4322;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506.648;+.NET+CLR+3.5.21022) - -
2014-02-18 11:55:13 172.20.1.21 - HTTP 192.168.6.254 8080 GET /Citrix/XenApp/Wan/auth/silentDetection.jsp - 200 310 5609 0 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+5.1;+Trident/4.0;+.NET+CLR+1.1.4322;+.NET+CLR+2.0.50727;+.NET+CLR+3.0.04506.648;+.NET+CLR+3.5.21022) JSESSIONID=7BBF2F11B80261B27D23010421412323 -

Migrated from matomo-org/matomo#4707

Log Analytics: let user import error logs

The goal of this issue is to improve Log Analytics to let the tool import error logs in Piwik.

Currently, Log Analytics imports access logs from many different log formats, but it won't import error logs. Importing error logs would be very valuable: for example, visualising errors, counting errors over time, monitoring error spikes with the Alerts plugin, and more.

Migrated from matomo-org/matomo#6241

Make import_logs.py IPv6 compatible for W3c extended / IIS log formats

Hi,

we are currently investigating whether Piwik log file analytics can replace AWStats log file analytics.
In doing so, we came across the problem that Piwik seems unable to parse log files containing IPv6 addresses. After removing every IPv6 line from the file, the import works fine.

The following import command is used

python /var/www/piwik.ibumedia.de/misc/log-analytics/import_logs.py --url=http://piwik.ibumedia.de --idsite=77 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --recorder-max-payload-size=400 --dry-run 20130728.log

The log format is an IIS 7 W3C format with all fields selected.
Lines like this fail on import:

2013-07-28 02:34:09 W3SVC01 HOST123 1234:1234:12:123::1 GET / - 80 - 2a01:4f8:0:a101::6:1 HTTP/1.1 Hetzner+System+Monitoring - - www.example.com 200 0 0 3446 96 0

best regards
mr.moe
Keywords: logfile ipv6 import
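
A hedged guess at the kind of change needed: the IP group in the W3C/IIS regex would have to accept IPv6 as well, eg. widening an IPv4-only pattern such as

(?P<ip>[\d.]+)

to something like

(?P<ip>[\da-fA-F.:]+)

so that both 10.62.32.123 and 2a01:4f8:0:a101::6:1 match (an illustration of the idea, not the script's actual pattern).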

Migrated from matomo-org/matomo#4062

Log Analytics: Monitor Bandwidth for each page, download, and measure overall traffic in bytes

As a user, when importing my server logs in analytics, I want to measure the Bandwidth that was used by each page view.

How would it work?

  • The bandwidth information is commonly available in the server access log files. It is measured in Bytes.
  • Log analytics script would detect the numeric bandwidth value
  • Log analytics forwards this value to the Tracking API and this value will be stored in the action.
  • This value can be attached to downloads and pageviews
  • The Reporting UI:
    • for each page: display the total bandwidth for each page view
    • for each directory (groups of pages): display the aggregated bandwidth value (the sum) of all pages within the directory

Proposed implementation:

  • New column in log_link_visit_action: bandwidth
  • Tracking API: add a new parameter bw_bytes (file size in bytes); a sketch follows this list
    • Log Analytics parses the file size, and sets &bw_bytes on Tracking API requests
    • first log format we need to support is Apache common log format.
    • Tracker stores bandwidth in the new column log_link_visit_action.bandwidth
    • Actions/Archiver will aggregate the filesize in the Action report blobs.
    • This is similar to Average generation time.
    • The metric is processed for Pages, Page titles, and Download files.
    • User interface: the new metric "Bytes" will be displayed in the Actions tables
    • It is displayed in the existing Actions reports (Pages, Page titles, Downloads): this is not a new report.
    • If a user is not using Log Analytics, or is using it but the logs don't include the file size, then Actions reports will not have the "Bytes" column.
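
A minimal sketch of the importer side, assuming the proposed bw_bytes parameter (the parameter name and plumbing are this issue's proposal, not an existing Tracking API feature):

def add_bandwidth(args, response_bytes):
    # response_bytes is the size parsed from the log line (Apache's %b
    # field in common log format); '-' means no body was sent
    if response_bytes not in (None, '', '-'):
        args['bw_bytes'] = str(int(response_bytes))
    return args

print(add_bandwidth({'idsite': '1', 'url': 'http://example.com/'}, '11714'))
# e.g. {'idsite': '1', 'url': 'http://example.com/', 'bw_bytes': '11714'}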

Other steps:

  • Add new overall metric (and sparkline) in Visitors > Overview report: Total Bandwidth
  • Add new FAQ How do I measure traffic bandwidth used by a page, and/or overall bandwidth?

To be confirmed / optional:

  • Besides Apache common log format, maybe other log formats contain the bandwidth information
  • New custom segment "Bandwidth" to let users segment traffic based on the file size in bytes
    • for example "Show me reports only for requests for files over 500,000 bytes"
  • We could even measure the current page size in JavaScript, which would be useful, but unfortunately it can only give an approximate value. It counts the number of bytes in the DOM tree: document.documentElement.innerHTML.length.
    • This is an approximation only: if a lot of page content is created on the fly from script, that content will count in innerHTML despite not being present in the original source, which could throw the calculation off.

Migrated from matomo-org/matomo#5248

On the same webpage, Custom Variables always get appended to the previous ones.

Q. How can we remove the indexed information of the previous tracking attempt from the cvar (scope='page') variable?

Scenario :

I am currently working on a project that includes Liferay Portal integrated with Spring and Hibernate.

There are two sections on single page

  1. Browse section
  2. Search section

The Search section has an integrated Download button. In the download section, we have applied tracking for a few things using 'setCustomVariable' from piwik.js, and the output of the call to piwik.php comes out like this:

cvar output (through HTTPFox):

{"1":["Download","155231"],"3":["source","portal"]}


Now when we apply tracking through 'setCustomVariable' in the Browse section (on the same page), we get

"1":["Download","155231"]...

appended to our new piwik.php call, which comes out like:

{"1":["Download","155231"],"3":["source","portal"],"9":["ASSET","37"]}


Expected output:

{"9":["ASSET","37"]}

for the next tracking attempt (Browse on the same page where the search was done).

Keywords: Custom Variable, cvar, tracking, setCustomVariable

Migrated from matomo-org/matomo#5302

import_logs.py doesn't support Unicode characters with json format

I'm working with iOS apps whose names sometimes contain special characters, and when these apps make requests to my server, the import_logs.py script fails to parse those lines. This only happens when using the nginx_json format.

Here's an example:

{"ip": "1.1.1.1","host": "blabla.com","path": "/api/test.xml","status": "200","referrer": "-","user_agent": "F\xFAtbol/1.0 (iPhone; iOS 7.1; Scale/2.00)","length": 267,"generation_time_milli": 0.009,"date": "2014-04-16T14:56:24-03:00"}

Note that user_agent contains a special character, and that the script fails when parsing such lines.

Thanks
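
A minimal repro, assuming the nginx_json format is parsed with Python's json module (which only allows \uXXXX escapes in strings, so the \xFA above is rejected):

import json

line = '{"user_agent": "F\\xFAtbol/1.0 (iPhone; iOS 7.1; Scale/2.00)"}'
try:
    json.loads(line)
except ValueError as e:
    print(e)  # Invalid \escape: ...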

Migrated from matomo-org/matomo#5013

import_logs.py needs upgrade to Python 3 or updated shebang line

Some distributions like Arch have /usr/bin/python point to python3, not Python 2.x, and even more popular distributions are transitioning to do the same. Currently, the import_logs.py script does not work under Python 3. As such, either it should be updated to support Python 3, or its shebang line should be changed to #!/usr/bin/python2 instead of assuming that /usr/bin/python is Python 2.x.
Keywords: log-analytics, python

Migrated from matomo-org/matomo#3759

Pages missing when importing IIS 8.5 logs

Importing IIS 8.5 logs (from Windows Server 2012) using import_logs.py does not populate anything under Actions > Pages. Adding the 'action_name' argument enables the pages to be tracked.

Around line 1243:
----------------
if config.options.replay_tracking:
    # prevent request to be force recorded when option replay-tracking
    args['rec'] = '0'
args.update(hit.args)


Changed to include action_name so Pages and Page Titles are populated.
----------------
args['action_name'] = url.encode('utf8')

if config.options.replay_tracking:
    # prevent request to be force recorded when option replay-tracking
    args['rec'] = '0'
args.update(hit.args)

Migrated from matomo-org/matomo#4937

Specify site ID or domain in "Logs Import summary"

When I look at my Piwik logs it's difficult to understand which website I'm viewing the logs of, as I have a list of import summaries that don't specify their site ID (or, better, the domain name):

Purging Piwik archives for dates: 2014-10-06

To re-process these reports with your newly imported data, execute the following command: 
$ /path/to/piwik/console core:archive --url=http://example/piwik/

Reference: http://piwik.org/docs/setup-auto-archiving/ 

Logs import summary
-------------------

    154 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        0 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    154 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 1 seconds
    Requests imported per second: 86.7 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...

Logs import summary
-------------------
    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        0 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    0 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 0.0 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file. 
Maybe try specifying the format with the --log-format-name command line argument.
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
913 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
913 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
913 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
913 lines parsed, 300 lines recorded, 74 records/sec (avg), 300 records/sec (current)
913 lines parsed, 300 lines recorded, 59 records/sec (avg), 0 records/sec (current)
913 lines parsed, 600 lines recorded, 99 records/sec (avg), 300 records/sec (current)
913 lines parsed, 600 lines recorded, 85 records/sec (avg), 0 records/sec (current)
913 lines parsed, 913 lines recorded, 113 records/sec (avg), 313 records/sec (current)

Purging Piwik archives for dates: 2014-10-06

To re-process these reports with your newly imported data, execute the following command: 
$ /path/to/piwik/console core:archive --url=http://example/piwik/

Reference: http://piwik.org/docs/setup-auto-archiving/ 

Logs import summary
-------------------

    913 requests imported successfully
    2 requests were downloads
    0 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        0 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    913 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 8 seconds
    Requests imported per second: 108.99 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file.
Maybe try specifying the format with the --log-format-name command line argument.
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file. 
Maybe try specifying the format with the --log-format-name command line argument.
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file. 
Maybe try specifying the format with the --log-format-name command line argument.
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...

Purging Piwik archives for dates: 2014-10-06

To re-process these reports with your newly imported data, execute the following command: 
$ /path/to/piwik/console core:archive --url=http://example/piwik/

Reference: http://piwik.org/docs/setup-auto-archiving/ 

Logs import summary
-------------------

    62 requests imported successfully
    5 requests were downloads
    0 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        0 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    62 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 99.19 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...

Purging Piwik archives for dates: 2014-10-06

To re-process these reports with your newly imported data, execute the following command: 
$ /path/to/piwik/console core:archive --url=http://example/piwik/

Reference: http://piwik.org/docs/setup-auto-archiving/ 

Logs import summary
-------------------

    15 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        0 invalid log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        0 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Website import summary
----------------------

    15 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 69.03 requests per second

Migrated from matomo-org/matomo#6393

Fatal Error: '' from import_logs.py

This started about a week ago; I don't think anything changed here. Every night we run:

import_logs.py --enable-http-errors --enable-http-redirects --enable-bots --enable-static --recorders=6 --url=https://analytics.example.com/ <all logs from yesterday>

Now it regularly gets stuck:

Parsing log /var/log/apache2/example.com/www/access/access-2014-10-14.log...
6886 lines parsed, 6284 lines recorded, 79 records/sec (avg), 148 records/sec (current)
6886 lines parsed, 6461 lines recorded, 80 records/sec (avg), 177 records/sec (current)
6886 lines parsed, 6676 lines recorded, 82 records/sec (avg), 215 records/sec (current)
...
6886 lines parsed, 6676 lines recorded, 25 records/sec (avg), 0 records/sec (current)

and eventually fails with the following:

6886 lines parsed, 6676 lines recorded, 25 records/sec (avg), 0 records/sec (current)
Fatal error: ''
You can restart the import of "/var/log/apache2/example.com/www/access/access-2014-10-14.log" from the point it failed by specifying --skip=5 on the command line.

Is that the sixth line of the log file? If so, there's nothing weird in it.

The problem began on 2.7.0, but persists after an upgrade to 2.8.0.

Migrated from matomo-org/matomo#6451

log-analytics & import_logs.py, hosts parameter failing

Hi,

Like I said on the forum: http://forum.piwik.org/read.php?2,123133,123166#msg-123166.

I'm trying to import some logs from the nginx web server into Piwik 2.9.1.

I have followed the howto from https://github.com/piwik/piwik/tree/master/misc/log-analytics#setup-nginx-logs.

This setup generates a nice access log file that looks like:

{"ip": "192.168.1.10","host": "www.test.nl","path": "/","status": "200","referrer": "https://www.test.nl/","user_agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36","length": 915,"generation_time_milli": 0.001,"date": "2015-01-02T13:20:31+01:00"}

When I try to import this access log with:
$ python /var/www/misc/log-analytics/import_logs.py --url=http://webanalyse.domain.local/ --add-sites-new-hosts --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --log-format-name=nginx_json --config /var/www/config/config.ini.php access.log

I get the following error:

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log access.log...
Traceback (most recent call last):
  File "/var/www/misc/log-analytics/import_logs.py", line 1750, in <module>
    main()
  File "/var/www/misc/log-analytics/import_logs.py", line 1717, in main
    parser.parse(filename)
  File "/var/www/misc/log-analytics/import_logs.py", line 1576, in parse
    resolver.check_format(format)
  File "/var/www/misc/log-analytics/import_logs.py", line 1119, in check_format
    elif 'host' not in format.regex.groupindex and not config.options.log_hostname:
AttributeError: 'NoneType' object has no attribute 'groupindex'

If I understand it correctly, it's missing the host parameter in the access log, but as you can see it is there. What is going wrong?

Kind regards,

Michiel Piscaer

Migrated from matomo-org/matomo#6919

Log Analytics: new parameter --download-extensions to override list of files tracked as downloads

By default, when Log Analytics imports logs and detects a known file download, it records the action as a Download, accessible in Actions > Downloads.

The goal of this issue is to let a Piwik user specify which file extensions should be tracked as downloads. Other files will be discarded.

The current list of extensions tracked as download is:

    '7z aac arc arj asf asx avi bin csv deb dmg doc exe flv gz gzip hqx '
    'jar mpg mp2 mp3 mp4 mpeg mov movie msi msp odb odf odg odp '
    'ods odt ogg ogv pdf phps ppt qt qtm ra ram rar rpm sea sit tar tbz '
    'bz2 tbz tgz torrent txt wav wma wmv wpd xls xml xsd z zip '
    'azw3 epub mobi'

One solid use case is "Track only PDF and doc files and ignore all the rest", which this new parameter will provide via --download-extensions=pdf,doc

Migrated from matomo-org/matomo#6214

Support Page Speed tracking in IIS 8 log files (generation time)

Hi There,

I have imported an IIS 8 log (including the time-taken field) into Piwik 2.7.0 successfully.

However, the time-taken information is not working: the "average generation time" values are all 0s in the report.

Not sure whether it is an incorrect regex, or whether the "generation_time_milli" parameter does not support IIS logs yet?

(I've tested Apache CLF; average generation time works fine in the report.)

log.jpg - the log file content for import. 
#Fields: date time cs-uri-stem cs-uri-query c-ip cs(User-Agent) cs(Referer) sc-status time-taken 
2014-09-27 10:00:00 /api/AutoCompleteService/apiEstateAutoComplete/ keyword=%E6%96%B0%E5%85%83&callback=_jqjsp&_1411811867650= 119.247.34.224 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/37.0.2062.124+Safari/537.36 http://hk.cet.com/home/Index.aspx 200 436000 

result.jpg - the import error is "invalid line detected (line did not match)"

./import_logs.py --url=http://localhost/piwik/ /var/www/html/piwik/misc/log-analytics/test2.log --idsite=1 --dry-run --show-progress --debug --debug --log-format-regex="(?P<date>^\d+[-\d+]+[\d+:]+) (?P<path>/\S*) (?P<query_string>\S*) (?P<ip>[\d*.]*) (?P<user_agent>\S+) (?P<referrer>\S+) (?P<status>\d+) (?P<generation_time_milli>\S+)" 


Many thanks

Migrated from matomo-org/matomo#6388

Custom log format for log analytics

Hi,

How can I create a custom log format in Piwik? I am unable to load my log file through the import_logs.py script; I think I have to create a new log format.

Sample log line:
xxx.xxx.xxx.x - - [21/Dec/2013:04:11:59 +0000] "GET /cds/phQoANRmI3SvEcHzQurM3vMGqI5u6iAYcEGjbwLrPZOLVbcOeys-qSavG9Fz03sZ1PAZkys48rhfHPrG4qhwZGB31Q5b4IdAgDaMYJO8inpJntM.?id=b2YxxoncjD7B-jzZLa75&expirationTime=1387858318631 HTTP/1.1" 200 6218 "-" "-" 15

Error Message:
Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file.
Maybe try specifying the format with the --log-format-name command line argument.

Could you please send me documentation for creating custom log formats?

thanks,
Sudesh
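
Until such documentation exists, here is a hedged custom regex for that sample line (untested; the named groups match those used by import_logs.py's built-in NCSA formats, and it assumes the trailing 15 is a generation time in milliseconds):

python import_logs.py --url=http://your-piwik/ --idsite=1 --log-format-regex='(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\d+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)" (?P<generation_time_milli>\S+)' access.log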

Migrated from matomo-org/matomo#5947

import_logs.py ignores lines after a line with http 200 status is processed

Hello to all!

I am using piwik for a customer and just found out the following very serious issue.

I am using the latest piwik (2.2.2), php 5.4.26 and Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) v.1500 32 bit (Intel) on win32.

PROBLEM:

All lines (in the web log) after a line with HTTP status 200 are ignored! I.e. in the following example, only the first entry is included, both in Visits and in Actions. This happens whether I import before or after archiving, so archiving is irrelevant.

I import access.log (a file with just 2 lines):

66.249.76.11 - - +0100 "GET /id/resource/013541589 HTTP/1.1" 303 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.76.11 - - +0100 "GET /doc/resource/007667232 HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

via command:

python import_logs.py --url=http://localhost:83/analytics/ access.log --idsite=1 --recorders=2 --enable-http-errors --enable-http-redirects --enable-static --enable-bots

Result:

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log access_006_bl.services.tso.co.uk.2014.05.12.log...
Purging Piwik archives for dates: 2014-05-11
To re-process these reports with your new update data, execute the following command:

piwik/console core:archive --url=http://example/piwik/

Reference: http://piwik.org/docs/setup-auto-archiving/

Logs import summary

2 requests imported successfully
0 requests were downloads
0 requests ignored:

    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

2 requests imported to 1 sites

    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 0 seconds
Requests imported per second: 3.29 requests per second

Kind Regards,
Vassilis

Migrated from matomo-org/matomo#5161

IIS Advanced Logging Module log files support for Log Analytics

The goal of this issue is to add support in the Log Analytics tool for automatically importing access logs generated by the IIS web server's "IIS Advanced Logging Module".

This is a common log format for users of the IIS server. Here is an example:

#Software: IIS Advanced Logging Module
#Version: 1.0
#Start-Date: 2014-11-18 00:00:00.128
#Fields:  date-local time-local s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) cs(Host) sc-status sc-substatus sc-win32-status TimeTakenMS
2014-11-17 17:00:00.363 10.10.28.140 GET /Products/X/_Images/ico_print.gif - 80 - "70.95.93.8" "Mozilla/5.0 (Linux; Android 4.4.4; SM-G900V Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.59 Mobile Safari/537.36" "http://example.com/Search/SearchResults.pg?informationRecipient.languageCode.c=en" "xzy.example.com" 200 0 0 109
2014-11-17 17:00:00.660 10.10.28.140 GET /Topic/hw43061 - 80 - "157.55.39.72" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" - "example.hello.com" 302 0 0 0
2014-11-17 17:00:00.675 10.10.28.140 GET /hello/world/6,681965 - 80 - "173.5.186.174" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0) LinkCheck by Siteimprove.com" - "hello.example.com" 404 0 0 359

This was also requested in #261 with a similar yet different format:

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2014-04-10 12:40:24.190 - GET /Common/shop/shoppingcartAJAX.asp - 80 - 192.168.1.1 "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/537.36" 200 0 0 2036

Notes:

  • User should be able to import such logs without having to use the complicated "Custom regular expression feature" (example).
  • All IIS Advanced Logging files may contain the schema at the start of the file, eg. #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken, which we could perhaps use to parse the file?
  • Add automated test in eg. ImportLogsTest or so

Refs #4707

Migrated from matomo-org/matomo#6795

Log analytics - UTF-8 link from search bug

I use piwik/misc/log-analytics/import_logs.py for my log analytics.
A small share of the search queries is recorded as "????". However, if you look at the source HTML, everything is OK.
In the screenshot, the link address is http://yandex.ru/yandsearch?text=%EA%F3%EF%E8%F2%FC+%E2%FB%EF%F3%F1%EA%ED%EE%E5+%EF%EB%E0%F2%FC%E5&lr=213,
which was correctly converted into Russian.
This is observed on 20% of all requests from Russian Yandex and Google.

Migrated from matomo-org/matomo#5885

Log analytics : import_logs.py doesn't work any more

Hello,

I use import_logs.py and php5 console core:archive to update the visits of my sites, but it hasn't worked since I upgraded to 2.4, and now 2.5.

Apache Log format

LogFormat "%{Host}i %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"" vhosts

Data in MySQL

The data in the DB is not correct:

7470 done 128 2014-08-01 2014-08-01 1 2014-08-01 22:09:14 1
7472 donefea44bece172bc9696ae57c26888bf8a.VisitsSummary 128 2014-08-01 2014-08-01 1 2014-08-01 22:09:14 1
126922 done 128 2014-08-01 2014-08-31 3 2014-09-01 11:10:17 1
126923 donefea44bece172bc9696ae57c26888bf8a.VisitsSummary 128 2014-08-01 2014-08-31 3 2014-09-01 11:10:19 1
13568 done 128 2014-08-02 2014-08-02 1 2014-08-02 22:01:22 1

Import summary

The import summary seems to be correct:

Purging Piwik archives for dates: 2014-07-22 2014-06-19 2014-06-25 2014-07-21 2014-06-20 2014-06-26 2014-07-07 2014-06-05

To re-process these reports with your newly imported data, execute the following command:
$ /path/to/piwik/console core:archive --url=http://example/piwik/

Reference: [piwik.org]

Logs import summary

2138 requests imported successfully
38 requests were downloads
7713 requests ignored:
2119 invalid log lines
215 requests done by bots, search engines, ...
241 HTTP errors
134 HTTP redirects
5004 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname

Website import summary

2138 requests imported to 1 sites
1 sites already existed
0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 53 seconds
Requests imported per second: 39.82 requests per second

Migrated from matomo-org/matomo#6120

When the BulkTracking plugin is disabled, bulk imports succeed, but no data is imported

When the BulkTracking plugin is disabled, import_logs.py reports success when importing log files, but no data is actually imported. This may lead to unnoticed data loss. It would be better if the API reported an error when this plugin is disabled and a bulk-import request is made. I've reproduced this with version 2.10.0. Steps:

  1. Disable the BulkTracking plugin and restart the webservice/PHP-FPM
  2. Perform log imports using import_logs.py, the script will report 'OK'
  3. Perform an archive run, no visits will be imported

Migrated from matomo-org/matomo#6982

import_logs.py occasionally reports .mobi domain visit as .mobi file download

import_logs.py will sometimes report a visit to a .mobi domain home page as a download of a .mobi file. Here is an example of a visit reported as such:

xxxx:xxx:x::xxx - - [09/Apr/2014:07:43:55 +0000] "GET http://example.mobi HTTP/1.0" 200 581 "-" "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0\x5Cr\x5Cn"
xxxx:xxx:x::xxx - - [09/Apr/2014:07:43:55 +0000] "GET /apple-touch-icon-precomposed.png HTTP/1.0" 404 162 "-" "-"

Of the above two clicks, only one click was included in the report.

Migrated from matomo-org/matomo#4980

Log analytics: import_logs.py returns Error: Not Found

When I run the script to ingest log files it fails with the following errors:

root@demo:/var/www/piwik/logs# python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://129.123.194.52/piwik --idsite 2 --enable-static --enable-bots ./intra.log
Traceback (most recent call last):
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1760, in <module>
    resolver = config.get_resolver()
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 660, in get_resolver
    return StaticResolver(self.options.site_id)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 1005, in __init__
    'SitesManager.getSiteFromId', idSite=self.site_id
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 987, in call_api
    return cls._call_wrapper(cls._call_api, None, None, method, **kwargs)
  File "/var/www/piwik/misc/log-analytics/import_logs.py", line 976, in _call_wrapper
    raise Piwik.Error(message)
__main__.Error: Not Found

What do I do to fix this?

Migrated from matomo-org/matomo#5059

import_logs.py fail to populate actions/page tables

In Piwik 2.1 with NCSA extended logs (Apache), I'm not able to view page actions; only errors, static files and redirects show up.

Version: 2.1.1b10

python /var/www/html/logimport/misc/log-analytics/import_logs.py --recorders=8 --url=http://localhost/logimport/ /tmp/test30032014.log --login=admin --password=pass --token-auth=xxxxxxx --idsite=1

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log /tmp/test30032014.log...
Purging Piwik archives for dates: 2014-03-30
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: [piwik.org] for more info.

Logs import summary

102 requests imported successfully
107 requests were downloads
398 requests ignored:
0 invalid log lines
11 requests done by bots, search engines, ...
19 HTTP errors
76 HTTP redirects
292 requests to static resources (css, js, ...)
0 requests did not match any known site
0 requests did not match any requested hostname

Website import summary

102 requests imported to 1 sites
1 sites already existed
0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 0 seconds
Requests imported per second: 148.91 requests per second

/usr/bin/php /var/www/html/logimport/console core:archive --url=http://localhost/logimport/ -v

INFO CoreConsole07:32:39 ---------------------------
INFO CoreConsole07:32:39 INIT
INFO CoreConsole07:32:39 Piwik is installed at: http://localhost/logimport/index.php
INFO CoreConsole07:32:39 Running Piwik 2.1.1-b10 as Super User: piwikadmin
INFO CoreConsole07:32:40 ---------------------------
INFO CoreConsole07:32:40 NOTES
INFO CoreConsole07:32:40 - If you execute this script at least once per hour (or more often) in a crontab, you may disable 'Browser trigger archiving' in Piwik UI > Settings > General Settings.
INFO CoreConsole07:32:40 See the doc at: [piwik.org]
INFO CoreConsole07:32:40 - Reports for today will be processed at most every 3600 seconds. You can change this value in Piwik UI > Settings > General Settings.
INFO CoreConsole07:32:40 - Reports for the current week/month/year will be refreshed at most every 3600 seconds.
INFO CoreConsole07:32:40 - Archiving was last executed without error 5 min 1s ago
INFO CoreConsole07:32:40 - Will process 0 websites with new visits since 5 min 0s
INFO CoreConsole07:32:40 - Will process 1 other websites because some old data reports have been invalidated (eg. using the Log Import script) , IDs: 1
INFO CoreConsole06:32:40 ---------------------------
INFO CoreConsole06:32:40 START
INFO CoreConsole06:32:40 Starting Piwik reports archiving...
INFO CoreConsole06:32:41 Archived website id = 1, period = day, Time elapsed: 1.113s
INFO CoreConsole06:32:42 Archived website id = 1, period = week, 29 visits, Time elapsed: 0.943s
INFO CoreConsole06:32:53 Archived website id = 1, period = month, 0 visits, Time elapsed: 10.732s
INFO CoreConsole06:33:00 Archived website id = 1, period = year, 47440 visits, Time elapsed: 7.205s
INFO CoreConsole06:33:00 Archived website id = 1, today = 0 visits, 4 API requests, Time elapsed: 20.004s done
INFO CoreConsole06:33:00 Done archiving!
INFO CoreConsole06:33:00 ---------------------------
INFO CoreConsole06:33:00 SUMMARY
INFO CoreConsole06:33:00 Total daily visits archived: 0
INFO CoreConsole06:33:00 Archived today's reports for 1 websites
INFO CoreConsole06:33:00 Archived week/month/year for 1 websites
INFO CoreConsole06:33:00 Skipped 0 websites: no new visit since the last script execution
INFO CoreConsole06:33:00 Skipped 0 websites day archiving: existing daily reports are less than 3600 seconds old
INFO CoreConsole06:33:00 Skipped 0 websites week/month/year archiving: existing periods reports are less than 3600 seconds old
INFO CoreConsole06:33:00 Total API requests: 4
INFO CoreConsole06:33:00 done: 1/1 100%, 0 v, 1 wtoday, 1 wperiods, 4 req, 20073 ms, no error
INFO CoreConsole06:33:00 Time elapsed: 20.074s
INFO CoreConsole06:33:00 ---------------------------
INFO CoreConsole06:33:00 SCHEDULED TASKS
INFO CoreConsole06:33:00 Starting Scheduled tasks...
INFO CoreConsole06:33:00 No task to run
INFO CoreConsole06:33:00 done
INFO CoreConsole06:33:00 ---------------------------
Keywords: import_logs.py

Migrated from matomo-org/matomo#4946

Query parameters imported incorrectly from IIS 8.5 logs

IIS 8.5 (Windows Server 2012) logs an empty query string as a dash '-', which the Piwik importer picks up, appending '?-' to all URLs. This change to import_logs.py checks for '-' as a query parameter and excludes it.

Around line 1575:
----------------
try:
    hit.query_string = format.get('query_string')
    hit.path = hit.full_path
except BaseFormatException:
    hit.path, _, hit.query_string = hit.full_path.partition(config.options.query_string_delimiter)


Changed to exclude '-' because IIS defaults to '-' if there is no query string.
----------------
try:
    hit.query_string = format.get('query_string')
    if hit.query_string == '-':
        hit.query_string = ''
    hit.path = hit.full_path
except BaseFormatException:
    hit.path, _, hit.query_string = hit.full_path.partition(config.options.query_string_delimiter)

Migrated from matomo-org/matomo#4936

Log Analytics: Track the HTTP request method used for each request (GET or POST)

As a user, when importing server logs in my analytics platform, I want to be able to see:

  • How many GET and POST requests there were
  • Segment traffic to view only GET requests
  • Segment traffic to view only POST requests
    • For example, show me all pages that were POST requests (eg. web forms)

To achieve this, we need to:

  • Log Analytics: parse the HTTP request method
    • Send the HTTP request method as a custom variable, eg. "HTTP-method" = "GET"
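
A hedged sketch of that plumbing (a hypothetical helper, not existing importer code; Matomo's Tracking API does accept page-scope custom variables as a JSON map in the cvar parameter):

import json

def add_http_method_cvar(args, method):
    # cvar maps slot -> [name, value]; slot 1 is an arbitrary choice here
    cvar = json.loads(args.get('cvar', '{}'))
    cvar['1'] = ['HTTP-method', method]
    args['cvar'] = json.dumps(cvar)
    return args

print(add_http_method_cvar({'idsite': '1', 'url': 'http://example.com/form'}, 'POST'))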

Once this is done, as a Piwik user I can:

  • Create a segment to see only GET or POST requests (using segment on this Custom Variable)
  • View the Custom Variables report and see how many GET and POST there were, and for each: number of visits, number of pageviews.

Migrated from matomo-org/matomo#5359

Log analytics list of improvements

In Piwik 1.8 we released the great new feature to import access logs and generate statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!

New features

  • Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:

    • exclude visits with more than 150 actions per visitorID to block crawlers (detected at the python level by counting requests for that IP in the queue)
    • exclude visits that do not have User Agent or beyond the very basic ones used by all bots
    • exclude all requests when one of the first ones is for /robots.txt -- if we see a robots.txt in the middle we could stop tracking subsequent requests
    • check that /index.php?minimize_js=file.js is counted as a static file since it ends in .js

    After that, bot & crawler detection would be much better.

  • Support Accept-Language header and forward to piwik via the &lang= parameter. That might also be useful to some users who need to use this data in a custom plugin.

  • We could make it easy to delete logs for one day, so as to re-import one log file.

  • This would be a new option to the python script. It would reuse the code from the Log Delete feature, but would only delete one day. The python script would call the CoreAdmin API, for example deleting this single day for a given website. This would make it easy to re-import data that didn't work the first time or was bogus.

  • Detect when log-lines are re-imported and only import them once.

    • Implementation: add a new table piwik_log_lines (hash_tracking_request, day)
    • In Piwik Tracker, before looping on the bulk requests, SELECT all the log lines that have already been processed on this day (WHERE hash_tracking_request IN (a,b,c,d) AND day=?) & Skip these requests from import
    • After bulk requests are processed in piwik.php process, INSERT in bulk (hash, day)
  • By default this feature would be enabled only for "Log import" script,

    • via a parameter that we know is the log import (&li=1 /import_logs=1)
    • but may be later useful to all users of Tracking API for general deduping service.
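
A hedged sketch of the line hash this proposal implies (the table and column names are the proposal's; none of this exists yet):

import hashlib

def hash_tracking_request(query_string):
    # one row per (hash_tracking_request, day) in the proposed
    # piwik_log_lines table
    return hashlib.sha1(query_string.encode('utf-8')).hexdigest()

print(hash_tracking_request('idsite=1&url=http%3A%2F%2Fexample.com%2F&cip=1.2.3.4'))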

Performance

How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. When you don't do a dry run, it will insert new pageviews and visits by calling the Piwik API.
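
For example (a hedged invocation; adjust the Piwik URL, site id and log path to your setup):

python /path/to/piwik/misc/log-analytics/import_logs.py --url=http://example/piwik/ --idsite=1 --dry-run access.log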

Other tickets

  • #3867 cannot resume with line number reported by skip for ncsa_extended log format
  • #4045 autodetection hangs on a weird formatted line

Migrated from matomo-org/matomo#3163

Error importing log

111.111.111.111 - - [28/Aug/2014:10:08:33 +0000] "GET /example?action_name=Scroll%2FSolutions&idsite=1&rec=1&r=237194&h=15&m=38&s=32&url=http%3A%2F%2Fwww.exampleanalytics.com%2F&urlref=http%3A%2F%2Fwww.google.com%2Furl%3Fsa%3Dt%26rct%3Dj%26q%3D%26esrc%3Ds%26source%3Dweb%26cd%3D2%26ved%3D0CCYQFjAB%26url%3Dhttp%253A%252F%252Fwww.exampleanalytics.com%252F%26ei%3DMv7-U5z6IsXHuAS2mYHoAQ%26usg%3DAFQjCNFPdvVMNR_LOUYCjRLFfuxQo22Ziw%26sig2%3DoTlVrzpboH139_vEqyqhIQ%26bvm%3Dbv.74035653%2Cd.c2E&_id=66e27316414a3f98&_idts=1409220163&_idvc=1&_idn=0&_refts=1409220163&_viewts=1409220163&_ref=http%3A%2F%2Fwww.google.com%2Furl%3Fsa%3Dt%26rct%3Dj%26q%3D%26esrc%3Ds%26source%3Dweb%26cd%3D2%26ved%3D0CCYQFjAB%26url%3Dhttp%253A%252F%252Fwww.exampleanalytics.com%252F%26ei%3DMv7-U5z6IsXHuAS2mYHoAQ%26usg%3DAFQjCNFPdvVMNR_LOUYCjRLFfuxQo22Ziw%26sig2%3DoTlVrzpboH139_vEqyqhIQ%26bvm%3Dbv.74035653%2Cd.c2E&pdf=1&qt=0&realp=0&wma=0&dir=0&fla=1&java=0&gears=0&ag=0&cookie=1&res=1366x768&gt_ms=601 HTTP/1.1" 204 169 "http://www.exampleanalytics.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"

This line has a corresponding entry in piwik_log_link_visit_action for custom_var_k1, custom_var_v1 as follows:
HTTP-code, 204

In fact, most of the entries are being treated like this. Debug mode says that the line is correctly recognised as ncsa_extended.

Migrated from matomo-org/matomo#6113

Support Import cloudfront logs

Log format documentation:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html

Here's an example of it:

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken
2014-07-21      01:58:45        DFW3    563     180.76.5.149    GET     d116n0k3gjrs63.cloudfront.net   /robots.txt     301     -       Mozilla/5.0%2520(Windows%2520NT%25205.1;%2520rv:6.0.2)%2520Gecko/20100101%2520Firefox/6.0.2     -       -       Redirect        NLmAqKRfyqQreOK6jMmjVhh8vaUV-CbEM7m_Kta_eoZIxl0VWTEmcQ==        www.moneypot.com        http    188     0.000
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken
2014-07-21      01:44:46        AMS50   3920    5.134.58.69     GET     d116n0k3gjrs63.cloudfront.net   /img/icons/chart.png    200     https://www.moneypot.com/       Mozilla/5.0%2520(Windows%2520NT%25205.1;%2520rv:30.0)%2520Gecko/20100101%2520Firefox/30.0       -       -       Hit     j2Rusy95IefqGwjXxxfdEp1r53CLEXf7KlcHBHqyHKHN6GnQSrN8-A==        www.moneypot.com        https   329     0.002
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken
2014-07-21      01:59:06        FRA50   1096    41.227.234.26   GET     d116n0k3gjrs63.cloudfront.net   /css/header.css 200     https://www.moneypot.com/       Mozilla/5.0%2520(Windows%2520NT%25206.2;%2520WOW64;%2520rv:30.0)%2520Gecko/20100101%2520Firefox/30.0    -       -       Miss    0_y6dyIn9nsl1lI1kewczEf8BCktKcQS1hOiuAjiGunQpHnOzpgBfQ==        www.moneypot.com        https   316     0.417
2014-07-21      01:59:06        FRA50   7129    41.227.234.26   GET     d116n0k3gjrs63.cloudfront.net   /img/icons/bitcoin-ic.png       200     https://www.moneypot.com/       Mozilla/5.0%2520(Windows%2520NT%25206.2;%2520WOW64;%2520rv:30.0)%2520Gecko/20100101%2520Firefox/30.0    -       -       Miss    YSRgbarSQU48CjLSnxey1dGb7El85f1z_Ez1MM8fcdEzzBaX_GZS9Q==        www.moneypot.com        https   341     0.415
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken
2014-07-21      01:44:46        AMS50   2878    5.134.58.69     GET     d116n0k3gjrs63.cloudfront.net   /img/icons/secure.png   200     https://www.moneypot.com/       Mozilla/5.0%2520(Windows%2520NT%25205.1;%2520rv:30.0)%2520Gecko/20100101%2520Firefox/30.0       -       -       Hit     kDqn-NGq9YfCnUhK_6tYPVyMScysHrnFfySMMkWvkV43PHpZMX2Xaw==        www.moneypot.com        https   330     0.002

Since it uses the W3C extended format, it is related to #5418.

Cloudfront stores the logs in a directory, full of gzipped log files, e.g.

E2S2NV7MT2UOQA.2014-07-29-14.yEaJrFBy.gz
E2S2NV7MT2UOQA.2014-07-29-14.ZkCwYK8H.gz
E2S2NV7MT2UOQA.2014-07-29-14.ZpGqzm7o.gz

So it would be extra nice if one could just specify the containing directory.

Migrated from matomo-org/matomo#5894

import_logs.py fails when "php" is not available (OSError: [Errno 2] No such file or directory)

For the last 5 days, my Piwik import_logs.py has done nothing.
See the following error.

root@s15879177:~# python /var/www/tools/piwik/misc/log-analytics/import_logs.py --url=https://mywebsite/piwik/ --idsite=1 --show-progress --recorders=4 --enable-reverse-dns --useragent-exclude=oui-moi-cmoi --exclude-path=/robots.txt --exclude-path=/ubuntu/* --exclude-path=/public/ubuntu/* --exclude-path=*.svg.php /var/log/lighttpd/access.log.6.gz
Traceback (most recent call last):
  File "/var/www/tools/piwik/misc/log-analytics/import_logs.py", line 1722, in <module>
    config = Configuration()
  File "/var/www/tools/piwik/misc/log-analytics/import_logs.py", line 570, in __init__
    self._parse_args(self._create_parser())
  File "/var/www/tools/piwik/misc/log-analytics/import_logs.py", line 560, in _parse_args
    self.options.piwik_token_auth = self._get_token_auth()
  File "/var/www/tools/piwik/misc/log-analytics/import_logs.py", line 607, in _get_token_auth
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Migrated from matomo-org/matomo#4835

Log analytics piping in Apache

Trying to get direct piping of log analytics to work on an Apache 2.4.7 / Ubuntu 14.04 system.
From looking at issues matomo-org/matomo#3757, matomo-org/matomo#3163 and matomo-org/matomo#6200 something like this in the Apache conf file should work:

CustomLog "|/usr/bin/python -u /var/www/piwik/misc/log-analytics/import_logs.py --url=http://testsrv/piwik/ - --idsite=1 --log-format-name=common_vhost" vhost_combined

But this doesn't add any data to Piwik (checked in PMA, and accordingly nothing in the web UI). Adding --output=/var/log/piwik/test.log or checking Apache's error.log doesn't provide more clues (I can provoke errors in both files by using incorrect syntax or unknown option names).

The access log is in the vhost_combined format and the following as root does work:

tail -n1 access.log | /usr/bin/python -u /var/www/piwik/misc/log-analytics/import_logs.py --url=http://testsrv/piwik/ - --idsite=1 --log-format-name=common_vhost

It correctly adds the last page hit ("1 requests imported to 1 sites"), but when omitting the last option (--log-format-name=common_vhost) it gives:

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file. 
Maybe try specifying the format with the --log-format-name command line argument.

I've tried various permutations, copying verbatim from misc/log-analytics/README.md and the threads linked above (especially since matomo-org/matomo#3757 concludes that something somehow does work), but can't get a clear handle on what is going wrong. It looks as if it should be working, so any hints or an up-to-date how-to are appreciated.

Migrated from matomo-org/matomo#6405

Make Piping one single line from the access.log into import_logs.py work in apache

Description: piping a single line from the access.log into import_logs.py works, but using the same command directly from Apache, nothing gets logged.

This is about the same issue as reported by ottodude125 (#3163) and elm (#3163).

To reproduce, use the following apache config (within a VirtualHost container):

# Set up your log format as a normal extended format, with hostname at the start
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" piwik_log_format
# Log to a file
CustomLog /var/log/apache2/access_piwik.log piwik_log_format
# Log to piwik
CustomLog "|/usr/bin/python /var/www/foo/piwik/misc/log-analytics/import_logs.py --idsite=1 --recorders=1 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --url=http://foo.bar.org/piwik/ --log-format-name='common_vhost' --output=/tmp/piwik_import.log -dd -" piwik_log_format
# Test log piping
CustomLog "|tee -a /tmp/pipetest.log" piwik_log_format

Both /var/log/apache2/access_piwik.log and /tmp/pipetest.log get populated, so there is no doubt that the import_logs.py script receives standard input continuously. However, /tmp/piwik_import.log only shows activity upon a restart of Apache (as for ottodude125).

Running:

/usr/bin/python /var/www/foo/piwik/misc/log-analytics/import_logs.py --idsite=1 --recorders=1 --enable-http-errors --enable-http-redirects --enable-static --enable-bots --url=http://foo.bar.org/piwik/ --log-format-name='common_vhost' --output=/tmp/piwik_import.log -dd /var/log/apache2/access_piwik.log

... (the same command as in the CustomLog directive, but reading a file instead of stdin) works fine.

Piwik 1.10.1
Debian 6

Migrated from matomo-org/matomo#3757

Map URL Query string parameter to Custom Variables

The goal of this issue is to let users define a set of URL Query string Parameters that they want to store as Custom Variables of scope page in the request.

For example imagine in the log file we have a URL path such as /page?pageTitle=Hello, world&pageId=com.x.y.us.abc&countryCode=us&accountCode=us1234&accountName=Testaccount

User may want to configure:

  • URL parameter pageId, record it in Custom Variable slot 5 with name = pageId, value = com.x.y.us.abc
  • URL parameter countryCode, record it in Custom Variable slot 6, name = countryCode, value = us
  • URL parameter accountName, record it in Custom Variable slot 7, name = accountName, value = Testaccount
  • etc.

Possible solution:

  • This could be done for example with a new parameter to the import_logs.py script (a parsing sketch follows after this list):
    • --query-string-parameters-to-custom-variables="pageId:5,countryCode:6,accountName:7"
  • I'm not sure whether this functionality should be implemented in import_logs.py, or instead as a new custom plugin
    • that way, any website could reuse this Query string -> Custom vars mapping.
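
A minimal sketch of what that flag's plumbing could look like (my own code, not part of import_logs.py; Matomo's tracking API expects page-scope custom variables as a JSON cvar parameter shaped like {"5": ["name", "value"]}):

import json
try:
    from urllib.parse import urlsplit, parse_qs  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs      # Python 2

def parse_mapping(spec):
    # 'pageId:5,countryCode:6' -> {'pageId': 5, 'countryCode': 6}
    return dict((name, int(slot))
                for name, slot in (item.split(':') for item in spec.split(',')))

def build_cvar(url, mapping):
    # Pull the mapped query parameters out of the URL and build the JSON
    # value for the tracker's cvar parameter.
    query = parse_qs(urlsplit(url).query)
    return json.dumps(dict((str(slot), [name, query[name][0]])
                           for name, slot in mapping.items() if name in query))

mapping = parse_mapping('pageId:5,countryCode:6,accountName:7')
print(build_cvar('/page?pageId=com.x.y.us.abc&countryCode=us&accountName=Testaccount',
                 mapping))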

Migrated from matomo-org/matomo#6812

icecast2 <session_time> ignored in piwik session time reports

It would be very nice if the <session_time> field from the icecast2 log import were taken into account when calculating session times. Otherwise I need to manually insert a start GET request <session_time> seconds before the actual log entry, which can cause problems in some cases and is also somewhat annoying. Or is there a setting to activate this behaviour per website, or is it solvable via a plugin?
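
For reference, the manual workaround boils down to shifting the timestamp back by the session duration; a sketch of the arithmetic (the timestamp format string is an assumption):

from datetime import datetime, timedelta

def session_start(end_time_str, session_time_seconds):
    # Compute when a listener session began from the log line's end
    # timestamp and its <session_time> duration field.
    end = datetime.strptime(end_time_str, '%d/%b/%Y:%H:%M:%S')
    return end - timedelta(seconds=session_time_seconds)

print(session_start('17/Oct/2013:00:33:34', 125))  # -> 2013-10-17 00:31:29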

Migrated from matomo-org/matomo#6501

Execute log analytics tests on Travis

See #7059

I think we should run the log importer tests on Travis automatically, even if it takes an extra job. They won't take long to execute. I just executed them locally and got this error:

..Fatal error: cannot automatically determine the log format using the first 100000 lines of the log file. 
Maybe try specifying the format with the --log-format-name command line argument.

We could maybe even move the log importer into another repository. Configure the .travis.yml with sudo: true etc., so execution should be very fast. Maybe we could also run the tests on Windows?

Migrated from matomo-org/matomo#7062

Log import hangs in a non-deterministic way

In my setup I use log import to load data into Piwik. Everything seems to go well except that sometimes (and it doesn't seem to be reproducible) one of the recorder threads hangs and the rest of the threads wait on a futex.

For example - main process:

strace -p 24643

Process 24643 attached - interrupt to quit
select(0, NULL, NULL, NULL, {0, 819826}) = 0 (Timeout)
gettimeofday({1388734405, 12820}, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0}^C <unfinished ...>
Process 24643 detached

Hung thread:

strace -p 24646

Process 24646 attached - interrupt to quit
recvmsg(6, ^C <unfinished ...>
Process 24646 detached

Rest of the threads (all look alike):

strace -p 24645

Process 24645 attached - interrupt to quit
futex(0x7f8028001540, FUTEX_WAIT_PRIVATE, 0, NULL^C <unfinished ...>
Process 24645 detached

I added a thread dump capability to import_logs.py and got the following.
One of the waiting threads (all are identical):

Thread: Thread-13(139637655840512)

File: "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 1167, in _run_bulk
hits = self.queue.get()
File: "/usr/lib64/python2.6/Queue.py", line 168, in get
self.not_empty.wait()
File: "/usr/lib64/python2.6/threading.py", line 239, in wait
waiter.acquire()

Hung thread (I assume, that's the only one different):

Thread: Thread-7(139638058493696)

File: "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 1170, in _run_bulk
self._record_hits(hits)

Main process:

Thread: Thread-1(139638191130368)

File: "/usr/lib64/python2.6/threading.py", line 504, in __bootstrap
self.__bootstrap_inner()
File: "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File: "/usr/lib64/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File: "/SVN/scripts/piwik/import_logs.py", line 820, in _monitor
time.sleep(config.options.show_progress_delay)
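
The thread-dump hook I used was along these lines (a sketch; the SIGUSR1 choice and function name are mine, not part of import_logs.py). After installing it, kill -USR1 <pid> prints every thread's current stack to stderr:

import signal
import sys
import threading
import traceback

def dump_threads(signum, frame):
    # Map thread idents to names so the dump matches threading's view.
    names = dict((t.ident, t.name) for t in threading.enumerate())
    for ident, stack in sys._current_frames().items():
        sys.stderr.write('Thread: %s(%d)\n' % (names.get(ident, '?'), ident))
        traceback.print_stack(stack, file=sys.stderr)

signal.signal(signal.SIGUSR1, dump_threads)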

As far as I know it used to occur in 1.12 (although it used to be less frequent, I think) and occurs in 2.0.2 as well.

My installation runs on CentOS 6.4.

rpm -qa | grep python

python-ethtool-0.6-3.el6.x86_64
python-libs-2.6.6-37.el6_4.x86_64
python-setuptools-0.6.10-3.el6.noarch
python-devel-2.6.6-37.el6_4.x86_64
python-iniparse-0.3.1-2.1.el6.noarch
python-dateutil-1.4.1-6.el6.noarch
python-urlgrabber-3.9.1-8.el6.noarch
python-pycurl-7.19.0-8.el6.x86_64
rpm-python-4.8.0-32.el6.x86_64
python-2.6.6-37.el6_4.x86_64
python-pip-1.3.1-4.el6.noarch
newt-python-0.52.11-3.el6.x86_64
libxml2-python-2.7.6-12.el6_4.1.x86_64
libproxy-python-0.3.0-4.el6_3.x86_64

Keywords: log import

Migrated from matomo-org/matomo#4472

Log Analytics: standard Urchin log format not detected

Hello,
Here is how Urchin tells you how to format your logs for Apache usage in the httpd.conf file:

LogFormat "%h %v %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\""

Here is the problem

  1. It doesn't show any referral data
  2. It doesn't show the User Agent (Visitor Browser is Unknown)

Attached is a sample log
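
Until the format is detected out of the box, a custom regex passed via --log-format-regex should cope with it. A minimal sketch verifying such a pattern (the sample line is invented, and the group names mirror those used by the script's built-in formats):

import re

URCHIN_RE = re.compile(
    r'(?P<ip>\S+) (?P<host>\S+) \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?) \S+" (?P<status>\d+) (?P<length>\S+) '
    r'"(?P<referrer>.*?)" "(?P<user_agent>.*?)" ".*?"')

line = ('1.2.3.4 www.example.com - [17/Oct/2013:00:33:34 -0400] '
        '"GET /index.html HTTP/1.1" 200 1234 '
        '"http://referrer.example/" "Mozilla/5.0" "SESSION=abc"')
print(URCHIN_RE.match(line).groupdict())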

Migrated from matomo-org/matomo#4619

Log Analytics: on IIS Error: Unauthorized

There was a long thread in the forums where the reported issue is:

E:\Piwik\piwik\misc\log-analytics>python import_logs.py --url=http://arf E:\logfiles2\Intranet\ARF_net_iis05\W3SVC769991110\ex140118.log
Fatal error: Unauthorized

At least three users are discussing this issue. The solution is in this post:

Solution to Unauthorized error

Enable Anonymous authentication in your IIS ... edit it and choose 'Application Pool' instead of a specific user. If this works, you can then change the Anonymous authentication to the user of your choice.

PS. You can also enable anonymous authentication only to your piwik virtual directory.


Maybe we could detect this error and display the solution to the user, or create a FAQ?
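
A sketch of the kind of detection suggested above (my own code): catch an HTTP 401 from the tracker URL and point the user at the fix instead of dying with a bare 'Unauthorized':

try:
    from urllib.request import urlopen       # Python 3
    from urllib.error import HTTPError
except ImportError:
    from urllib2 import urlopen, HTTPError   # Python 2

def check_piwik_url(url):
    try:
        urlopen(url, timeout=10)
    except HTTPError as e:
        if e.code == 401:
            raise SystemExit(
                'Unauthorized (HTTP 401): on IIS, enable Anonymous '
                'authentication for the Piwik directory and set it to use '
                'the Application Pool identity.')
        raise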

Migrated from matomo-org/matomo#6286

Log Analytics: Support Godaddy style log files

Good afternoon everyone,

I had difficulty importing log files from GoDaddy into Piwik (URLs below replaced for privacy), using

piwik/htdocs/misc/log-analytics/import_logs.py --idsite=1 --url=piwikurl --enable-http-errors --enable-http-redirects --enable-static -d /home/bitnami/logfile.log

If the GET field looks like "GET www.site.org/index.htm" the import fails and produces 'Page URL not defined.'
If it looks like "GET /index.htm" or "GET http://www.site.org/index.htm" the import is successful.

I believe the problem is occurring in the log import tool at piwik/htdocs/misc/log-analytics/import_logs.py and not on the Piwik PHP side.

Printing the data of the JSON sent to the server when importing the hits shows that successful imports have 'http://' in the URL provided to Piwik.

Tested with a Vagrant/Puppet install provided at http://piwik.org/blog/2012/08/get-started-with-piwik-development-with-puppet-and-vagrant/ (v 2.0.3)
and with the Piwik install provided by Bitnami (v 2.0.2)

Both produced working Piwik installations. I configured the sites in settings to accept site.org and www.site.org.

Workaround:
If the host is specified in the log file, add 'http://' or remove the host, e.g. sed -i 's/GET site.org/GET /g' logfile.log
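
In code, the workaround amounts to normalising the request target before it is turned into a page URL; a heuristic sketch (my own, not the importer's actual behaviour):

def normalize_request_path(path):
    # If the request target starts with a bare hostname
    # (e.g. 'www.site.org/index.htm'), prepend 'http://' so downstream
    # URL parsing sees an absolute URL.
    if not path.startswith('/') and not path.startswith('http'):
        return 'http://' + path
    return path

print(normalize_request_path('www.site.org/index.htm'))  # -> http://www.site.org/index.htm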

Migrated from matomo-org/matomo#4526

When importing logs, if --url= is set to an HTTP URL and Piwik is forced to use SSL, then importing logs fails

When importing logs in Piwik and when Piwik is configured to use SSL by default, if you specify --url=http://piwik.example.org then Piwik will fail with this error:

2014-11-20 11:40:10,643: [DEBUG] Error when connecting to Piwik: <urlopen error Piwik returned an invalid response: <!DOCTYPE html>

The solution is to set the --url parameter to https://.... when importing logs.

To fix this issue, we could maybe detect the URL redirect from HTTP to HTTPS and make the log importer use HTTPS automatically.
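
A sketch of what that detection could look like (my own code; the URL is a placeholder): request the configured --url once and adopt whatever scheme the server redirects to:

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

def resolve_piwik_url(url):
    # urlopen() follows redirects; geturl() reflects the final URL, so an
    # HTTP -> HTTPS redirect is visible here.
    response = urlopen(url, timeout=10)
    return response.geturl()

print(resolve_piwik_url('http://piwik.example.org/'))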

Migrated from matomo-org/matomo#6699

All visitor IP addresses and providers are the same

I installed Piwik 2.10 on a RedHat server and copied the IIS logs and Apache access logs onto it. Both the IIS logs and the Apache access logs were imported locally into the database successfully, and the visitors' information is displayed normally.
I then installed Python 2.7.5 and copied import_logs.py onto another Windows 2008 R2 server, and remotely imported the IIS logs from the Windows server into the Piwik instance on the RedHat server. But all the visitors' IP addresses are the same as that of the Windows server, and the providers are all the same too.
I have tried to solve the problem but failed.
Please give me advice.
Thanks in advance.

Migrated from matomo-org/matomo#7059

--> Please see solution in this FAQ: https://matomo.org/faq/troubleshooting/faq_17710/
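
For context: Matomo only honours an overridden visitor IP (the cip tracking parameter) when the request carries a valid token_auth; without it, every imported hit is attributed to the importing machine's IP, which matches the symptom above. A minimal sketch of such a tracking request (the piwik.php endpoint and parameter names come from the Matomo Tracking HTTP API; the host and token values are placeholders):

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

params = {
    'idsite': 1,
    'rec': 1,
    'url': 'http://www.example.com/index.html',
    'cip': '203.0.113.7',             # the real visitor IP from the log line
    'token_auth': 'YOUR_TOKEN_AUTH',  # placeholder; required for cip
}
print('http://your-matomo-host/piwik.php?' + urlencode(params))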
