Comments (5)
A long time ago I edited import_logs.py so that it would import Apache log entries only if they contained a custom parameter. Specifically, my client used basic HTTP authentication and wanted to analyze only hits that came from his site's paying users.
In addition, I captured the value of that custom parameter and passed it to Piwik as a custom variable.
I am no longer working on that project, but when I saw that this issue was moved I realized I had never shared this with anyone here. If anyone is interested in my edits please let me know.
from matomo-log-analytics.
Hello, fogelfish. I definitely think you should share this code with us. It does not make sense that custom variables should not be usable through the log analysis tool. It would be incredibly powerful. Thanks, David
Hi guys
FYI we added some new parameters to the import_logs.py script, which are:
--regex-group-to-visit-cvar=REGEX_GROUP_TO_VISIT_CVAR
    Track an attribute through a custom variable with visit scope instead of through Piwik's normal approach. For example, to track usernames as a custom variable instead of through the uid tracking parameter, supply --regex-group-to-visit-cvar="userid=User Name". This will track usernames in a custom variable named 'User Name'. See documentation for --log-format-regex for list of available regex groups.
--regex-group-to-page-cvar=REGEX_GROUP_TO_PAGE_CVAR
    Track an attribute through a custom variable with page scope instead of through Piwik's normal approach. For example, to track usernames as a custom variable instead of through the uid tracking parameter, supply --regex-group-to-page-cvar="userid=User Name". This will track usernames in a custom variable named 'User Name'. See documentation for --log-format-regex for list of available regex groups.
Maybe you will find these useful
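To make the "group=name" syntax of these two options concrete, here is a minimal sketch of how such a value could be split into a regex group name and a custom variable name. The helper parse_cvar_mapping is hypothetical and only illustrates the idea; it is not the code that import_logs.py actually uses.

```python
def parse_cvar_mapping(values):
    """Turn ['userid=User Name', ...] into {'userid': 'User Name', ...}.

    Hypothetical helper: each value is expected to look like GROUP=NAME,
    where GROUP is a regex group and NAME is the custom variable name.
    """
    mapping = {}
    for value in values:
        group, sep, name = value.partition('=')
        group, name = group.strip(), name.strip()
        if not sep or not group or not name:
            raise ValueError("expected GROUP=NAME, got %r" % value)
        mapping[group] = name
    return mapping


if __name__ == '__main__':
    print(parse_cvar_mapping(['userid=User Name']))
```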
For what it's worth, I'll post here only the edits I made to import_logs.py. I was new to Python and new to Piwik, so those who are maintaining this project can evaluate the worth of my edits. Others may find them useful on an ad-hoc basis.
Here is a sample of a log line I had to deal with:
xxx.xxx.xxx.xxx - - [25/May/2014:00:00:03 -0700] "GET /subjects/astronomy/books/marsearly/ HTTP/1.1" 200 20311 "http://members.obscureddomain.com/subjects/astronomy/books/" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" i=lauribarron64126:6b65bd4db199bebe99da1c24b64610c9; t=lauribarron64126:6b65bd4db199bebe99da1c24b64610c9
My client wanted to import and count only lines where the t parameter was non-trivial.
In the code segments below I use ellipses (…) to show where there is intervening code that I did not touch.
First, I added a couple of custom regex formats. The extended format does the heavy lifting. It took quite a bit of work to figure out how to write the greedy match with a negative lookahead for t=, which makes the pattern settle on the last t= in the line. It helped immensely to use RegexBuddy.
_CUSTOM_BASE_LOG_FORMAT = (
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"(?P<request>.+) (?P<path>.*?) (?P<protocol>.*?)" (?P<status>\S+) (?P<length>\S+) '
    r'"(?P<referrer>.*)" "(?P<user_agent>.*)"'
)
# The i= parameter and everything after it is optional.
# The objective is to let the regex match lines that do not have the i= and t= custom params.
_CUSTOM_EXTENDED_LOG_FORMAT = (
    r'[\s]*(?:.*(i=(?P<person>[\w]*)))*(:)*(?:[\w]*)(; )*((?:.*t=(?!.*t=)(?P<memberid>[\w]*)):(?:.*$))*'
)
FORMATS = {
    ...
    'custom': RegexFormat('custom', _CUSTOM_BASE_LOG_FORMAT),
    'custom_extended': RegexFormat('custom_extended', _CUSTOM_BASE_LOG_FORMAT + _CUSTOM_EXTENDED_LOG_FORMAT),
}
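For anyone who wants to try the two patterns outside of import_logs.py, here is a small standalone sketch that compiles them with the re module and applies them to the sample line from this comment. It assumes the base pattern has an opening quote before the referrer group, and it uses re.match directly rather than the script's RegexFormat class. The second log line (PLAIN) is made up for illustration.

```python
import re

# Base pattern as posted, assuming an opening quote before the referrer group.
BASE = (
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"(?P<request>.+) (?P<path>.*?) (?P<protocol>.*?)" (?P<status>\S+) (?P<length>\S+) '
    r'"(?P<referrer>.*)" "(?P<user_agent>.*)"'
)
# Optional trailer holding the i= and t= custom parameters.
EXTENDED = (
    r'[\s]*(?:.*(i=(?P<person>[\w]*)))*(:)*(?:[\w]*)(; )*'
    r'((?:.*t=(?!.*t=)(?P<memberid>[\w]*)):(?:.*$))*'
)
PATTERN = re.compile(BASE + EXTENDED)

SAMPLE = (
    'xxx.xxx.xxx.xxx - - [25/May/2014:00:00:03 -0700] '
    '"GET /subjects/astronomy/books/marsearly/ HTTP/1.1" 200 20311 '
    '"http://members.obscureddomain.com/subjects/astronomy/books/" '
    '"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/35.0.1916.114 Safari/537.36" '
    'i=lauribarron64126:6b65bd4db199bebe99da1c24b64610c9; '
    't=lauribarron64126:6b65bd4db199bebe99da1c24b64610c9'
)
# Made-up line without the custom parameters.
PLAIN = '1.2.3.4 - - [25/May/2014:00:00:03 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "curl/7.0"'


def extract(line):
    """Return the named groups for a log line, or None if it does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == '__main__':
    for line in (SAMPLE, PLAIN):
        groups = extract(line)
        print(groups['path'], groups['person'], groups['memberid'])
```

Because every part of the extended pattern is optional, a line without i= and t= still matches; its person and memberid groups simply come back as None, and it is up to a later check to skip the line.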
Most of the following sections are either self-explanatory or I added a few comments:
class Configuration(object):
    ...
    def _create_parser(self):
        ...
        # edited: my additions to command line params
        option_parser.add_option(
            '--require-custom-log-params', dest='require_custom_log_params', default=False, action='store_true',
            help="If set, it will skip the log line unless all user-named custom url params are present"
        )
        option_parser.add_option(
            '--custom-log-param', dest='custom_log_params', action='append', default=[],
            help="If set, it will create a Piwik custom variable from a user-named custom url param. "
                 "Can be specified multiple times."
        )

    def _parse_args(self, option_parser):
        ...
        # edited: convert the collection of named custom url parameters to lower case
        self.options.custom_log_params = [s.lower() for s in self.options.custom_log_params]
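Here is a standalone sketch of how these two options behave, using a fresh optparse parser instead of the script's Configuration class. The function names build_parser and parse_command_line are made up for the example.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser holding only the two options added in this edit."""
    parser = OptionParser()
    parser.add_option(
        '--require-custom-log-params', dest='require_custom_log_params',
        default=False, action='store_true',
        help="Skip the log line unless all named custom params are present",
    )
    parser.add_option(
        '--custom-log-param', dest='custom_log_params', action='append', default=[],
        help="Create a custom variable from a named custom param. "
             "Can be specified multiple times.",
    )
    return parser


def parse_command_line(argv):
    """Parse argv and lower-case the collected custom param names."""
    options, args = build_parser().parse_args(argv)
    options.custom_log_params = [s.lower() for s in options.custom_log_params]
    return options, args


if __name__ == '__main__':
    opts, rest = parse_command_line(
        ['--custom-log-param', 'MemberID', '--custom-log-param', 'person',
         '--require-custom-log-params', 'access.log'])
    print(opts.custom_log_params, opts.require_custom_log_params, rest)
```

Lower-casing the names matters because they are later used as regex group names, so --custom-log-param MemberID ends up looking for the group memberid.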
class Statistics(object):
    ...
    def __init__(self):
        ...
        # edited: my addition to stats
        self.count_lines_skipped_required_custom_url_param = self.Counter()
Logs import summary
-------------------
...
    %(total_lines_ignored)d requests ignored:
        # edited: my addition to stats
        %(count_lines_skipped_required_custom_url_param)d invalid log lines (missing custom param)
…
            'total_lines_ignored': sum([
                ...
                # edited: my addition to stats
                self.count_lines_skipped_required_custom_url_param.value,
            ]),
            ...
            'count_lines_skipped_required_custom_url_param': self.count_lines_skipped_required_custom_url_param.value,
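To show how the new counter feeds the summary, here is a minimal self-contained sketch. The Counter class is a stand-in written for this example (the script has its own), and the template contains only an illustrative subset of the summary lines.

```python
class Counter(object):
    """Minimal stand-in for the counter objects used by the statistics class."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1


# Illustrative subset of the summary template, not the full text of the script.
SUMMARY_TEMPLATE = (
    "%(total_lines_ignored)d requests ignored:\n"
    "    %(count_lines_invalid)d invalid log lines\n"
    "    %(count_lines_skipped_required_custom_url_param)d invalid log lines (missing custom param)\n"
)


def render_summary(invalid, skipped):
    """Fill the template from two counters; the total is the sum of both."""
    return SUMMARY_TEMPLATE % {
        'total_lines_ignored': sum([invalid.value, skipped.value]),
        'count_lines_invalid': invalid.value,
        'count_lines_skipped_required_custom_url_param': skipped.value,
    }


if __name__ == '__main__':
    invalid, skipped = Counter(), Counter()
    invalid.increment()
    skipped.increment()
    skipped.increment()
    print(render_summary(invalid, skipped))
```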
class Recorder(object):
    ...
    def _get_hit_args(self, hit):
        """
        Returns the args used in tracking a hit, without the token_auth.
        """
        # edited: made _cvar a list to permit the addition of more than one cvar;
        # c is the running custom variable slot number
        c = 0
        _cvar = []
        ...
        # edited these routines to accommodate the dynamic cvar
        if config.options.enable_bots:
            c += 1
            args['bots'] = '1'
            if hit.is_robot:
                cvar = '"%d":["Bot","%s"]' % (c, hit.user_agent)
            else:
                cvar = '"%d":["Not-Bot","%s"]' % (c, hit.user_agent)
            _cvar.append(cvar)
        # edited to assign each broken-out custom log param to a Piwik custom variable
        if hit.has_required_custom_log_params:
            for k, v in hit.custom_log_param.items():
                c += 1
                cvar = '"%d":["%s","%s"]' % (c, k, v)
                _cvar.append(cvar)
        # edited to accommodate dynamic cvar
        if 'cvar' not in args:
            c += 1
            cvar = '"%d":["HTTP-code","%s"]' % (c, hit.status)
            _cvar.append(cvar)
        ...
        # edited: join individual cvar entries and format as JSON
        if len(_cvar) > 0:
            args['cvar'] = '{' + ', '.join(_cvar) + '}'
        return args
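One thing to watch with the string formatting above: a user agent or parameter value containing a double quote would produce invalid JSON. As an alternative, here is a sketch that builds the same slot-numbered structure with json.dumps. The function build_cvar is hypothetical and is not part of import_logs.py.

```python
import json


def build_cvar(pairs):
    """Build a cvar JSON string from a list of (name, value) pairs.

    Slots are numbered from 1 in the order given, for example
    {"1": ["Not-Bot", "..."], "2": ["HTTP-code", "200"]}.
    json.dumps takes care of escaping quotes and backslashes in the values.
    """
    slots = {}
    for index, (name, value) in enumerate(pairs, start=1):
        slots[str(index)] = [name, value]
    return json.dumps(slots)


if __name__ == '__main__':
    print(build_cvar([('Not-Bot', 'Mozilla "quoted" agent'), ('HTTP-code', '200')]))
```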
class Parser(object):
    ...
    # edited: force this check to occur after the user_agent check,
    # because methods appear to be invoked alphabetically.
    # Do not count downloads in log lines that are skipped.
    # I don't know why this method was getting invoked twice in the same hit
    # (maybe a problem in my regex),
    # so I hacked in an 'is_static' attribute
    # and now count_lines_downloads gets incremented only once.
    def check_vdownload(self, hit):
        extension = hit.path.rsplit('.')[-1].lower()
        if extension in DOWNLOAD_EXTENSIONS and not hit.is_download:
            stats.count_lines_downloads.increment()
            hit.is_download = True
        return True
    ...
    # edited: break out custom log params from their own collection
    # into separate hit attributes.
    # Keep track of different kinds of failures or capture value on success.
    # This method name is prefixed with z to force it to be done last.
    def check_zcustom_log_param(self, hit):
        for k in config.options.custom_log_params:
            # true case: it exists in the log
            if k in hit.custom_log_param:
                # true case: it has a value
                if hit.custom_log_param[k]:
                    setattr(hit, k, hit.custom_log_param[k])
                else:
                    if config.options.require_custom_log_params:
                        stats.count_lines_skipped_required_custom_url_param.increment()
                        return False
            else:
                if config.options.require_custom_log_params:
                    stats.count_lines_skipped_required_custom_url_param.increment()
                    return False
        if config.options.require_custom_log_params:
            hit.has_required_custom_log_params = True
        return True
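The alphabetical ordering I relied on can be reproduced in isolation. The sketch below assumes the check_* methods are discovered by reflection; inspect.getmembers returns members sorted by name, which would explain why the v and z prefixes control the order. DemoParser and its dict-based hits are made up for this example and are much simpler than the real Parser and Hit classes.

```python
import inspect


class DemoParser(object):
    """Toy parser whose check_* methods are found and run in name order."""

    def __init__(self, required_params):
        self.required_params = required_params
        self.skipped = 0

    def check_vdownload(self, hit):
        # Hypothetical download test: only .pdf counts here.
        hit['is_download'] = hit['path'].lower().endswith('.pdf')
        return True

    def check_zcustom_log_param(self, hit):
        # Reject the hit if any required param is missing or empty.
        for name in self.required_params:
            if not hit['custom'].get(name):
                self.skipped += 1
                return False
        return True

    def ordered_checks(self):
        # getmembers returns (name, value) pairs sorted by name.
        members = inspect.getmembers(self, predicate=inspect.ismethod)
        return [name for name, _ in members if name.startswith('check_')]

    def accept(self, hit):
        # all() stops at the first check that returns False.
        return all(getattr(self, name)(hit) for name in self.ordered_checks())


if __name__ == '__main__':
    parser = DemoParser(['memberid'])
    print(parser.ordered_checks())
    print(parser.accept({'path': '/a.pdf', 'custom': {'memberid': 'lauribarron64126'}}))
    print(parser.accept({'path': '/b.html', 'custom': {'memberid': ''}}))
```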
    def parse(self, filename):
        ...
        hits = []
        for lineno, line in enumerate(file):
            ...
            # edited: relocated the regex match so it is done after other detections
            match = format.match(line)
            if not match:
                if not config.options.require_custom_log_params:
                    invalid_line(line, 'line did not match')
                else:
                    stats.count_lines_skipped_required_custom_url_param.increment()
                    if config.options.debug >= 2:
                        logging.debug('Invalid line detected (%s): %s' % ('missing custom log param', line))
                continue
            # edited: added the hit attribute has_required_custom_log_params
            hit = Hit(
                filename=filename,
                lineno=lineno,
                status=format.get('status'),
                full_path=format.get('path'),
                is_download=False,
                is_static=False,
                is_robot=False,
                is_error=False,
                is_redirect=False,
                has_required_custom_log_params=False,
                args={},
            )
            ...
            # edited: collect custom log param values
            if len(config.options.custom_log_params) > 0:
                hit.custom_log_param = {}
                for param in config.options.custom_log_params:
                    hit.custom_log_param[param] = format.get(param)
            ...
One workaround is to first filter the log files with sed or grep so that only the wanted lines remain, and then import the result.
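The same pre-filtering idea can be sketched in Python for those who prefer it to sed or grep. The pattern below assumes the custom parameter looks like t=name:hash as in the sample line posted earlier in this thread; the function names are made up for the example.

```python
import re

# A t= parameter with a non-empty name followed by a colon, not preceded
# by another word character (so "format=x:" does not count).
T_PARAM = re.compile(r'(?<!\w)t=\w+:')


def has_member_param(line):
    """Return True if the log line carries a non-empty t= parameter."""
    return T_PARAM.search(line) is not None


def filter_lines(lines):
    """Keep only the lines that should be handed to the importer."""
    return [line for line in lines if has_member_param(line)]


if __name__ == '__main__':
    sample = [
        '1.2.3.4 - - [25/May/2014:00:00:03 -0700] "GET /a HTTP/1.1" 200 10 "-" "ua" i=bob:abc; t=bob:abc',
        '1.2.3.4 - - [25/May/2014:00:00:04 -0700] "GET /b HTTP/1.1" 200 10 "-" "ua"',
    ]
    for line in filter_lines(sample):
        print(line)
```

The filtered lines can then be written to a new file and imported with the stock script and a standard log format.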