
Comments (5)

fogelfish commented on May 19, 2024

A long time ago I edited import_logs.py so that it imported Apache log entries only if they contained a custom parameter, which was my original objective. Specifically, my client used basic HTTP authentication and wanted to analyze only hits that came from his site's paying users.

In addition I captured the value of that custom parameter and gave it to Piwik as a custom variable.

I am no longer working on that project, but when I saw that this issue was moved I realized I had never shared this with anyone here. If anyone is interested in my edits please let me know.

from matomo-log-analytics.

davidfkane commented on May 19, 2024

Hello, fogelfish. I definitely think you should share this code with us. It does not make sense that custom variables cannot be used through the log analysis tool. It would be incredibly powerful. Thanks, David

mattab commented on May 19, 2024

Hi guys

FYI we added some new parameters to the import_logs.py script, which are:


  --regex-group-to-visit-cvar=REGEX_GROUP_TO_VISIT_CVAR
                        Track an attribute through a custom variable with
                        visit scope instead of through Piwik's normal
                        approach. For example, to track usernames as a custom
                        variable instead of through the uid tracking
                        parameter, supply --regex-group-to-visit-
                        cvar="userid=User Name". This will track usernames in
                        a custom variable named 'User Name'. See documentation
                        for --log-format-regex for list of available regex
                        groups.
  --regex-group-to-page-cvar=REGEX_GROUP_TO_PAGE_CVAR
                        Track an attribute through a custom variable with page
                        scope instead of through Piwik's normal approach. For
                        example, to track usernames as a custom variable
                        instead of through the uid tracking parameter, supply
                        --regex-group-to-page-cvar="userid=User Name". This
                        will track usernames in a custom variable named 'User
                        Name'. See documentation for --log-format-regex for
                        list of available regex groups.

Maybe you will find these useful.
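
Both options take a value of the form "group=Custom Variable Name", as in "userid=User Name". As a rough illustration of that format only, here is how such a value could be split in Python. The helper name parse_cvar_spec is hypothetical and this is not the script's actual parsing code:

```python
def parse_cvar_spec(spec):
    """Split 'group=Custom Variable Name' into (regex_group, cvar_name).

    Hypothetical helper: it only illustrates the option's value format.
    """
    group, sep, name = spec.partition('=')
    if not sep or not group or not name:
        raise ValueError('expected "group=name", got %r' % spec)
    return group, name


if __name__ == '__main__':
    print(parse_cvar_spec('userid=User Name'))
```

The first element names a regex group from the log format, the second is the label the custom variable gets in Piwik.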

fogelfish commented on May 19, 2024

For what it's worth, I'll post here only the edits I made to import_logs.py. I was new to Python and new to Piwik, so those who are maintaining this project can evaluate the worth of my edits. Others may find them useful on an ad-hoc basis.

Here is a sample of a log line I had to deal with:

xxx.xxx.xxx.xxx - - [25/May/2014:00:00:03 -0700] "GET /subjects/astronomy/books/marsearly/ HTTP/1.1" 200 20311 "http://members.obscureddomain.com/subjects/astronomy/books/" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36" i=lauribarron64126:6b65bd4db199bebe99da1c24b64610c9; t=lauribarron64126:6b65bd4db199bebe99da1c24b64610c9

My client wanted to import and count only lines where the t parameter was non-trivial.

In the code segments below I use ellipses (…) to show where there is intervening code that I did not touch.

First, I added a couple of custom regex formats. The extended format does the heavy lifting. It took quite a bit of work to figure out how to write the greedy negative-lookahead with backreference for t=. It helped immensely to use RegexBuddy.

_CUSTOM_BASE_LOG_FORMAT = (
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"(?P<request>.+) (?P<path>.*?) (?P<protocol>.*?)" (?P<status>\S+) (?P<length>\S+) '
    # referrer and user agent are both double-quoted fields
    r'"(?P<referrer>.*)" "(?P<user_agent>.*)"'
)

# The i= parameter and everything after it is optional.
# The objective is to let the regex match lines that have no i= and t= custom params.
_CUSTOM_EXTENDED_LOG_FORMAT = (
    r'[\s]*(?:.*(i=(?P<person>[\w]*)))*(:)*(?:[\w]*)(; )*((?:.*t=(?!.*t=)(?P<memberid>[\w]*)):(?:.*$))*'
)

FORMATS = {
    ...
    'custom': RegexFormat('custom', _CUSTOM_BASE_LOG_FORMAT),
    'custom_extended': RegexFormat('custom_extended', _CUSTOM_BASE_LOG_FORMAT + _CUSTOM_EXTENDED_LOG_FORMAT),
}
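
The negative lookahead `(?!.*t=)` in the extended format is what selects the last `t=` on a line. Below is a minimal standalone sketch of just that piece, using a simplified pattern rather than the full format. The names LAST_T and member_id are illustrative and not part of import_logs.py:

```python
import re

# Simplified, hypothetical pattern: match a "t=" that is NOT followed by
# another "t=" later on the same line, i.e. the last occurrence.
LAST_T = re.compile(r't=(?!.*t=)(?P<memberid>\w*)')


def member_id(line):
    """Return the word after the last 't=' on the line, or None if absent or empty."""
    match = LAST_T.search(line)
    if match and match.group('memberid'):
        return match.group('memberid')
    return None


if __name__ == '__main__':
    sample = '"GET /books/?t=1 HTTP/1.1" 200 20311 i=alice1:abc; t=alice1:abc'
    print(member_id(sample))
```

Note that `t=` is matched as a plain substring, so a query string such as `format=xml` would also count as an occurrence. A stricter pattern would anchor on a delimiter before the `t`.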

Most of the following sections are either self-explanatory or carry a few comments that I added:

class Configuration(object):
    ...
    def _create_parser(self):
        ...
        # edited: my additions to command line params
        option_parser.add_option(
            '--require-custom-log-params', dest='require_custom_log_params', default=False, action='store_true',
            help="If set, it will skip the log line unless all user-named custom url params are present"
        )
        option_parser.add_option(
            '--custom-log-param', dest='custom_log_params', action='append', default=[],
            help="If set, it will create a Piwik custom variable from a user-named custom url param. "
                 "Can be specified multiple times."
        )

    def _parse_args(self, option_parser):
        ...
        # edited: convert the collection of named custom url parameters to lower case
        self.options.custom_log_params = [s.lower() for s in self.options.custom_log_params]
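
To see how the two options behave together, here is a small self-contained sketch that registers only these two options on a fresh optparse parser and applies the same lower-casing. The functions build_parser and parse are illustrative names, not functions from import_logs.py:

```python
from optparse import OptionParser


def build_parser():
    """Build a tiny parser holding only the two custom options."""
    parser = OptionParser()
    parser.add_option(
        '--require-custom-log-params', dest='require_custom_log_params',
        default=False, action='store_true',
    )
    parser.add_option(
        '--custom-log-param', dest='custom_log_params',
        action='append', default=[],
    )
    return parser


def parse(argv):
    """Parse argv and lower-case the collected parameter names."""
    options, _args = build_parser().parse_args(argv)
    options.custom_log_params = [s.lower() for s in options.custom_log_params]
    return options


if __name__ == '__main__':
    opts = parse(['--custom-log-param', 'MemberID', '--require-custom-log-params'])
    print(opts.custom_log_params, opts.require_custom_log_params)
```

Lower-casing matters because the names are later used as keys that must match the regex group names, which are lower case in the custom formats.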

class Statistics(object):
    ...
    def __init__(self):
        ...
        # edited: my addition to stats
        self.count_lines_skipped_required_custom_url_param = self.Counter()

Logs import summary
-------------------

...
    %(total_lines_ignored)d requests ignored:
        # edited: my addition to stats
        %(count_lines_skipped_required_custom_url_param)d invalid log lines (missing custom param)
...
    'total_lines_ignored': sum([
        ...
        # edited: my addition to stats
        self.count_lines_skipped_required_custom_url_param.value,
    ]),
...
    'count_lines_skipped_required_custom_url_param': self.count_lines_skipped_required_custom_url_param.value,

class Recorder(object):
...
    def _get_hit_args(self, hit):
        """
        Returns the args used in tracking a hit, without the token_auth.
        """

        # edited: made _cvar an array to permit the addition of more than one cvar
        c = 0
        _cvar = []

...
        # edited these routines to accommodate the dynamic cvar
        if config.options.enable_bots:
            c += 1
            args['bots'] = '1'
            if hit.is_robot:
                cvar = '"%d":["Bot","%s"]' % (c, hit.user_agent)
            else:
                cvar = '"%d":["Not-Bot","%s"]' % (c, hit.user_agent)
            _cvar.append(cvar)

        # edited to assign each broken-out custom log param to a Piwik custom variable
        if hit.has_required_custom_log_params:
            for k, v in hit.custom_log_param.items():
                c += 1
                cvar = '"%d":["%s","%s"]' % (c, k, v)
                _cvar.append(cvar)

        # edited to accommodate dynamic cvar
        if 'cvar' not in args:
            c += 1
            cvar = '"%d":["HTTP-code","%s"]' % (c, hit.status)
            _cvar.append(cvar)

...
        # edited: join individual cvar entries and format as JSON
        if len(_cvar) > 0:
            args['cvar'] = '{' + ', '.join(_cvar) + '}'
        return args
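
The string formatting above yields valid JSON only as long as no name or value contains a double quote or a backslash, which user agent strings occasionally do. One possible alternative, sketched here under that assumption, is to build a dictionary and let json.dumps handle the escaping. The helper build_cvar is illustrative and not part of import_logs.py:

```python
import json


def build_cvar(pairs):
    """Serialize (name, value) pairs as a JSON object keyed by slot number.

    Slots are numbered from 1 in the order the pairs are given.
    """
    cvar = {str(index): [name, value]
            for index, (name, value) in enumerate(pairs, start=1)}
    return json.dumps(cvar)


if __name__ == '__main__':
    pairs = [
        ('Not-Bot', 'Mozilla/5.0 "quoted" agent'),
        ('memberid', 'alice1'),
        ('HTTP-code', '200'),
    ]
    print(build_cvar(pairs))
```

Unlike the code above, this sketch returns '{}' for an empty list, so a caller would still need the length check before setting args['cvar'].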

class Parser(object):
    ...
    # edited: force this check to occur after the user_agent check
    # because methods appear to be invoked alphabetically.
    # Do not count downloads in log lines that are skipped.
    # I don't know why this method was getting invoked twice in the same hit,
    # (maybe a problem in my regex)
    # so I hacked in an 'is_static' attribute
    # and now count_lines_downloads gets incremented only once.
    def check_vdownload(self, hit):
        extension = hit.path.rsplit('.')[-1].lower()
        if extension in DOWNLOAD_EXTENSIONS and not hit.is_download:
            stats.count_lines_downloads.increment()
            hit.is_download = True
        return True
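
The renaming trick relies on the check_ methods being discovered in name order. A short sketch of why that would happen, assuming the script collects them by introspection with something like inspect.getmembers (an assumption about the implementation, not something shown in this thread). DemoParser and check_method_names are illustrative names:

```python
import inspect


class DemoParser(object):
    """Stand-in for Parser; only the method names matter here."""

    def check_zcustom_log_param(self, hit):
        return True

    def check_vdownload(self, hit):
        return True

    def check_user_agent(self, hit):
        return True


def check_method_names(parser):
    """List the check_ method names in the order introspection returns them."""
    # inspect.getmembers returns (name, value) pairs sorted by name
    members = inspect.getmembers(parser, predicate=inspect.ismethod)
    return [name for name, _ in members if name.startswith('check_')]


if __name__ == '__main__':
    print(check_method_names(DemoParser()))
```

The definition order in the class does not matter; the names come back sorted, so a prefix such as v or z pushes a check later in the sequence.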

...
    # edited: break out custom log params from their own collection
    # into separate hit attributes.
    # Keep track of different kinds of failures or capture value on success.
    # This method name is prepended with z to force it to be done last.
    def check_zcustom_log_param(self, hit):
        for k in config.options.custom_log_params:
            # true case: it exists in the log
            if k in hit.custom_log_param:
                # true case: it has a value
                if hit.custom_log_param[k]:
                    setattr(hit, k, hit.custom_log_param[k])
                else:
                    if config.options.require_custom_log_params:
                        stats.count_lines_skipped_required_custom_url_param.increment()
                        return False
            else:
                if config.options.require_custom_log_params:
                    stats.count_lines_skipped_required_custom_url_param.increment()
                    return False
        if config.options.require_custom_log_params:
            hit.has_required_custom_log_params = True
        return True
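
Stripped of the statistics and the hit attributes, the decision made by check_zcustom_log_param can be condensed as below. This is a simplified restatement for illustration only; missing_params and should_import are hypothetical names:

```python
def missing_params(required, captured):
    """Return the required names that are absent or empty in `captured`."""
    return [name for name in required if not captured.get(name)]


def should_import(required, captured, require_all):
    """Skip a line only when require_all is set and a required param is missing."""
    return not (require_all and missing_params(required, captured))


if __name__ == '__main__':
    print(should_import(['memberid'], {'memberid': 'alice1'}, True))
    print(should_import(['memberid'], {'memberid': None}, True))
```

Here `captured` plays the role of hit.custom_log_param and `require_all` the role of --require-custom-log-params.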

    def parse(self, filename):
        ...
        hits = []
        for lineno, line in enumerate(file):
            ...
            # edited: relocated the regex match so it is done after other detections
            match = format.match(line)
            if not match:
                if not config.options.require_custom_log_params:
                    invalid_line(line, 'line did not match')
                else:
                    stats.count_lines_skipped_required_custom_url_param.increment()
                    if config.options.debug >= 2:
                        logging.debug('Invalid line detected (%s): %s' % ('missing custom log param', line))
                continue

            # edited: added the hit attribute has_required_custom_log_params
            hit = Hit(
                filename=filename,
                lineno=lineno,
                status=format.get('status'),
                full_path=format.get('path'),
                is_download=False,
                is_static=False,
                is_robot=False,
                is_error=False,
                is_redirect=False,
                has_required_custom_log_params=False,
                args={},
            )

            ...
            # edited: collect custom log param values
            if len(config.options.custom_log_params) > 0:
                hit.custom_log_param = {}
                for param in config.options.custom_log_params:
                    hit.custom_log_param[param] = format.get(param)
            ...

mattab commented on May 19, 2024

One workaround is to first filter the log files with sed or grep, and then import the filtered files.
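
The same pre-filtering can be sketched in Python for readers who prefer it to sed or grep. This assumes the log layout from the sample line earlier in the thread, where the member id follows the last `t=`. The pattern and the function names are illustrative:

```python
import re

# Hypothetical filter: keep lines whose last "t=" is followed by a non-empty word
HAS_MEMBER = re.compile(r't=(?!.*t=)\w+')


def filter_log(lines):
    """Yield only the log lines that carry a non-empty t= parameter."""
    for line in lines:
        if HAS_MEMBER.search(line):
            yield line


def filter_file(src_path, dst_path):
    """Copy the matching lines of src_path to dst_path; return how many were kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in filter_log(src):
            dst.write(line)
            kept += 1
    return kept
```

The filtered file can then be passed to import_logs.py unchanged, without any edits to the script itself.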