
cyrus-and / chrome-har-capturer

527 stars · 29 watchers · 90 forks · 417 KB

Capture HAR files from a Chrome instance

License: MIT License

JavaScript 100.00%
chrome-debugging-protocol headless browser google-chrome har http-archive automation

chrome-har-capturer's People

Contributors

calvinnwq, cyrus-and, ddseo, dependabot[bot], gdavidkov, paulirish


chrome-har-capturer's Issues

Capturing Har with content

It looks like I should be able to capture a HAR with content in the current build, the same way I might use Tools -> Developer Tools -> Save as HAR with content. Is this correct?
However, my recorded HAR files do not have a "text" field in the entries.response structure, as they do when I manually save as HAR with content. (Note: it is not a blank field; it is completely missing.)

Previously I had Node.js v6.9.4 (for Windows) installed, but updating to the latest version (v6.10.0) did not fix the issue.
I run chrome-har-capturer as follows from the (cmd) terminal:

start chrome --remote-debugging-port=9222 --enable-benchmarking --enable-net-benchmarking
chrome-har-capturer -d 10000 -o test.har http://www.google.com

Thanks!
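
For reference, the option list quoted further down this page includes a -c/--content flag ("also capture the requests body"), so the "text" field is likely only populated when that flag is passed; a hedged example:

chrome-har-capturer -c -d 10000 -o test.har http://www.google.com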

No Cookie HTTP Header

Hi,
I can't see any cookie-related headers in HTTP requests/responses (for example, Set-Cookie in responses).
In older versions of chrome-har-capturer I could.

Is the cache cleaned for each visit?

Hi!

I have a quick question.
If I run chrome-har-capturer multiple times without quitting Chrome between one visit and the next, is the cache/profile cleaned?
In other words, does each visit use a clean Chrome profile?

Thank you,
Martino
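
For what it's worth, the verbose trace further down this page shows the capturer sending Network.setCacheDisabled on each run, which suggests the cache is disabled per load, while the profile stays whatever Chrome was started with; a minimal sketch of the same call via chrome-remote-interface:

const CDP = require('chrome-remote-interface');

async function disableCache() {
    const client = await CDP();
    // the same command visible in the verbose trace further down this page
    await client.Network.setCacheDisabled({cacheDisabled: true});
    await client.close();
}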

Multiple URLs - Stop working

I have noticed that when I supply multiple URLs, it keeps breaking after opening a couple of them. The behavior does not depend on the particular URL, and it works fine with a single URL; it is just that whenever I give a list of URLs, it opens some of them and then breaks. Have you faced such an issue while loading several URLs?

Moreover, sometimes it even gets stuck on one URL and will not move on to the next, which might be due to the dependency on the onLoad event?

Reimplementation in python

Is it possible to rewrite the module in Python? Or are there features of the Chrome API that cannot be triggered from Python?
I would be thankful for any thoughts on that.

Never able to connect to Chrome

I always keep getting this:
Unable to connect to Chrome on xx.xx.xx.xx:9222

Although both curl and the browser can reach the 'Inspectable pages' dashboard. I tried specifying the host and port, changing the port, and accessing it from the same machine and from a remote one (through an SSH tunnel), all with no luck. Any idea what's going wrong?
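
For reference, the capturer connects through the /json endpoint (see the verbose trace further down this page), so a quick sanity check against the exact host and port being passed to the tool might help narrow this down:

curl http://xx.xx.xx.xx:9222/json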

Error while running Capturer

When I run
google-chrome --remote-debugging-port=9222 --enable-benchmarking --enable-net-benchmarking
A new window opens, but in the terminal I see this:

[16210:16210:0315/175447.814565:ERROR:child_thread_impl.cc(762)] Request for unknown Channel-associated interface: ui::mojom::GpuMain

In another terminal, when I run
chrome-har-capturer -o out.har https://github.com
I see this error:

/usr/local/lib/node_modules/chrome-har-capturer/node_modules/chrome-remote-interface/node_modules/ws/lib/PerMessageDeflate.js:8
const TRAILER = Buffer.from([0x00, 0x00, 0xff, 0xff]);
                       ^

TypeError: this is not a typed array.
    at Function.from (native)
    at Object.<anonymous> (/usr/local/lib/node_modules/chrome-har-capturer/node_modules/chrome-remote-interface/node_modules/ws/lib/PerMessageDeflate.js:8:24)
    at Module._compile (module.js:410:26)
    at Object.Module._extensions..js (module.js:417:10)
    at Module.load (module.js:344:32)
    at Function.Module._load (module.js:301:12)
    at Module.require (module.js:354:17)
    at require (internal/module.js:12:17)
    at Object.<anonymous> (/usr/local/lib/node_modules/chrome-har-capturer/node_modules/chrome-remote-interface/node_modules/ws/lib/WebSocket.js:16:27)
    at Module._compile (module.js:410:26)

I have Node.js v4.2.6 and Chrome version 57.0.2987.98 (64-bit).

Please help me get this project running on my system.

post request not supported

Hi,

  1. I want to know whether chrome-har-capturer supports HTTP POST requests and GET requests with headers, query params, etc., and if so, how to do that.
  2. How can the multiple HARs generated for multiple tasks be aggregated into a single HAR file? (A merge sketch follows below.)

Please reply ASAP.

Thanks.
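
On the second question, nothing on this page suggests a built-in aggregation feature; a minimal merge sketch (a hypothetical helper, not part of this module) could concatenate the HAR logs:

// merge several HAR objects into one by concatenating pages and entries
// (note: colliding page ids such as "page_1" would still need renaming)
function mergeHars(hars) {
    var merged = JSON.parse(JSON.stringify(hars[0])); // keep creator/version of the first
    merged.log.pages = [].concat.apply([], hars.map(function (har) { return har.log.pages; }));
    merged.log.entries = [].concat.apply([], hars.map(function (har) { return har.log.entries; }));
    return merged;
}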

Getting negative bodySize with HTTP/2

An example:

      {
        "pageref": "page_1",
        "startedDateTime": "2016-08-03T23:29:39.184Z",
        "time": 31.01,
        "request": {
          "method": "GET",
          "url": "https://cm.g.doubleclick.net/pixel?google_nid=pm",
          "httpVersion": "h2",
          "cookies": [],
          "headers": [
            {
              "name": ":path",
              "value": "/pixel?google_nid=pm"
            },
            {
              "name": "accept-encoding",
              "value": "gzip, deflate, sdch, br"
            },
            {
              "name": "accept-language",
              "value": "en-US,en;q=0.8"
            },
            {
              "name": "accept",
              "value": "image/webp,image/*,*/*;q=0.8"
            },
            {
              "name": "referer",
              "value": "https://pagead2.googlesyndication.com/pagead/s/cookie_push.html"
            },
            {
              "name": ":authority",
              "value": "cm.g.doubleclick.net"
            },
            {
              "name": "cookie",
              "value": "id=2287200a0d0900a3||t=1465252078|et=730|cs=002213fd48378deb6e813ec5c5; IDE=AHWqTUmk4A5brTpedwGU18rGYS-57sIoKs8PZmiQWj0TqdtVyef4IO-27w; DSID=NO_DATA; __gads=ID=ae5b57ae0914d2d1:T=1470266547:S=ALNI_Mbk4NQhcGaF0ZsM2lTacYmz1kchhg"
            },
            {
              "name": ":scheme",
              "value": "https"
            },
            {
              "name": ":method",
              "value": "GET"
            }
          ],
          "queryString": [
            {
              "name": "google_nid",
              "value": "pm"
            }
          ],
          "headersSize": 760,
          "bodySize": -1
        },
        "response": {
          "status": 200,
          "statusText": "",
          "httpVersion": "h2",
          "cookies": [],
          "headers": [
            {
              "name": "pragma",
              "value": "no-cache"
            },
            {
              "name": "date",
              "value": "Wed, 03 Aug 2016 23:29:39 GMT"
            },
            {
              "name": "server",
              "value": "HTTP server (unknown)"
            },
            {
              "name": "content-type",
              "value": "image/png"
            },
            {
              "name": "status",
              "value": "200"
            },
            {
              "name": "cache-control",
              "value": "no-cache, must-revalidate"
            },
            {
              "name": "content-length",
              "value": "170"
            },
            {
              "name": "alt-svc",
              "value": "quic=\":443\"; ma=2592000; v=\"36,35,34,33,32,31,30\""
            },
            {
              "name": "alternate-protocol",
              "value": "443:quic"
            },
            {
              "name": "x-xss-protection",
              "value": "1; mode=block"
            },
            {
              "name": "expires",
              "value": "Fri, 01 Jan 1990 00:00:00 GMT"
            }
          ],
          "redirectURL": "https://image6.pubmatic.com/AdServer/UCookieSetPug?oid=1&rd=https://cm.g.doubleclick.net/pixel?google_nid=pm&google_sc=1&google_hm=",
          "headersSize": 367,
          "bodySize": -115,
          "_transferSize": 252,
          "content": {
            "size": 170,
            "mimeType": "image/png",
            "compression": 285
          }
        },
        "cache": {},
        "timings": {
          "blocked": 0.53,
          "dns": -1,
          "connect": -1,
          "send": 0.137,
          "wait": 30.007,
          "receive": 0.336,
          "ssl": -1
        },
        "serverIPAddress": "216.58.192.2",
        "connection": "1655",
        "_initiator": {
          "type": "parser",
          "url": "https://pagead2.googlesyndication.com/pagead/s/cookie_push.html",
          "lineNumber": 1
        }
      },

Headless Mode

Why does this module only work in headless mode?
I am also using a few Chrome extensions to collect some data with the remote debugging protocol. Is there a way to make it work with a non-headless Chrome instance?

All logged timings are incorrect, and some other data too.

Just compare a HAR exported by Chrome with a HAR exported by chrome-har-capturer and you will notice huge differences. The following diff illustrates this (some parts omitted for brevity). One issue I can directly point to is where you add the SSL/TLS time to the total time. This is wrong because, per the HAR specification, the connect time already includes the SSL/TLS time. There are many more differences between the files.

I am afraid that any logs created with this tool are incorrect, and you should warn your users about this until the issue is fixed.

--- chrome-capturer.har Wed May 27 10:25:13 2015
+++ chrome.har  Wed May 27 10:25:07 2015
@@ -1,21 +1,21 @@
 […]
-"time": 243.0489062498805,
+"time": 64.56995010375977,
 "request": {
     "method": "GET",
     "url": "https://apis.google.com/_/scs/abc-static/_/js/k=gapi.gapi.en.dPxK-DAj_pE.O/m=gapi_iframes,googleapis_client,plusone/rt=j/sv=1/d=1/ed=1/am=AAQ/rs=AItRSTN0fuoBkyFaoHWfzWWLct0BxZgQSQ/cb=gapi.loaded_0",
-    "httpVersion": "HTTP/1.1",
+    "httpVersion": "unknown",
 […]
 "response": {
     "status": 200,
     "statusText": "OK",
-    "httpVersion": "HTTP/1.1",
+    "httpVersion": "unknown",
 […]
 "timings": {
-    "blocked": -1,
-    "dns": 0.654000000054114,
-    "connect": 53.30499999990936,
-    "send": 1.0790000000042994,
-    "wait": 112.32899999993191,
-    "receive": 36.00390625,
-    "ssl": 39.677999999980806
+    "blocked": 0.974000000041997,
+    "dns": -1,
+    "connect": -1,
+    "send": 0.570000000038813,
+    "wait": 30.76099999998409,
+    "receive": 32.264950103694865,
+    "ssl": -1
 […]
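
For reference, per the HAR 1.2 spec the entry total is the sum of the timing phases, with ssl already counted inside connect rather than added on top; a hedged sketch of the correct total:

// HAR 1.2: entry.time is the sum of the phases, skipping -1 ("not available")
// values; ssl is a portion of connect and must NOT be added separately
function entryTime(timings) {
    return ['blocked', 'dns', 'connect', 'send', 'wait', 'receive']
        .map(function (phase) { return Math.max(timings[phase], 0); })
        .reduce(function (sum, ms) { return sum + ms; }, 0);
}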

Capturing of HAR for youtube videos from chrome at the end of duration

Hi all,
is there a way to capture the HAR for a YouTube video at the end of its duration? With the following code, the HAR is captured at the start of the video. What modifications are needed if I don't know the duration in advance?

from browsermobproxy import Server
from selenium import webdriver

server = Server('path to browsermob-proxy')
server.start()
proxy = server.create_proxy()

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server={0}".format(proxy.proxy))
browser = webdriver.Chrome(executable_path="/home/sivakesava/chromedriver",
                           chrome_options=chrome_options)

proxy.new_har("CS")  # start recording a new HAR named "CS"
browser.get("https://www.youtube.com/watch?v=NFQeItqTfZk")
har = proxy.har  # read the HAR collected so far
print(har)

browser.quit()
server.stop()

Weird behavior when passing invalid URLs

It seems that Chrome aborts a request after an invalid one.

First invalid request http://asd:

{
    "method": "Network.requestWillBeSent",
    "params": {
        "requestId": "9.1",
        "frameId": "9.2",
        "loaderId": "9.1",
        "documentURL": "http://asd/",
        "request": {
            "url": "http://asd/",
            "method": "GET",
            "headers": {
                "Pragma": "no-cache",
                "User-Agent": [...],
                "Accept": [...],
                "Cache-Control": "no-cache"
            }
        },
        "timestamp": 1355765429.950094,
        "initiator": {
            "type": "other"
        }
    }
}

Loading failed notification:

{
    "method": "Network.loadingFailed",
    "params": {
        "requestId": "9.1",
        "timestamp": 1355765429.989708,
        "errorText": ""
    }
}

Next valid request http://flickr.com:

{
    "method": "Network.requestWillBeSent",
    "params": {
        "requestId": "9.3",
        "frameId": "9.2",
        "loaderId": "9.4",
        "documentURL": "http://flickr.com/",
        "request": {
            "url": "http://flickr.com/",
            "method": "GET",
            "headers": {
                "Pragma": "no-cache",
                "User-Agent": [...],
                "Accept": [...],
                "Cache-Control": "no-cache"
            }
        },
        "timestamp": 1355765430.012913,
        "initiator": {
            "type": "other"
        }
    }
}

Request cancelled by Chrome:

{
    "method": "Network.loadingFailed",
    "params": {
        "requestId": "9.3",
        "timestamp": 1355765430.319519,
        "errorText": "",
        "canceled": true
    }
}

Position of elements on loaded Pages in the web browser?

Hi,
any ideas on how we could also log the position and pixel information of elements on a web page loaded in the browser? For example, inspecting the elements' CSS and logging the layout information along with the corresponding timing events of each request and response. Any help or clues in this regard?

Strange page termination behavior

Hi there,

I am trying to figure out how I can compare these results.
For a random URL:
- http://www.qobuz.com/info/MAGAZINE-ACTUALITES/HI-FI-NEWS/Application-Qobuz-pour-le-media173258
one HAR, from Firefox/Firebug, gives me (in HAR Viewer):
- 92 requests / 1.8 MB / (0 from cache) / 6.08 s / (onload: 2.35 s)
while the other, from Chrome/har-capturer (also in HAR Viewer), gives:
- 47 requests / 1.1 MB / (0 from cache) / 2.17 s / (onload: 2.25 s)

From this simple test, on what basis can I choose a standard/correct behavior between chrome-har-capturer and Firebug?

Let me know if I can help in some way to sort it out.

Regards,

Regis A. Despres

DNS flush only works after the second attempt

Example:

$ node bin/cli.js http://example.com 2> /dev/null | grep '"dns": '
                    "dns": 0,
$ node bin/cli.js http://example.com 2> /dev/null | grep '"dns": '
                    "dns": 28.9919999995618,
$ node bin/cli.js http://example.com 2> /dev/null | grep '"dns": '
                    "dns": 28.6919999998645,
$ node bin/cli.js http://example.com 2> /dev/null | grep '"dns": '
                    "dns": 28.49299999797947,

bodySize from getHeaderValue is returning the header key instead of the value

In the generated HAR, some entries have their request bodySize returned as the string "content-length".

{
    "pageref": "page_1",
    "startedDateTime": "2017-08-24T01:27:11.769Z",
    "time": 149.27100000204518,
    "request": {
        "method": "POST",
        "url": "...",
        "httpVersion": "h2",
        "cookies": [],
        "headers": [
            ...
            {
                "name": "content-length",
                "value": "311"
            },
            ...
        ],
        "queryString": [],
        "headersSize": -1,
        "bodySize": "content-length"
    },
    ...
}

This is because getHeaderValue returns the matching key and not its value:

function getHeaderValue(headers, name, fallback) {
    const pattern = new RegExp(`^${name}$`, 'i');
    const value = Object.keys(headers).find((name) => {
        return name.match(pattern);
    });
    return value === undefined ? fallback : value;
}
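
A fix along the lines the reporter describes might look like this (a sketch, not the project's actual patch):

function getHeaderValue(headers, name, fallback) {
    const pattern = new RegExp(`^${name}$`, 'i');
    // find the matching key, then return its value instead of the key itself
    const key = Object.keys(headers).find((key) => key.match(pattern));
    return key === undefined ? fallback : headers[key];
}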

Requests after onLoad are not present in HAR

I haven't checked what happens for events that start before onLoad and complete after, but events that start and finish after onLoad are certainly missing from the generated HAR file.

To check for completeness you'll probably need a timeout period in which you watch for network requests; if none occur within that period, decide the page is complete. A minimal sketch of such a watchdog follows.
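
A minimal idle-watchdog sketch (assuming a chrome-remote-interface client object, which emits protocol events by name):

// declare the page complete once no request has started for idleMs milliseconds
function waitForNetworkIdle(client, idleMs, done) {
    var timer = setTimeout(done, idleMs);
    client.on('Network.requestWillBeSent', function () {
        clearTimeout(timer);               // a new request arrived...
        timer = setTimeout(done, idleMs);  // ...restart the countdown
    });
}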

There is no content in HAR file

When we use the Network tool we can save the HAR file with content as well (that saves the data for each request, with the encoding scheme too). I think adding that feature would make this tool more useful.

Random _transferSize values

I have a demo site running HTTP/2 here, and it just downloads 50 JavaScript objects, all of the same size.

When I open Chrome manually and grab the HAR, I always get a _transferSize of 20250 for every one of them.

When I use chrome-har-capturer:

var chc = require('chrome-har-capturer');
var c = chc.load(['https://http2optimization.com/W_2_2_a/']);
c.on('end', function (har) {
    console.log(har);
});

not only do I get a _transferSize of 20027, but sometimes I get random lower values, ranging from 0 to 20000, for 1 or 2 of the 50 files.

I also notice that the files getting a randomly lower _transferSize have much quicker timing values as well.

Also: I am running Chrome headless with the --incognito flag, so there is no way it should be caching any of this either.

Cannot read property 'clearCache' of undefined

I run Chrome with: google-chrome --user-data-dir=/tmp --remote-debugging-port=9222

The /tmp dir has 777 permissions, but running your demo prints this error:

// Cannot connect to Chrome [Error: Cannot inject JavaScript:
 {
    "result": {
        "type": "object",
        "objectId": "{\"injectedScriptId\":1,\"id\":1}",
        "subtype": "error",
        "className": "TypeError",
        "description": "TypeError: Cannot read property 'clearCache' of undefined\n    at <anonymous>:1:20\n    at Object.InjectedScript._evaluateOn (<anonymous>:875:140)\n    at Object.InjectedScript._evaluateAndWrap (<anonymous>:808:34)\n    at Object.InjectedScript.evaluate (<anonymous>:664:21)"
    },
    "wasThrown": true,
    "exceptionDetails": {
        "text": "Uncaught TypeError: Cannot read property 'clearCache' of undefined",
        "url": "",
        "line": 1,
        "column": 19,
        "scriptId": "46"
    }
}]

I don't know why!

Wrong bodySize (and feature request: add _transferSize)

I'm test-running the timing-fix branch code and found there are still a few fields missing from chrome-har-capturer compared to the real Chrome DevTools. One is request headers: I can only see Referer and User-Agent, while the real Chrome DevTools gives all the headers sent, in full, like Host, Accept-Encoding, and others. I haven't looked at the Chrome remote debugging protocol, so I'm not sure whether chrome-remote-interface ever provides those, or whether it can be done in chrome-har-capturer.

          "headers": [
            {
              "name": "Referer",
              "value": "http://www.podcastone.com/widgetrecent?progID=877"
            },
            {
              "name": "User-Agent",
              "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2716.0 Safari/537.36"
            }
          ],

Another field I found in a real Chrome DevTools HAR is _transferSize: in transfer-encoding: chunked mode the response is (a length + a chunk) + (length + chunk) + ... + 0, so the real transferred bytes on the wire can be slightly larger than bodySize. I'm not sure which HAR viewer ever makes use of _transferSize, but it can be used in real bandwidth calculation, and it's missing from chrome-har-capturer.

          "content": {
            "size": 118465,
            "mimeType": "text/html",
            "compression": 96200
          },
          "redirectURL": "",
          "headersSize": 662,
          "bodySize": 22265,
          "_transferSize": 22927

cli.js error

Hi,

chrome-har-capturer doesn't seem to work on my freshly installed Ubuntu 16.04.3.

  • I installed Node.js 6.11.3, which came with npm.
  • I installed this package with the following command:
    sudo npm install -g https://github.com/cyrus-and/chrome-har-capturer

But launching it generates an error in cli.js that I can't figure out...

gautier@gautier-Ubuntu:~$ chrome-har-capturer -o test.har "https://google.fr"
/usr/lib/node_modules/chrome-har-capturer/bin/cli.js:45
async function preHook(url, client) {
      ^^^^^^^^

SyntaxError: Unexpected token function
    at createScript (vm.js:56:10)
    at Object.runInThisContext (vm.js:97:10)
    at Module._compile (module.js:542:28)
    at Object.Module._extensions..js (module.js:579:10)
    at Module.load (module.js:487:32)
    at tryModuleLoad (module.js:446:12)
    at Function.Module._load (module.js:438:3)
    at Module.runMain (module.js:604:10)
    at run (bootstrap_node.js:389:7)
    at startup (bootstrap_node.js:149:9)

Does someone know where I went wrong? Did I miss something?

Many thanks.

BR,
Gautier

Unable to run chrome-har-capturer

I installed with the command:

sudo npm install -g chrome-har-capturer

but when I run:

google-chrome --remote-debugging-port=9222 --enable-benchmarking --enable-net-benchmarking
chrome-har-capturer -o out.har https://github.com

I get this in the terminal, and it stops:

/usr/bin/env: 'node': No such file or directory

Please suggest what has to be done.

Unable to capture video requests in HAR, compared to the Chrome Network tab HAR

I'm trying to write a performance and networking QA script that captures the HAR/network traffic of a given URL. For this I'm using the chrome-har-capturer module on Node.js.

The issue is that if there is a video on a page, it does not show up in the HAR; because of that it is impossible to get the actual weight of the page, which is what the script is for.

I created a sample page with just one video playing. If I check the Network tab in Chrome, it shows me 4 requests and a total page size of 3.2 MB.

[screenshot: Chrome Network tab]

If I use the chrome-har-capturer module and run the script, the HAR that I get is totally different.

[screenshot: chrome-har-capturer result]

Even if I use the timeout argument or content:true, this does not work. The script:

var fs = require('fs');
var harCapturing = require('chrome-har-capturer');

function getData(link) {
    return new Promise(function (res, rej) {
        // run() expects an array of URLs
        harCapturing.run([link], {timeout: 15000}).on('har', function (har) {
            fs.writeFileSync('/Users/harisrizwan/Desktop/out.har', JSON.stringify(har));
            fs.writeFileSync('out.json', JSON.stringify(har, null, 2));
            var filteredLogs = har.log.entries;
            console.log(filteredLogs);
            res(filteredLogs);
        });
    });
}

Can't see requests whose response is redirect

Hi,
thank you for the very useful script.
I have a little problem.
I noticed that only requests whose response is 200 OK are saved in the HAR file.
In particular, I cannot see requests that lead to redirects; I only see the second request, for the page resulting from the redirect.
Is there a way to change this behavior?
Thank You

where or how to use remote debugging host

Although this claims to support a host other than localhost, how does it support that? From the Chrome command line I only see google-chrome --remote-debugging-port=9222 to specify a port number, but I don't see how to specify a different listening host; Chrome always listens on localhost:9222 only, so the remote debugging protocol never goes beyond 127.0.0.1.

  -t, --host <host>    Remote Debugging Protocol host
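
Chrome does indeed bind the DevTools port to 127.0.0.1 by default, so the -t/--host option is mostly useful together with some kind of port forwarding; one common hedged workaround is an SSH tunnel (host names here are placeholders):

# forward the remote machine's debugger port to a local one, then point the
# capturer at the local end
ssh -N -L 9222:localhost:9222 user@remote-host &
chrome-har-capturer -t localhost -p 9222 -o out.har https://example.com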

Compatibility of version 0.9.5 with headless chrome

I wanted to know whether chrome-har-capturer version 0.9.5 is compatible with headless Chrome; I am using this version in my project. If there is a way to use it with headless Chrome, please guide me.

onContentLoad > onLoad

Maybe related to #15?

Example:

$ chrome-har-capturer http://h2.svager.cz -c
DONE http://h2.svager.cz
{
    "log": {
        "version": "1.2",
        "creator": {
            "name": "Chrome HAR Capturer",
            "version": "0.9.1"
        },
        "pages": [
            {
                "startedDateTime": "2016-11-18T12:45:56.160Z",
                "id": "0",
                "title": "http://h2.svager.cz",
                "pageTimings": {
                    "onContentLoad": -28.232999980449677,
                    "onLoad": 125.15599995851517
                }
            }
        ],
        "entries": [
            {
                "pageref": "0",
                "startedDateTime": "2016-11-18T12:45:56.157Z",
                "time": 3.1769999768584967,
                "request": {
                    "method": "GET",
                    "url": "http://h2.svager.cz/",
                    "httpVersion": "http/1.1",
                    "cookies": [],
                    "headers": [
                        {
                            "name": "Upgrade-Insecure-Requests",
                            "value": "1"
                        },
                        {
                            "name": "User-Agent",
                            "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome Safari/537.36"
                        }
                    ],
                    "queryString": [],
                    "headersSize": 180,
                    "bodySize": -1
                },
                "response": {
                    "status": 307,
                    "statusText": "Internal Redirect",
                    "httpVersion": "http/1.1",
                    "cookies": [],
                    "headers": [
                        {
                            "name": "Location",
                            "value": "https://h2.svager.cz/"
                        },
                        {
                            "name": "Non-Authoritative-Reason",
                            "value": "HSTS"
                        }
                    ],
                    "redirectURL": "https://h2.svager.cz/",
                    "headersSize": 99,
                    "bodySize": 0,
                    "_transferSize": 99,
                    "content": {
                        "size": 0,
                        "mimeType": "",
                        "compression": 0
                    }
                },
                "cache": {},
                "timings": {
                    "blocked": 0.0330000184476376,
                    "dns": -1,
                    "connect": -1,
                    "send": 0,
                    "wait": 0,
                    "receive": 3.143999958410859,
                    "ssl": -1
                },
                "connection": "0",
                "_initiator": {
                    "type": "other"
                },
                "_priority": "VeryHigh"
            },
            {
                "pageref": "0",
                "startedDateTime": "2016-11-18T12:45:56.160Z",
                "time": 94.04499997617677,
                "request": {
                    "method": "GET",
                    "url": "https://h2.svager.cz/",
                    "httpVersion": "spdy",
                    "cookies": [],
                    "headers": [
                        {
                            "name": "User-Agent",
                            "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome Safari/537.36"
                        }
                    ],
                    "queryString": [],
                    "headersSize": -1,
                    "bodySize": -1
                },
                "response": {
                    "status": 200,
                    "statusText": "",
                    "httpVersion": "spdy",
                    "cookies": [],
                    "headers": [
                        {
                            "name": "date",
                            "value": "Fri, 18 Nov 2016 12:45:56 GMT"
                        },
                        {
                            "name": "server",
                            "value": "nginx"
                        },
                        {
                            "name": "strict-transport-security",
                            "value": "max-age=31536000; includeSubdomains;"
                        },
                        {
                            "name": "content-type",
                            "value": "text/plain; charset=utf-8"
                        },
                        {
                            "name": "status",
                            "value": "200"
                        },
                        {
                            "name": "cache-control",
                            "value": "max-age=3600"
                        },
                        {
                            "name": "content-length",
                            "value": "17"
                        },
                        {
                            "name": "expires",
                            "value": "Fri, 18 Nov 2016 13:45:56 GMT"
                        }
                    ],
                    "headersSize": -1,
                    "bodySize": -1,
                    "_transferSize": 26,
                    "content": {
                        "size": 17,
                        "mimeType": "text/plain",
                        "text": "\nHello HTTP/2.0!\n"
                    }
                },
                "cache": {},
                "timings": {
                    "blocked": 0.981999968644232,
                    "dns": -1,
                    "connect": -1,
                    "send": 0.900000042747708,
                    "wait": 90.78099997714166,
                    "receive": 1.3819999876431694,
                    "ssl": -1
                },
                "serverIPAddress": "46.28.109.208",
                "connection": "4751",
                "_initiator": {
                    "type": "other"
                },
                "_priority": "VeryHigh"
            }
        ]
    }
}

Note: the reported HTTP version "spdy" is probably a bug in headless Chrome: https://groups.google.com/a/chromium.org/forum/#!topic/headless-dev/lysNMNgqFrI

Fail to load single-resource URLs

Single-resource pages (i.e. just an HTML page: no images, no external CSS... data URIs are resources too) fail to load, since in this scenario Chrome emits Page.loadEventFired before the Network.loadingFinished of the page itself, and this (incorrectly) results in a pending request.

Reproduce with:

chrome-har-capturer -v http://example.com -o out.har

that gives:

# Connected to Chrome: http://localhost:9222/json
# Connected to WebSocket: ws://localhost:9222/devtools/page/22_1
# --> Page.enable
# --> Network.enable
# --> Network.setCacheDisabled {"cacheDisabled":true}
# --> Page.navigate {"url":"http://example.com"}
# <-- [80.456] Network.requestWillBeSent
# <-- [80.456] Network.requestWillBeSent
# <-- [80.456] Network.requestWillBeSent
# <-- [80.456] Network.responseReceived
# Unhandled message: Page.frameNavigated
# <-- [80.456] Network.dataReceived
# <-- Page.loadEventFired: http://example.com
FAIL http://example.com
# Emitting 'end' event

startedDateTime ignores queuing

I just compared HAR files generated by the Chrome Network tab and by the capturer. The capturer seems to return an incorrect startedDateTime, missing the offset caused by queuing, which can be seen in the Chrome Network tab and in HAR files saved from Chrome. Here are two HAR files viewed with HAR Viewer; the same page was loaded in both cases.

[screenshot: HAR saved by Chrome]

[screenshot: HAR saved by the capturer]

Therefore I think the Network.requestWillBeSent wallTime value from the DevTools protocol, which is used for startedDateTime (see the code), does not actually mean the request will be sent at that point. I also tried the timestamp value to calculate startedDateTime; this produces the same problem.

Could startedDateTime instead be calculated inversely, from Network.responseReceived minus the entry.time value, and then converted back to wall time? A sketch of that idea follows.
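
A hedged sketch of that inverse calculation (names hypothetical): DevTools timestamps are monotonic seconds, while wallTime on Network.requestWillBeSent is epoch seconds, so one known (timestamp, wallTime) pair converts between the two clocks.

function startedDateTime(responseTimestamp, entryTimeMs, refTimestamp, refWallTime) {
    // walk back from the response event by the measured total entry time
    var startTimestamp = responseTimestamp - entryTimeMs / 1000; // monotonic seconds
    // shift onto the wall clock using the known reference pair
    var startEpochMs = (refWallTime + (startTimestamp - refTimestamp)) * 1000;
    return new Date(startEpochMs).toISOString();
}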

Feature request: Load neutral page before closing connection

Feature
Add a default behavior or an option to load a neutral page just before closing the connection with Chrome.

Reason
The last loaded page remains open in Chrome after HAR generation is complete, which may result in unnecessary CPU utilization by Chrome (depending on the loaded page).

Note
I know it's possible to achieve this behavior with chrome-remote-interface, but I think it's better to use a single connection, and better to do this in one place (https://github.com/cyrus-and/chrome-har-capturer/blob/master/lib/client.js#L117). You can think of it as a kind of cleanup. A sketch follows.
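
A sketch of the requested cleanup, done externally with chrome-remote-interface (the feature itself would live in lib/client.js as suggested above):

const CDP = require('chrome-remote-interface');

// park the tab on a neutral page once the capture is done
async function parkTab() {
    const client = await CDP();
    try {
        await client.Page.navigate({url: 'about:blank'}); // neutral page
    } finally {
        await client.close();
    }
}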

404 responses for JavaScript and stylesheets are not captured

The capturer seems to be missing nonexistent JavaScript and stylesheet resources. For example, capturing the page below only captured the root document and photo.jpg. (Both requests, for photo.jpg and missing.js, return 404.)

<html>
  <body>
  hello,world
  <img src="photo.jpg"/>
  <script src="missing.js"></script>
  </body>
</html>

Comparing with a HAR saved from the Chrome inspector, 3 entries are captured there.

Upon further investigation, I found that Chrome sends LoadingFailed events for JavaScript and CSS resources, but LoadingFinished events for images. The following change works for me:

diff --git a/lib/har.js b/lib/har.js
index df4f397..d3cd924 100644
--- a/lib/har.js
+++ b/lib/har.js
@@ -57,7 +57,7 @@ function parsePage(pageId, page) {

 function parseEntry(pageref, entry) {
     // skip requests without response (requestParams is always present)
-    if (!entry.responseParams || !entry.responseFinishedS) {
+    if (!entry.responseParams) {
         return null;
     }
     // skip entries without timing information (doc says optional)

can this have a timeout option?

For some reason, either server-side slowness or client-side network congestion, some pages seemingly take forever to fully load; in this case can we have a timeout option to give up? I have a PR ready if this feature is agreed to be useful. The proposed usage:

Usage: chrome-har-capturer [options] URL...

Options:

  -h, --help           output usage information
  -t, --host <host>    Remote Debugging Protocol host
  -p, --port <port>    Remote Debugging Protocol port
  -o, --output <file>  dump to file instead of stdout
  -c, --content        also capture the requests body
  -a, --agent <agent>  user agent override
  -d, --delay <ms>     time to wait after the load event
      --timeout <seconds>  wait at most this many seconds before giving up
  -f, --force          continue even without benchmarking extension
  -v, --verbose        enable verbose output on stderr

Separate HAR file and consecutive load error

So this whole issue stems from the fact that I can't do require('chrome-har-capturer').load(large_list); I have found there comes a point where it can't handle the buffer to write the HAR output, and I have a list of about 1000 sites I need to HAR-scrape. Since I can't use one long array, the workaround is to use recursion like this:

var fs = require('fs');
var chc = require('chrome-har-capturer');
var list = ["https://github.com", "https://www.reddit.com"];

function loadSite(i) {
    if (i >= list.length) return; //ends recursion

    var c = chc.load(list[i]);

    c.on('connect', function () {
        console.log("Connected to Chrome: " + i);
    });
    c.on('end', function (har) {
        console.log("Done: " + i);

//	loadSite(++i); // uncomment to see Invalid tab index

//	setTimeout(function(){ loadSite(++i) }, 100); // uncomment to see it NOT have error

    });
    c.on('error', function (err) {
        console.error("Cannot connect to Chrome: " + err);
    });
}

// Kicks off recursion
loadSite(0);

The issue is that Chrome takes about 50 ms after opening a new tab to populate the devtoolsFrontendUrl and webSocketDebuggerUrl properties.

I guess there are a few ways of dealing with this:

  • Most people are probably not going to need to scrape hundreds of sites at once and can just parse by page reference when passing an array, leaving the setTimeout as a hack for the unusual case.
  • It would be nice to have separate HAR files in general: even with just 5 separate sites you might not want them all merged. So maybe build a splitter in, to keep people from rolling their own.
  • Add a quick (50-100 ms) retry to the check for the webSocketDebuggerUrl:
    • something where it calls again before throwing the error;
    • I notice that the npm module isn't updated, so the current issue is in the fetchDebuggingUrl function.

I would be happy to make the changes and submit the PR, but wanted to get your opinion on the subject first.

from-log release date?

Are there any plans for merging the from-log branch into master and releasing it to npm?
The capabilities of fromLog are a great fit for gathering HARs with Puppeteer.

Not an issue, just a question.

If I use a bandwidth shaper like netem to limit the bandwidth to different rates, will this have any effect on this module? I mean, how does chrome-har-capturer download the HAR? When bandwidth is limited to different rates some packets will be dropped, so how would this affect the download of the HAR?

It hangs when an error with SSL/TLS certificate occurs

Hey @cyrus-and, another report.

Example:

$ curl https://www.facebook.cz/
curl: (51) SSL: no alternative certificate subject name matches target host name 'www.facebook.cz'


$ chrome-har-capturer https://www.facebook.cz/
...hangs...

If you try to enter the URL in Chrome, you see a warning page. I guess that might be the issue?
Also, the giveUpTime option seems to have no effect in this case.

Not working with Chrome for Android

I can't launch Chrome with the --enable-net-benchmarking --enable-benchmarking options on my phone!

Android Chrome does not run with these options.

I am thinking of turning off the DNS-clearing function...

Any ideas?

choi_teemo:WEB_CE choi$ chrome-har-capturer naver.com
Cannot connect to Chrome
Error: Cannot inject JavaScript: {
    "result": {
        "type": "object",
        "objectId": "{\"injectedScriptId\":1,\"id\":1}",
        "subtype": "error",
        "className": "TypeError",
        "description": "TypeError: Cannot read property 'clearCache' of undefined\n    at <anonymous>:1:20\n    at Object.InjectedScript._evaluateOn (<anonymous>:875:140)\n    at Object.InjectedScript._evaluateAndWrap (<anonymous>:808:34)\n    at Object.InjectedScript.evaluate (<anonymous>:664:21)"
    },
    "wasThrown": true,
    "exceptionDetails": {
        "text": "Uncaught TypeError: Cannot read property 'clearCache' of undefined",
        "url": "",
        "line": 1,
        "column": 19,
        "scriptId": "51"
    }
}

Missing onContentLoad

Hi,

I'm finding that for some websites onContentLoad is never logged (but onLoad is). Is this a bug, or does this occur when Page.loadEventFired is triggered before Page.domContentEventFired?

redirectURL does not match the Location response header

According to http://www.softwareishard.com/blog/har-12-spec/

redirectURL [string] - Redirection target URL from the Location response header.

Example:

$ curl http://github.com/ -I
HTTP/1.1 301 Moved Permanently
Content-length: 0
Location: https://github.com/
Connection: close

$ chrome-har-capturer http://github.com/
...
               "request": {
                    "method": "GET",
                    "url": "http://github.com/",
                    "httpVersion": "http/1.1",
                    "cookies": [],
                    "headers": [
                        {
                            "name": "Upgrade-Insecure-Requests",
                            "value": "1"
                        },
                        {
                            "name": "User-Agent",
                            "value": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome Safari/537.36"
                        }
                    ],
                    "queryString": [],
                    "headersSize": 178,
                    "bodySize": -1
                },
                "response": {
                    "status": 307,
                    "statusText": "Internal Redirect",
                    "httpVersion": "http/1.1",
                    "cookies": [],
                    "headers": [
                        {
                            "name": "Location",
                            "value": "https://github.com/"
                        },
                        {
                            "name": "Non-Authoritative-Reason",
                            "value": "HSTS"
                        }
                    ],
                    "redirectURL": "",
                    "headersSize": 97,
                    "bodySize": 0,
                    "_transferSize": 97,
                    "content": {
                        "size": 0,
                        "mimeType": "",
                        "compression": 0
                    }
                },
...

From Chrome (Save as HAR):
...
        "request": {
          "method": "GET",
          "url": "http://github.com/",
          "httpVersion": "unknown",
          "headers": [
            {
              "name": "Upgrade-Insecure-Requests",
              "value": "1"
            },
            {
              "name": "User-Agent",
              "value": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.44 Safari/537.36"
            }
          ],
          "queryString": [],
          "cookies": [],
          "headersSize": -1,
          "bodySize": 0
        },
        "response": {
          "status": 307,
          "statusText": "Internal Redirect",
          "httpVersion": "unknown",
          "headers": [
            {
              "name": "Location",
              "value": "https://github.com/"
            },
            {
              "name": "Non-Authoritative-Reason",
              "value": "Delegate"
            }
          ],
          "cookies": [],
          "content": {
            "size": 0,
            "mimeType": "x-unknown"
          },
          "redirectURL": "https://github.com/",
          "headersSize": -1,
          "bodySize": -1,
          "_transferSize": 0
        },
...

Additional pageTimings

Would it be possible to add additional pageTimings to the HAR output, e.g. onContentLoad?
