Closed Bug 663423 Opened 13 years ago Closed 13 years ago

Memory usage reports via telemetry

Categories

(Core :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla7

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink:P1][inbound])

Attachments

(4 files, 2 obsolete files)

Now that telemetry is working, it'd be great to get some kind of data gathering and analysis relating to memory usage.
Blocks: 585196
We have *some* kind of reporting already. http://people.mozilla.com/~tglek/telemetry/log.html lists aggregate info on some of the about:memory stuff
Great!  So this bug is a matter of working out what more we can do with that data, if we can automate any kind of tracking, etc.
(In reply to comment #2)
> Great!  So this bug is a matter of working out what more we can do with that
> data, if we can automate any kind of tracking, etc.

Well, I put those probes in as an example. I would be happy if you could suggest refinements to that or a better approach altogether.
Here's the list of currently-tracked memory reporters, from TelemtryPing.js:

  "explicit/js/gc-heap": [1024, 512 * 1024, 10],
  "resident": [32 * 1024, 1024 * 1024, 10],
  "explicit/layout/all": [1024, 64 * 1025, 10],
We should do "explicit" as well, that's arguably the single most interesting number.  That'll require bug 660731.
Assignee: nobody → nnethercote
Depends on: 660731
explicit/js/gc-heap-chunk-unused (from bug 661474) would also be interesting, for an indication of fragmentation on the JS heap.
(In reply to comment #7)
> explicit/js/gc-heap-chunk-unused (from bug 661474) would also be
> interesting, for an indication of fragmentation on the JS heap.

is there a reason you didn't include a telemetry probe in that bug?
(In reply to comment #8)
> 
> is there a reason you didn't include a telemetry probe in that bug?

I am planning to do all the new telemetry additions in a single patch here :)
Attached patch patch 1 (obsolete) — Splinter Review
So the memory reporting in telemetry is completely busted.  This first patch fixes it.  Is this stuff tested anywhere?
Attachment #543339 - Flags: review?(tglek)
Attached patch patch 2Splinter Review
This patch: 
- Adds three extra memory measurements to telemetry.
- Removes |memReporters| from |gatherMemory|, because it is dead.

It relies on mgr.explicit which is added by the patch in bug 660731.
Attachment #543340 - Flags: review?(tglek)
BTW, I didn't do explicit/js/gc-heap-chunk-unused because that's computed via a multi-reporter (bug 666075) and there's currently no way to extract a single number from a multi-reporter.
Attachment #543339 - Flags: review?(tglek) → review+
Attachment #543340 - Flags: review?(tglek) → review+
(In reply to comment #10)
> Created attachment 543339 [details] [diff] [review] [review]
> patch 1
> 
> So the memory reporting in telemetry is completely busted.  This first patch
> fixes it.  Is this stuff tested anywhere?

Not yet. I'm not sure of the best way to test it. I was thinking of adding a NS_WARNING to the for loop if every memory reporter wasn't found (or if some of them are reporting 0, which seems crazy). Turns out there isn't an NS_WARNING equivalent from JS.
(In reply to comment #13)
> Turns out there isn't an NS_WARNING equivalent from JS.

JS_ASSERT!  (Yes, I'm serious;  if this is busted I want to know, not have a warning that gets lost in the 100s of other warnings :)
(In reply to comment #14)
> (In reply to comment #13)
> > Turns out there isn't an NS_WARNING equivalent from JS.
> 
> JS_ASSERT!  (Yes, I'm serious;  if this is busted I want to know, not have a
> warning that gets lost in the 100s of other warnings :)

I meant from .js
Also while we do this, can we check if val != 0 before adding it to a histogram, I don't know why MEMORY_RESIDENT would be 0 so often.
The above patches use NS_ASSERT, which isn't available in this module(and is apparently dangerous too as it shows up as a modal dialog).

Turns out there is no good way to assert(that I can see) from js.
Attachment #543528 - Flags: review?(nnethercote)
I don't want to take the risk of branching for aurora with broken memory reporters. Lets checkin the fix right away.
Attachment #543339 - Attachment is obsolete: true
Attachment #543529 - Flags: review?(justin.lebar+bug)
Comment on attachment 543529 [details] [diff] [review]
patch 1(for checkin)

>+      if (val)
>+        h.add(val);

Why are we excluding zero values?  It's perfectly valid for the number of hard page faults since the last sample to be 0, for instance.
(In reply to comment #19)
> Why are we excluding zero values?  It's perfectly valid for the number of
> hard page faults since the last sample to be 0, for instance.

Taras and I had a long argument about this on IRC.

He thinks that 0 values aren't interesting, because they're a non-event.  In the unlikely scenario that you want to compute the proportion of time spent not swapping versus swapping, you can infer the number of 0s that are missing by looking at some other memory reporter.  Also, since presumably the 0s bucket will be much larger than the other buckets, including it will make the histogram ugly.

I think the histogram makes no sense without the 0 values.  You *always* want to compare the number of swap events to the number of non-swap events.  Otherwise you're just learning a number which is strongly correlated with how long the browser has been running.  If the histogram can't display this well, then it's bugged.

Anyway, Taras needs this hack in there because some other reporters aren't working properly -- RSS reports 0 about half the time, for reasons which are unclear to me.  We agreed that the patch gets r=me if this check is called out as a hack and we commit to fix it in the next cycle.
Attachment #543529 - Flags: review?(justin.lebar+bug) → review+
> You *always* want to compare the number of swap events to the number of non-swap events.

For instance, I might want to know: "Is FF becoming less swappy for users?"  I'd compute this by comparing the number of events below a threshold (say less than 1000 faults / min) to those above a threshold.  The 0s are crucial here.

(This further shows why it doesn't make sense to exclude 0s: A value of 0 faults / min isn't a lot different from 2 faults / min.  Yet only one makes it into the histogram?)
Also, WRT the point that you can infer the number of 0s by examining another memory reporter: That only works if the memory reporters aren't randomly returning 0.  :)
Comment on attachment 543528 [details] [diff] [review]
enforce mem reporters

Review of attachment 543528 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #543528 - Flags: review?(nnethercote) → review+
Target Milestone: flash10 → ---
Attached patch UNITS_COUNT fix. (obsolete) — Splinter Review
I tested this and didn't see 0s coming from anywhere except hard-page-faults, where it's expected.

If RSS is still reporting 0 for some unknown reason, I don't think we should hide it, as we currently do.  Since it's possible that the 0s are reported in place of certain values (as we don't know where they were coming from), taking out the 0s entirely kind of invalidates the RSS measure.

If you really feel strongly that 0 hard page faults in an interval shouldn't be reported, then let's exclude 0s for UNITS_COUNT, but not for other units.  Although I still feel like that's entirely the wrong thing to do.  Maybe njn can referee...
Attachment #543613 - Attachment description: UNITS_COUNTS fix. → UNITS_COUNT fix.
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

Taras, do you think we can get this in before we branch?
Attachment #543613 - Flags: review?(tglek)
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

>-      // hack to deal with some memory reporters returning 0
>-      if (val)
>-        h.add(val);
>+      h.add(val);
This is still needed for UNITS_BYTES.
Attachment #543613 - Flags: review?(tglek) → review-
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

Maybe njn can arbitrate; would you be OK with that, Taras?
Attachment #543613 - Flags: feedback?(nnethercote)
I'd like to include zero counts for byte measurements.  We should never get zero for them, AFAIK -- memory reporters return -1 if they fail -- so if we do that is interesting, and probably a sign something is buggy.
I think we all agree with including the zeros for byte measurements.  What about for counts?
Comment on attachment 543613 [details] [diff] [review]
UNITS_COUNT fix.

Review of attachment 543613 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #543613 - Flags: feedback?(nnethercote) → feedback+
(In reply to comment #30)
> I'd like to include zero counts for byte measurements.  We should never get
> zero for them, AFAIK -- memory reporters return -1 if they fail -- so if we
> do that is interesting, and probably a sign something is buggy.

Ok, so r+ with a fix that checks for -1 and does a continue.

Math.round(-1/1024)==0 :)
(In reply to comment #31)
> I think we all agree with including the zeros for byte measurements.  What
> about for counts?

I don't think we should throw away any valid measurements, ie. measurements where nothing went wrong when taking the measurement.

Do we need to check for -1 for page fault counts?
> Do we need to check for -1 for page fault counts?

Yes, they return -1 on failure.
Attachment #543613 - Attachment is obsolete: true
Comment on attachment 543627 [details] [diff] [review]
Fix for UNITS_COUNT v2

Thanks, happy to see a proper fix for this.
Please triplecheck that this works as expected on your machine.
Attachment #543627 - Flags: review?(tglek) → review+
Checked one last time that it works on my machine, and pushed to inbound.

http://hg.mozilla.org/integration/mozilla-inbound/rev/016fcb800fc2
Whiteboard: [MemShrink:P1] → [MemShrink:P1][inbound]
http://hg.mozilla.org/integration/mozilla-inbound/rev/1ec0339ac9a1

That is "patch 2" with some minor rebasing.

This should be the last patch for this bug, so when it's merged to m-c this bug can be marked as RESOLVED FIXED.  Thanks.
http://hg.mozilla.org/mozilla-central/rev/016fcb800fc2
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla7
ehr, still missing part 2 from this merge.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
http://hg.mozilla.org/mozilla-central/rev/1ec0339ac9a1
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: