
segfault with gegen group() #28

Closed
sergiocorreia opened this issue Nov 10, 2017 · 14 comments

@sergiocorreia

I'm running a fairly complex gegen on a fairly large dataset:

. count
  142,686,929
. sort id4 id3 id5 id1 // sorted, but not exactly by what I want to gegen
. gegen double loanid = group(id1 id2 id3 id4 id5 id6), missing verbose

Immediately after that I get a segfault, even when using a server with lots of memory (64 GB). This is on Stata 14 MP/6 on Linux.

I would usually try to provide a minimal working example that replicates this error, but even with logging or with set trace on, I can't catch what is going on.

Is it possible that there is not enough memory to copy all the variables and that causes the segfault?

Thanks!
Sergio

@mcaceresb
Owner

Can you give me desc and sum on id1 to id6?

@mcaceresb
Owner

Running out of memory should not cause a segfault. If all the IDs are integers, I suspect this is a bug in the bijection code. To check, you can run

gen str1 dummy = " "
gegen double loanid = group(id1 id2 id3 id4 id5 id6 dummy), missing verbose

I'll try to hop on a server tomorrow and check whether I can reproduce it (though it will go better with the output of desc and sum).
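For intuition, here is a minimal sketch of the bijection idea in Python (the actual gtools internals are in C, and details like column order and the exact limit may differ): each integer key column is offset to start at zero and the columns are combined in a mixed-radix encoding; when the combined range of the columns is too large, the encoding would overflow and a hash must be used instead.

```python
def biject(rows, cap=2**63 - 1):
    """Map tuples of integer keys to a single group index (mixed radix).

    Returns None when the combined range of the columns exceeds `cap`,
    in which case a hash-based strategy is needed instead.
    """
    ncols = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(ncols)]
    hi = [max(r[j] for r in rows) for j in range(ncols)]
    ranges = [h - l + 1 for l, h in zip(lo, hi)]

    span = 1
    for r in ranges:
        span *= r
        if span > cap:  # combined range too large: fall back on the hash
            return None

    out = []
    for row in rows:
        ix = 0
        for j in range(ncols):  # treat each column as a digit
            ix = ix * ranges[j] + (row[j] - lo[j])
        out.append(ix)
    return out
```

Appending a string variable, as in the dummy trick above, makes the keys non-integer, which rules out the bijection path and forces the hash.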

@sergiocorreia
Author

Probably won't be able to until tomorrow or Monday (Veterans Day), but I'll try to send you something reproducible.

BTW, in case it's worth it on your side, I added a shortcut to -fegen- for when the dataset is already sorted by the variables, in which case it's faster to just do it from Stata (lines 111-112).

@mcaceresb
Owner

I don't get the speedups you mention (Stata/MP 14.1 with 8 cores), and I'm not sure why. You also said that this type of thing is faster for you in issue 21, but even after a sort, "by `varlist'" is not very fast on my end. Can you point me to some benchmarks? Then I can try to figure out when it might be faster to switch.

A reproducible example would be better, but even variable descriptions would help. I doubt it's memory: the run below used at most ~20 GiB. Here is an example where the bijection would overflow:

. set rmsg on
r; t=0.00 7:57:51

. clear
r; t=0.01 7:57:54

. set obs 142686929
number of observations (_N) was 0, now 142,686,929
r; t=3.09 7:57:57

. gen long id1 = int(runiform() * 10000)
r; t=13.74 7:58:11

. gen long id2 = int(rnormal()  * 10000)
r; t=27.45 7:58:39

. gen long id3 = int(runiform() * 10000)
r; t=16.32 7:58:55

. gen long id4 = int(rnormal()  * 10000)
r; t=32.42 7:59:28

. gen long id5 = int(runiform() * 10000)
r; t=20.07 7:59:48

. gen long id6 = int(rnormal()  * 10000)
r; t=31.13 8:00:19

. gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
Plugin setup; .02 seconds
Generated targets; 8.951 seconds
Parsed by variables; .002 seconds
        Plugin step 1: Read in by variables; 20.210 seconds.
Bijection OK with all integers (i.e. no extended miss val)? No.
Values OK but range (1,277,352,040,000,000,000 * 112,824) too large; falling back on hash.
                Plugin step 2.1: Determined hashing strategy; 6.820 seconds.
                Plugin step 2.3: Hashed variables (128-bit); 8.210 seconds.
Radix sort on hash (16-bits at a time)
                Plugin step 2.4: Sorted integer-only hash; 38.300 seconds.
        Plugin step 2: Hashed by variables; 53.330 seconds.
Found 1 64-bit hash collision(s). Fell back on 128-bit hash.
N = 142,686,929; 142,686,929 balanced groups of size 1
        Plugin step 3: Set up panel; 2.710 seconds.
                Plugin step 4.1: Checked for hash collisions; 8.130 seconds.
There were no hash collisions: 6 variables, 142,686,929 obs, 142,686,929 groups
                Plugin step 4.2: Keep only one row per group; 10.120 seconds.
                Plugin step 4.3: Sorted groups in memory; 72.550 seconds.
        Plugin step 4: Created indexed array with sorted by vars; 91.900 seconds.
        Plugin step 5: Copied back encoding to Stata; 61.750 seconds.
Plugin runtime; 230.6 seconds
Total runtime; 240.5 seconds
r; t=240.54 8:04:20

Here is one where it wouldn't:

. set rmsg on
r; t=0.00 8:05:28

. clear
r; t=0.01 8:05:28

. set obs 142686929
number of observations (_N) was 0, now 142,686,929
r; t=3.13 8:05:31

. gen long id1 = int(runiform() * 100)
r; t=14.24 8:05:46

. gen long id2 = int(rnormal()  * 100)
r; t=27.44 8:06:13

. gen long id3 = int(runiform() * 100)
r; t=16.52 8:06:30

. gen long id4 = int(rnormal()  * 100)
r; t=31.03 8:07:01

. gen long id5 = int(runiform() * 100)
r; t=20.06 8:07:21

. gen long id6 = int(rnormal()  * 100)
r; t=31.18 8:07:53

. gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
Plugin setup; .02 seconds
Generated targets; 8.941 seconds
Parsed by variables; .003 seconds
        Plugin step 1: Read in by variables; 20.380 seconds.
Bijection OK with all integers (i.e. no extended miss val)? Yes.
                Plugin step 2.1: Determined hashing strategy; 6.770 seconds.
                Plugin step 2.3: Bijected integers to natural numbers; 3.360 seconds.
Radix sort on hash (16-bits at a time)
                Plugin step 2.4: Sorted integer-only hash; 27.610 seconds.
        Plugin step 2: Hashed by variables; 37.740 seconds.
N = 142,686,929; 142,686,686 unbalanced groups of sizes 1 to 2
        Plugin step 3: Set up panel; 2.090 seconds.
                Plugin step 4.2: Keep only one row per group; 9.410 seconds.
        Plugin step 4: Created indexed array with sorted by vars; 10.550 seconds.
        Plugin step 5: Copied back encoding to Stata; 21.510 seconds.
Plugin runtime; 92.82 seconds
Total runtime; 102.4 seconds
r; t=102.49 8:09:35
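The "Radix sort on hash (16-bits at a time)" line in both logs refers to a least-significant-digit radix sort over the hash words. A minimal Python sketch of that mechanism, assuming 64-bit keys (the real implementation is in C and handles 128-bit hashes; this only illustrates the 16-bit-digit passes and how the original row order is tracked):

```python
def radix_sort16(keys):
    """LSD radix sort of unsigned 64-bit keys, 16 bits per pass.

    Returns (sorted_keys, order), where order[i] is the original
    position of sorted_keys[i], mirroring how the sorted hash keeps
    track of the Stata row numbers.
    """
    order = list(range(len(keys)))
    keys = list(keys)
    for shift in (0, 16, 32, 48):  # four passes cover 64 bits
        counts = [0] * 65536
        for k in keys:
            counts[(k >> shift) & 0xFFFF] += 1
        # prefix sums give the first output slot for each 16-bit digit
        pos = [0] * 65536
        total = 0
        for d in range(65536):
            pos[d] = total
            total += counts[d]
        new_keys = [0] * len(keys)
        new_order = [0] * len(keys)
        for k, o in zip(keys, order):  # stable scatter preserves prior passes
            d = (k >> shift) & 0xFFFF
            new_keys[pos[d]] = k
            new_order[pos[d]] = o
            pos[d] += 1
        keys, order = new_keys, new_order
    return keys, order
```

Each pass is a counting sort on one 16-bit digit; stability across passes is what makes the final order correct.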

@sergiocorreia
Author

About the overflow, I couldn't replicate it (I managed to get a very similar dataset with the same number of obs., but the program ran successfully). The only thing I can think of is either a server-specific memory error (unlikely), or something very specific to the data at that point in time. For instance, the variables used to create the group were unique identifiers of the dataset, so there were 142 million different values of the ID.

About the speedup, here is a simple example that I ran on my personal laptop (4 cores), though to be honest the gain doesn't seem very large:

clear
set obs 20000000
gen double id1 = ceil(runiform()*100) * 1000
gen double id2 = ceil(runiform()*1000) * 10000
gen double id3 = ceil(runiform()*10000) * 10000
sort id1 id2 id3

timer clear

cap drop id
timer on 1
by id1 id2 id3: gen byte _ = (_n == 1)
qui gen double id = sum(_)
drop _
timer off 1
mata: hash1(st_data(., "id"))

cap drop id
timer on 2
by id1 id2 id3: gen double id = (_n == 1)
qui replace id = sum(id)
timer off 2
mata: hash1(st_data(., "id"))

cap drop id
timer on 3
gegen double id = group(id1 id2 id3)
timer off 3
mata: hash1(st_data(., "id"))

timer list

And the timing results:


. timer list
   1:      3.31 /        1 =       3.3070
   2:      2.82 /        1 =       2.8160
   3:      3.46 /        1 =       3.4560

So it seems there are some savings but not much...
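For reference, the running-sum idiom in timers 1 and 2 (flag the first observation of each by-group, then cumulate the flags) amounts to a single linear scan over the sorted data. A Python sketch of the equivalent logic:

```python
def group_ids_sorted(rows):
    """Group id via a running sum of "first observation in by-group"
    flags, assuming `rows` is already sorted by the key columns.
    Mirrors: by id1 id2 id3: gen byte first = (_n == 1); gen id = sum(first)
    """
    ids = []
    gid = 0
    prev = object()  # sentinel that compares unequal to any row
    for row in rows:
        if row != prev:  # _n == 1 within the by-group
            gid += 1
            prev = row
        ids.append(gid)
    return ids
```

No hashing or re-sorting is needed, which is why this idiom can compete with gegen when the data is pre-sorted.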

@mcaceresb
Owner

mcaceresb commented Nov 14, 2017

Here is the benchmark on my computer (Stata/IC):

   1:     23.16 /        1 =      23.1550
   2:     22.59 /        1 =      22.5950
   3:      2.86 /        1 =       2.8560

but on a server (Stata/MP 8 cores):

   1:      3.40 /        1 =       3.3980
   2:      2.38 /        1 =       2.3800
   3:      4.40 /        1 =       4.4030

So it is faster in Stata/MP with many cores (almost 2x) but 10x slower in IC (which is my use case). One option would be to switch if, say, Stata is using >= 4 cores. But I'd rather do something like bypassing the hash when the data is sorted.

Reading the by variables (1s), checking whether the doubles are really integers (0.7s; they are, but the bijection would require too large an index), and hashing the data (1.9s) together take 3.6 seconds, which is most of the runtime. I can't speed up the first step, but I can skip the second step and the hash if the data is sorted. That's 2.6s to cut, which is plenty, so I'll have to figure out how to generate my panel setup from the sorted set of variables to bring gegen back to par in this scenario.
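A sketch, in Python, of what that shortcut could look like: verify the data is sorted in one pass and, if so, build the panel setup (the group start indices) directly, skipping the integer check and the hash entirely. This only illustrates the idea, not the gtools C code:

```python
def panelsetup_if_sorted(rows):
    """Return group start indices if `rows` is sorted, else None.

    When this succeeds, the integer check and the hash can be skipped
    entirely: the panel structure falls out of a single linear scan.
    """
    starts = [0] if rows else []
    for i in range(1, len(rows)):
        if rows[i] < rows[i - 1]:
            return None  # not sorted: must fall back on the hash path
        if rows[i] != rows[i - 1]:
            starts.append(i)  # a new group begins at row i
    return starts
```

The sortedness check and the panel setup share the same pass, so the shortcut costs roughly one read of the by variables.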

@mcaceresb
Owner

Here's a run with all 6 IDs being unique:

clear
set obs 142686929
gen rsort = .

forvalues i = 1 / 6 {
    replace rsort = runiform() * 100
    sort rsort
    gen long id`i' = _n
}

replace rsort = runiform() * 100
sort rsort

gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)

Gives

Plugin setup; 0 seconds
Generated targets; 8.373 seconds
Parsed by variables; .005 seconds
        Plugin step 1: Read in by variables; 17.830 seconds.
Bijection OK with all integers (i.e. no extended miss val)? No.
Values OK but range (20,359,559,707,451,041 * 142,686,929) too large; falling back on hash.
                Plugin step 2.1: Determined hashing strategy; 6.190 seconds.
                Plugin step 2.3: Hashed variables (128-bit); 6.480 seconds.
Radix sort on hash (16-bits at a time)
                Plugin step 2.4: Sorted integer-only hash; 36.160 seconds.
        Plugin step 2: Hashed by variables; 48.840 seconds.
                Plugin step 3.1: Created group index; 1.360 seconds.
N = 142,686,929; 142,686,929 balanced groups of size 1
                Plugin step 3.2: Normalized group index and Stata index; 0.720 seconds.
        Plugin step 3: Set up panel; 2.080 seconds.
                Plugin step 4.1: Checked for hash collisions; 9.830 seconds.
There were no hash collisions: 6 variables, 142,686,929 obs, 142,686,929 groups
                Plugin step 4.2: Keep only one row per group; 10.720 seconds.
                Plugin step 4.3: Sorted groups in memory; 62.050 seconds.
        Plugin step 4: Created indexed array with sorted by vars; 83.790 seconds.
                Plugin step 5.1: Generated encoding in Stata order; 27.520 seconds.
                Plugin step 5.2: Copied back encoding to Stata; 2.060 seconds.
        Plugin step 5: Copied back encoding to Stata; 29.590 seconds.
Plugin runtime; 182.2 seconds
Total runtime; 190.6 seconds

@mcaceresb
Owner

Re: Speedup, I now set up the panel directly and the runtime goes down considerably. It's still marginally slower, but I think I can call it a day. I will run tests and commit later in the week.

   1:      3.62 /        1 =       3.6210
   2:      2.66 /        1 =       2.6600
   3:      2.84 /        1 =       2.8420

@sergiocorreia
Author

Cool! Yes, I agree that should be enough.

I'm still bugged by those crashes I was having last week, but I can't replicate them, so you might want to close the issue at least for now.

@mcaceresb
Owner

I am bugged as well, but I cannot replicate them either. On my end it handles that many observations just fine... Maybe it just hit the bijection limit? That is, the check I made passed because it was just under the limit, but the actual computation overflowed? That might make sense if you were using a version of gtools < 0.9.0; from 0.9.0 onward I changed the limit to half the largest unsigned integer.

@mcaceresb
Owner

One annoyance with the speedup is that it is actually slower when all the variables are integers and the bijection will not overflow. So:

  1. The "speedup" is slower when all inputs are integers and the bijection is OK.
  2. The speedup is faster when the inputs are not integers and the hash must be used.
  3. The speedup is faster when the inputs are integers but the bijection might overflow.

In other words, the "speedup" is not always faster. I'm thinking of adding a hash() option that switches hashing methods, so the user can choose:

  • 0: Default. Uses the panelsetup if sorted, bijection if all integers, hash otherwise.
  • 1: Always hashes. Uses bijection if all integers, hash otherwise.
  • 2: Always uses spookyhash.

In all cases, if the data is already sorted then the hash would not be re-sorted.
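Under my reading of the proposed option, the dispatch could be sketched like this in Python (the function and return names are illustrative, not the actual gtools implementation):

```python
def choose_strategy(hash_opt, is_sorted, all_integers, biject_ok):
    """Pick a grouping strategy given the proposed hash() option.

    hash_opt 0: panel setup if sorted; bijection if all integers; hash otherwise.
    hash_opt 1: skip the sorted shortcut; bijection if all integers; hash otherwise.
    hash_opt 2: always spookyhash.
    """
    if hash_opt == 2:
        return "spookyhash"
    if hash_opt == 0 and is_sorted:
        return "panelsetup"
    if all_integers and biject_ok:
        return "bijection"
    return "spookyhash"
```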

@sergiocorreia
Author

Makes sense. One thing that would be best as some sort of "endgame" is to expose the C functions through thin wrappers, and do everything through those. That's in a way what I did with ftools, and some reasons for that are (i) it's easier for others to extend, (ii) it's easier to run extensive tests against a smaller set of C functions, and (iii) it's easier to improve later on.

Is that feasible, or maybe even already there? Like, what would I need if I wanted to accelerate reghdfe with those functions (or any data-intensive package that uses categories)?

@mcaceresb
Owner

Re: reghdfe. What do you need from ftools? If you tell me then maybe the functionality is already present, but I'm not sure what you need.

Re: Wrappers. I don't think they will be as useful as your wrappers.

I've made work easier for myself by making every .ado file a wrapper for _gtools_internal.ado, which is a massive function that handles everything and sets it up for C. Hence C assumes that a bunch of scalars, matrices, and all the relevant variables will always exist and have specific shapes. This is because C cannot create variables, matrices, or anything else that doesn't already exist in Stata.

So I could write a number of wrappers for various parts of the function, but executing them in sequence would be much slower than executing a full function run. Further, they couldn't be thin wrappers for C, because C cannot interact with Stata outside a relatively restricted set of functions (more specifically, I cannot keep objects created in C alive after the plugin has finished executing).

However, I could write thin wrappers for _gtools_internal.ado. To pull the curtain back a little, all the functions (except gisid, which executes 1-4 but differs onward) share a common structure:

  1. Read the data
  2. Determine hashing strategy (this includes an "is sorted" check)
  3. Hash
  4. Sort hash (keeping track of how it maps to Stata row numbers)
  5. Panel setup
  6. Check for collisions
  7. Sort the groups (with index)
  8. Map sorted group index to sorted hash index
  9. Function-specific stuff

Steps 2, 3, 6, and 7 require a copy of the data to be available to C in memory. Saving the results of steps 3, 4, 5, or 6-8 would require creating variables in Stata in addition to allocating memory in C. There is also an inefficiency throughout in casting doubles to and from 64-bit integers. Here is what I could write:

  • Is the data sorted? Executes 1 and checks if sorted. Gives yes or no.

  • Is the bijection OK? Executes 1 and checks if can biject. Gives yes or no.

  • Hash. Creates 3 variables from Stata and executes 1-3. The first two variables need to be doubles and store either the bijection (with the second left empty) or the two parts of the spookyhash. The third variable can be long or double and is the index mapping the hash to the Stata observations (in case the user passes [if] [in] or drops missing rows).

  • Hash sort. Either creates 3 variables from Stata and executes 1-4 or picks up from the hash step above and executes 4. It sorts the hash and stores the sorted hash along with the index.

  • Panel setup. Either creates 2 variables from Stata and executes 1-5, or creates 1 variable, picks up from the hash sort step above, and executes 5. This step creates the index to Stata observations, if it does not exist, and stores in the first J observations the start points of the grouped data.

  • Check for collisions + sort groups + map sorted group index to hash index. This can pick up from the panel setup step by creating one extra variable (which will be the sort order of the groups); it would re-read the observations into memory, check for collisions, sort them, and store the sort index. It can also do steps 1-8 directly after creating 3 variables from Stata.

Is that useful? Running any one of those would be fast, but running a sequence of them would be really slow. Of course, if the user could somehow interact with the C results directly then it would not be a problem, but that is not possible with the current plugin interface.
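As an illustration of the collision check (step 6 above): once the observations are ordered by hash value, a collision is simply two identical hash values whose underlying key tuples differ. A minimal Python sketch of that scan (not the C implementation):

```python
def find_collisions(hashes, keys):
    """Count hash collisions: identical hash values whose underlying
    key tuples differ. Sorts by hash first, then compares adjacent
    entries, so each group of equal hashes is contiguous.
    """
    paired = sorted(zip(hashes, keys))
    collisions = 0
    for (h1, k1), (h2, k2) in zip(paired, paired[1:]):
        if h1 == h2 and k1 != k2:  # same hash, different keys
            collisions += 1
    return collisions
```

This is why the check needs the original by variables in memory alongside the sorted hash: equal hashes alone cannot distinguish a true duplicate from a collision.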

@mcaceresb
Owner

I will close this issue, as you suggest, and we can continue the discussion of additions to gtools (including an API, integration with reghdfe, etc.) in issue #30.
