segfault with gegen group() #28
Can you give me desc and sum on id1 to id6?

---
Out of memory should not cause a segfault. If all the IDs are integers, I am thinking this is a mistake with the bijection code. To check, you can do

gen str1 dummy = " "
gegen double loanid = group(id1 id2 id3 id4 id5 id6 dummy), missing verbose

I'll try to hop on a server tomorrow and check if I can reproduce (but it will work better with the output of desc and sum).

---
Probably won't be able to until tomorrow or Monday (Veterans Day), but I'll try to send you something reproducible. BTW, in case it's worth it on your side, I added a shortcut to -fegen- for the case where the dataset is already sorted by the variables (in which case it's faster to just do it from Stata; lines 111-112).

---
I don't get the speedups you mention (Stata/MP 14.1 with 8 cores), and I'm not sure why. You also said in issue 21 that it is faster for you to do that type of thing, but even after a sort, "by `varlist'" is not very fast on my end. Can you point me to some benchmarks? Maybe then I can try to debug when it might be faster to switch. A reproducible example would be better, but even variable descriptions would help.

I doubt it's memory: I have this run, which used ~20GiB of memory at most. Here is an example where the bijection would overflow:

. set rmsg on
r; t=0.00 7:57:51
. clear
r; t=0.01 7:57:54
. set obs 142686929
number of observations (_N) was 0, now 142,686,929
r; t=3.09 7:57:57
. gen long id1 = int(runiform() * 10000)
r; t=13.74 7:58:11
. gen long id2 = int(rnormal() * 10000)
r; t=27.45 7:58:39
. gen long id3 = int(runiform() * 10000)
r; t=16.32 7:58:55
. gen long id4 = int(rnormal() * 10000)
r; t=32.42 7:59:28
. gen long id5 = int(runiform() * 10000)
r; t=20.07 7:59:48
. gen long id6 = int(rnormal() * 10000)
r; t=31.13 8:00:19
. gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
Plugin setup; .02 seconds
Generated targets; 8.951 seconds
Parsed by variables; .002 seconds
Plugin step 1: Read in by variables; 20.210 seconds.
Bijection OK with all integers (i.e. no extended miss val)? No.
Values OK but range (1,277,352,040,000,000,000 * 112,824) too large; falling back on hash.
Plugin step 2.1: Determined hashing strategy; 6.820 seconds.
Plugin step 2.3: Hashed variables (128-bit); 8.210 seconds.
Radix sort on hash (16-bits at a time)
Plugin step 2.4: Sorted integer-only hash; 38.300 seconds.
Plugin step 2: Hashed by variables; 53.330 seconds.
Found 1 64-bit hash collision(s). Fell back on 128-bit hash.
N = 142,686,929; 142,686,929 balanced groups of size 1
Plugin step 3: Set up panel; 2.710 seconds.
Plugin step 4.1: Checked for hash collisions; 8.130 seconds.
There were no hash collisions: 6 variables, 142,686,929 obs, 142,686,929 groups
Plugin step 4.2: Keep only one row per group; 10.120 seconds.
Plugin step 4.3: Sorted groups in memory; 72.550 seconds.
Plugin step 4: Created indexed array with sorted by vars; 91.900 seconds.
Plugin step 5: Copied back encoding to Stata; 61.750 seconds.
Plugin runtime; 230.6 seconds
Total runtime; 240.5 seconds
r; t=240.54 8:04:20

Here is one where it wouldn't:

. set rmsg on
r; t=0.00 8:05:28
. clear
r; t=0.01 8:05:28
. set obs 142686929
number of observations (_N) was 0, now 142,686,929
r; t=3.13 8:05:31
. gen long id1 = int(runiform() * 100)
r; t=14.24 8:05:46
. gen long id2 = int(rnormal() * 100)
r; t=27.44 8:06:13
. gen long id3 = int(runiform() * 100)
r; t=16.52 8:06:30
. gen long id4 = int(rnormal() * 100)
r; t=31.03 8:07:01
. gen long id5 = int(runiform() * 100)
r; t=20.06 8:07:21
. gen long id6 = int(rnormal() * 100)
r; t=31.18 8:07:53
. gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)
Plugin setup; .02 seconds
Generated targets; 8.941 seconds
Parsed by variables; .003 seconds
Plugin step 1: Read in by variables; 20.380 seconds.
Bijection OK with all integers (i.e. no extended miss val)? Yes.
Plugin step 2.1: Determined hashing strategy; 6.770 seconds.
Plugin step 2.3: Bijected integers to natural numbers; 3.360 seconds.
Radix sort on hash (16-bits at a time)
Plugin step 2.4: Sorted integer-only hash; 27.610 seconds.
Plugin step 2: Hashed by variables; 37.740 seconds.
N = 142,686,929; 142,686,686 unbalanced groups of sizes 1 to 2
Plugin step 3: Set up panel; 2.090 seconds.
Plugin step 4.2: Keep only one row per group; 9.410 seconds.
Plugin step 4: Created indexed array with sorted by vars; 10.550 seconds.
Plugin step 5: Copied back encoding to Stata; 21.510 seconds.
Plugin runtime; 92.82 seconds
Total runtime; 102.4 seconds
r; t=102.49 8:09:35
About the overflow, I couldn't replicate it (I managed to get a very similar dataset with the same number of obs., but the program ran successfully). The only thing I can think of is either a server-specific memory error (unlikely) or something very specific to the data at that point in time. For instance, the variables used to create the group were unique identifiers of the dataset, so there were 142 million different values of the ID.

About the speedup, this is a simple example that I ran on my personal laptop (4 cores), but tbh the gain doesn't seem very large:

clear
set obs 20000000
gen double id1 = ceil(runiform()*100) * 1000
gen double id2 = ceil(runiform()*1000) * 10000
gen double id3 = ceil(runiform()*10000) * 10000
sort id1 id2 id3
timer clear
cap drop id
timer on 1
by id1 id2 id3: gen byte _ = (_n == 1)
qui gen double id = sum(_)
drop _
timer off 1
mata: hash1(st_data(., "id"))
cap drop id
timer on 2
by id1 id2 id3: gen double id = (_n == 1)
qui replace id = sum(id)
timer off 2
mata: hash1(st_data(., "id"))
cap drop id
timer on 3
gegen double id = group(id1 id2 id3)
timer off 3
mata: hash1(st_data(., "id"))
timer list

And the timing results:

So it seems there are some savings but not much...

---
Here is the benchmark on my computer (Stata/IC):

1: 23.16 / 1 = 23.1550
2: 22.59 / 1 = 22.5950
3: 2.86 / 1 = 2.8560

but on a server (Stata/MP 8 cores):

1: 3.40 / 1 = 3.3980
2: 2.38 / 1 = 2.3800
3: 4.40 / 1 = 4.4030

So it is faster in Stata/MP with many cores (almost 2x) but 10x slower in IC (which is my use case). One option would be to switch if, say, Stata is using >= 4 cores. But I'd rather do something like bypassing the hash when the data is sorted. Reading the by variables (1s), checking whether the doubles are really integers (0.7s; they are, but the bijection requires too large an index), and hashing the data (1.9s) overall take 3.6 seconds, which is most of the runtime. I can't speed up the first step, but I can skip the second step and the hash if the data is sorted. There's 2.6s there to cut, which is plenty, so I'll have to figure out how to generate my panelsetup from the sorted set of variables to bring gegen back to par in this scenario.
Here's a run with all 6 IDs being unique:

clear
set obs 142686929
gen rsort = .
forvalues i = 1 / 6 {
replace rsort = runiform() * 100
sort rsort
gen long id`i' = _n
}
replace rsort = runiform() * 100
sort rsort
gegen g_id = group(id1 id2 id3 id4 id5 id6), v bench(2)

Gives

---
Re: Speedup, I now set up the panel directly and the runtime goes down considerably. Still marginally slower, but I think I can call this a day. I will run tests and commit later in the week.

1: 3.62 / 1 = 3.6210
2: 2.66 / 1 = 2.6600
3: 2.84 / 1 = 2.8420

---
Cool! Yes, I agree that should be enough. I'm still bugged about those crashes I was having last week, but I can't replicate them, so I think you might want to close the issue, at least for now.

---
I am bugged as well, but I cannot replicate them either. On my end it handles that many observations just fine... Maybe it was a case where it just hit the bijection limit? That is, it was just under, so it passed the check I made, but when it came down to it, it overflowed? That might make sense if you were using a version of gtools < 0.9.0, but from 0.9.0 onward I changed the limit to be the largest unsigned integer divided by 2.

---
One annoyance with the speedup is that it is actually slower if all variables are integers and the bijection will not overflow. So
So the "speedup" is not always faster and not always slower. I'm thinking to add a
In all cases, if the data is already sorted then the hash would not be re-sorted.
Makes sense. One thing that would be best as some sort of "endgame" is to expose the C functions through thin wrappers, and do everything through those. That's in a way what I did with ftools, and some reasons for that are i) easier for others to extend, ii) easier to run extensive tests against a smaller set of C functions, and iii) easier to improve later on. Is that feasible, or maybe even already there? Like, what would I need if I wanted to accelerate reghdfe with those functions (or any data-intensive package that uses categories)?

---
Re: reghdfe. What do you need from ftools? If you tell me, then maybe the functionality is already present, but I'm not sure what you need.

Re: Wrappers. I don't think they will be as useful as your wrappers. I've made work easier for myself by making every .ado file a wrapper already for _gtools_internal.ado, which is a massive function that handles everything and sets it up for C. Hence C assumes that a bunch of scalars, matrices, and all the relevant variables will always exist and have specific shapes. This is because C cannot create variables, matrices, or anything that doesn't already exist in Stata. So I could write a number of wrappers for various parts of the function, but executing them would be way slower than executing a full function run. Further, they couldn't be thin wrappers for C, because C cannot interact with Stata outside a relatively restricted set of functions (more specifically, I cannot keep objects I made in C available after the plugin has finished executing). However, I could write thin wrappers for _gtools_internal.ado. To pull the curtain back a little, all the functions (except gisid, which executes 1-4 but is different onward) have a commonality to them:

Steps 2, 3, 6, and 7 require a copy of the data to be available for C in memory. Saving the results in steps 3, 4, 5, or 6-8 would require creating variables in Stata in addition to allocating memory in C. Also, there is an inefficiency throughout in casting doubles to and from 64-bit integers. Here is the stuff I could write:

Is that useful? Running any one of those would be fast, but running any sequence of those would be really slow. Of course, if the user could somehow interact with C results directly then it would not be a problem, but this is not possible with the current plugin interface.

---
I will close this issue, as you suggest, and we can continue the discussion of additions to gtools (including an API, integration with reghdfe, etc.) in issue #30. |
---

I'm running a fairly complex gegen on a fairly large dataset:

Immediately after that I get a segfault, even when using a server with lots of memory (64 GB). This is on Stata 14 MP/6 on Linux.

I would usually try to provide a minimum working example that replicates this error, but even when logging or when running with set trace on, I can't catch what is going on. Is it possible that there is not enough memory to copy all the variables and that this causes the segfault?

Thanks!
Sergio