Wrong number of groups #90

Open
m-elholm opened this issue Apr 27, 2023 · 1 comment
Comments

@m-elholm

Describe the bug
I think there is an error in gunique and also in gegen xx = nunique. In a sample of 35 million observations it does not count the number of unique values correctly: when I generate a variable x = _n, there should be 35 million unique values, but it only counts about 25 million.

// code snippet
clear
set obs 35000000
gen x = _n
gunique x

Version info

  • OS: [e.g. Windows 10]
  • Version: [i.e. output of gtools]
@mcaceresb
Owner

@m-elholm I think the more likely explanation is that you've run into the precision limits of 4-byte floats (see the generating IDs section here). This snippet shows that gunique is working correctly and that x is indeed the problem: it has repeated values:

. clear

. set obs 35000000
Number of observations (_N) was 0, now 35,000,000.

. gen x = _n

. gen long y = _n

. gen double z = _n

. gunique x
N = 35,000,000; 25,527,216 unbalanced groups of sizes 1 to 5

. gunique y
N = 35,000,000; 35,000,000 balanced groups of size 1

. gunique z
N = 35,000,000; 35,000,000 balanced groups of size 1


. format %21.0fc x y z

. l in `=_N-4'/l

          +--------------------------------------+
          |          x            y            z |
          |--------------------------------------|
34999996. | 34,999,996   34,999,996   34,999,996 |
34999997. | 34,999,996   34,999,997   34,999,997 |
34999998. | 35,000,000   34,999,998   34,999,998 |
34999999. | 35,000,000   34,999,999   34,999,999 |
35000000. | 35,000,000   35,000,000   35,000,000 |
          +--------------------------------------+
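
For reference, you can check the float rounding directly: a float carries 24 significant bits, so not every integer above 2^24 = 16,777,216 is representable and _n collapses onto nearby representable values. A quick check using one of the values from the listing above:

// float(34999997) rounds to the nearest representable 4-byte float,
// which should print as 34999996 -- the same collapse visible in x above
display %21.0f float(34999997)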

One solution is to store such data as `c(obs_t)', which contains the smallest data type that can store _n (and will change as the number of observations in your data changes).
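
A minimal sketch of that fix, assuming the same 35 million observations as above:

// `c(obs_t)' expands to the smallest storage type that can hold _N (long here),
// so every value of _n is stored exactly and gunique should report 35,000,000 groups
clear
set obs 35000000
gen `c(obs_t)' x = _n
gunique x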
