Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

向 varchar字段保存中文中会被截断 #182

Open
dotqi opened this issue Aug 2, 2022 · 12 comments
Open

向 varchar字段保存中文中会被截断 #182

dotqi opened this issue Aug 2, 2022 · 12 comments

Comments

@dotqi
Copy link

dotqi commented Aug 2, 2022

向 varchar字段保存中文中会被截断。
测试发现。文件param.go绑定strinig时。size取的是转为UTF16后的长度。
但一个中文在UTF8时占用3个字节。size应该取UTF8的长度。
这样才可以保证向varchar字段保存中文时不会被截断。

@Salmon-x
Copy link

Salmon-x commented Nov 2, 2022

所以应该改动哪里呢

@dotqi
Copy link
Author

dotqi commented Nov 3, 2022

我个人的处理办法是在param.go的67行下增加

		if p.isDescribed && (p.SQLType == api.SQL_VARCHAR || p.SQLType == api.SQL_CHAR) {
			size = api.SQLULEN(len(d))
			if size < 1 {
				// size cannot be less then 1 even for empty fields
				size = 1
			}
		}

个人能力有限。你要好好测试一下。

@fanybook
Copy link

fanybook commented Dec 9, 2022

@dotqi 好奇请教一下

UTF16 和 UTF8 在中文长度上有区别么?

@dotqi
Copy link
Author

dotqi commented Dec 10, 2022

package main

import (
"fmt"
"unicode/utf16"
)

func main() {
val := "你好"
fmt.Println(val)
u16 := StringToUTF16(val)
fmt.Println("u16", len(u16), u16)
// u16 3 [20320 22909 0]
u8 := []byte(val)
fmt.Println("u8", len(u8), u8)
// u8 6 [228 189 160 229 165 189]
}

func StringToUTF16(s string) []uint16 { return utf16.Encode([]rune(s + "\x00")) }

@fanybook
Copy link

fanybook commented Dec 12, 2022

package main

import ( "fmt" "unicode/utf16" )

func main() { val := "你好" fmt.Println(val) u16 := StringToUTF16(val) fmt.Println("u16", len(u16), u16) // u16 3 [20320 22909 0] u8 := []byte(val) fmt.Println("u8", len(u8), u8) // u8 6 [228 189 160 229 165 189] }

func StringToUTF16(s string) []uint16 { return utf16.Encode([]rune(s + "\x00")) }

@dotqi 你这个实验,我做了,感觉你这个不对

  1. UTF16 和 UTF8 都是可变长度的编码,区别在于 UTF16 最小是两字节,UTF8 最小是一字节
  2. 你这个计算的是 []uint16 的长度,而非 []byte 长度
    u16 := StringToUTF16(val)
    fmt.Println("u16", len(u16), u16)
  3. 你这个计算的是 []byte 的长度
    u8 := []byte(val)
    fmt.Println("u8", len(u8), u8)

但实际上,u16 可以转回成 []rune,他们的长度是一模一样的

val := "你好𠀾"
[]uint16 = [20320 22909 55360 56382]
[]rune = [20320 22909 131134]
[]byte = [228 189 160 229 165 189 240 160 128 190]

rune 实际是 int32,比 uint16 表示的多,131134 和 55360 56382 都表示 “𠀾”,它实际占 4 个字节 [240 160 128 190]

@fanybook
Copy link

fanybook commented Dec 12, 2022

@dotqi

分析:

  1. size = api.SQLULEN(len(d)) 看你改的方式,实际上是 len(s),其实就是 []byte 的长度
  2. 看这个库原来的方式,比较奇怪,它底层是用的 cgo 调的 c 的 odbc,不明白它为什么转 utf16 而不是转 byte,因为 c 里边都是 char,api.StringToUTF16 给字符串后面加了一个 \0,就是告诉 c 这是一个字符串,它转完 utf16 又给长度*2 做了 buflen
    	b := api.StringToUTF16(d)
    	p.Data = b
    	buf = unsafe.Pointer(&b[0])
    	l := len(b)
    	l -= 1 // remove terminating 0
    	size = api.SQLULEN(l)
    	if size < 1 {
    		// size cannot be less then 1 even for empty fields
    		size = 1
    	}
    	l *= 2 // every char takes 2 bytes
    	buflen = api.SQLLEN(l)
    	plen = p.StoreStrLen_or_IndPtr(buflen)
  3. []byte 长度可以为 0,string 就必须是1,应该是因为 \0 的缘故

错误的点:
l *= 2 // every char takes 2 bytes
4字节的 utf16 转换后是 2 个 uint16,*2 没问题
但是 3字节的 utf16 转换后可能是 1 个 uint16,这里 *2 就错了,*3才对

按照 case []byte: 的改写:

	case string:
		ctype = api.SQL_C_WCHAR
		// b := api.StringToUTF16(d)
		// p.Data = b
		// buf = unsafe.Pointer(&b[0])
		// l := len(b)
		// l -= 1 // remove terminating 0
		// size = api.SQLULEN(l)
		// if size < 1 {
		// 	// size cannot be less then 1 even for empty fields
		// 	size = 1
		// }
		// l *= 2 // every char takes 2 bytes
		// buflen = api.SQLLEN(l)
		// plen = p.StoreStrLen_or_IndPtr(buflen)
		b := []byte(d)
		p.Data = append(b, byte(0))
		buf = unsafe.Pointer(&b[0])
		buflen = api.SQLLEN(len(b))
		plen = p.StoreStrLen_or_IndPtr(buflen)
		size = api.SQLULEN(len(b))
		if size < 1 {
			// size cannot be less then 1 even for empty fields
			size = 1
		}

大胆点:因为使用了 append 所以不用再判断 size < 1 了,我不太明白为什么 buflen 不能包含 \0 的长度,所以参照原来减 1 了

https://learn.microsoft.com/zh-Cn/sql/odbc/reference/syntax/sqlbindparameter-function?view=sql-server-ver16

		b := append([]byte(d), byte(0))
		p.Data = b
		buf = unsafe.Pointer(&b[0])
		buflen = api.SQLLEN(len(b) - 1)
		plen = p.StoreStrLen_or_IndPtr(buflen)
		size = api.SQLULEN(len(b))

@dotqi
Copy link
Author

dotqi commented Dec 12, 2022

谢谢,回头按你的方式测试一下。

@zhangyongding
Copy link

增加一个PR:#195

@zhangyongding
Copy link

可以合并吗?

@alexbrainman
Copy link
Owner

I don't understand Chinese. If you want help, you should explain in English.

Alex

@zhangyongding
Copy link

When saving non ASCII characters to varchar fields, they will be truncated. The reason is that the length calculation is based on the length of the UTF-16 encoding, and if the actual data is in UTF-8, the calculated length will be smaller than the actual length, resulting in the actual data stored in the database being truncated.

@alexbrainman
Copy link
Owner

@zhangyongding thanks for explaining.

I assume you are suggesting that your PR #195 is fixing the issue.

If you want me to review your PR, you should add a test that demonstrate the problem, test that demonstrate the problem is fixed by your PR.

Alex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants