向 varchar字段保存中文中会被截断 #182

dotqi · 2022-08-02T09:37:02Z

向 varchar字段保存中文中会被截断。
测试发现。文件param.go绑定strinig时。size取的是转为UTF16后的长度。
但一个中文在UTF8时占用3个字节。size应该取UTF8的长度。
这样才可以保证向varchar字段保存中文时不会被截断。

Salmon-x · 2022-11-02T06:26:42Z

所以应该改动哪里呢

dotqi · 2022-11-03T08:21:46Z

我个人的处理办法是在param.go的67行下增加

		if p.isDescribed && (p.SQLType == api.SQL_VARCHAR || p.SQLType == api.SQL_CHAR) {
			size = api.SQLULEN(len(d))
			if size < 1 {
				// size cannot be less then 1 even for empty fields
				size = 1
			}
		}

个人能力有限。你要好好测试一下。

fanybook · 2022-12-09T07:24:07Z

@dotqi 好奇请教一下

UTF16 和 UTF8 在中文长度上有区别么？

dotqi · 2022-12-10T01:33:34Z

package main

import (
"fmt"
"unicode/utf16"
)

func main() {
val := "你好"
fmt.Println(val)
u16 := StringToUTF16(val)
fmt.Println("u16", len(u16), u16)
// u16 3 [20320 22909 0]
u8 := []byte(val)
fmt.Println("u8", len(u8), u8)
// u8 6 [228 189 160 229 165 189]
}

func StringToUTF16(s string) []uint16 { return utf16.Encode([]rune(s + "\x00")) }

fanybook · 2022-12-12T03:11:38Z

package main

import ( "fmt" "unicode/utf16" )

func main() { val := "你好" fmt.Println(val) u16 := StringToUTF16(val) fmt.Println("u16", len(u16), u16) // u16 3 [20320 22909 0] u8 := []byte(val) fmt.Println("u8", len(u8), u8) // u8 6 [228 189 160 229 165 189] }

func StringToUTF16(s string) []uint16 { return utf16.Encode([]rune(s + "\x00")) }

@dotqi 你这个实验，我做了，感觉你这个不对

UTF16 和 UTF8 都是可变长度的编码，区别在于 UTF16 最小是两字节，UTF8 最小是一字节

你这个计算的是 []uint16 的长度，而非 []byte 长度

u16 := StringToUTF16(val)
fmt.Println("u16", len(u16), u16)

你这个计算的是 []byte 的长度

u8 := []byte(val)
fmt.Println("u8", len(u8), u8)

但实际上，u16 可以转回成 []rune，他们的长度是一模一样的

val := "你好𠀾"
[]uint16 = [20320 22909 55360 56382]
[]rune = [20320 22909 131134]
[]byte = [228 189 160 229 165 189 240 160 128 190]

rune 实际是 int32，比 uint16 表示的多，131134 和 55360 56382 都表示 “𠀾”，它实际占 4 个字节 [240 160 128 190]

fanybook · 2022-12-12T03:51:16Z

@dotqi

分析：

size = api.SQLULEN(len(d)) 看你改的方式，实际上是 len(s)，其实就是 []byte 的长度

看这个库原来的方式，比较奇怪，它底层是用的 cgo 调的 c 的 odbc，不明白它为什么转 utf16 而不是转 byte，因为 c 里边都是 char，api.StringToUTF16 给字符串后面加了一个 \0，就是告诉 c 这是一个字符串，它转完 utf16 又给长度*2 做了 buflen

	b := api.StringToUTF16(d)
	p.Data = b
	buf = unsafe.Pointer(&b[0])
	l := len(b)
	l -= 1 // remove terminating 0
	size = api.SQLULEN(l)
	if size < 1 {
		// size cannot be less then 1 even for empty fields
		size = 1
	}
	l *= 2 // every char takes 2 bytes
	buflen = api.SQLLEN(l)
	plen = p.StoreStrLen_or_IndPtr(buflen)

[]byte 长度可以为 0，string 就必须是1，应该是因为 \0 的缘故

错误的点：
l *= 2 // every char takes 2 bytes
4字节的 utf16 转换后是 2 个 uint16，*2 没问题
但是 3字节的 utf16 转换后可能是 1 个 uint16，这里 *2 就错了，*3才对

按照 case []byte: 的改写：

	case string:
		ctype = api.SQL_C_WCHAR
		// b := api.StringToUTF16(d)
		// p.Data = b
		// buf = unsafe.Pointer(&b[0])
		// l := len(b)
		// l -= 1 // remove terminating 0
		// size = api.SQLULEN(l)
		// if size < 1 {
		// 	// size cannot be less then 1 even for empty fields
		// 	size = 1
		// }
		// l *= 2 // every char takes 2 bytes
		// buflen = api.SQLLEN(l)
		// plen = p.StoreStrLen_or_IndPtr(buflen)
		b := []byte(d)
		p.Data = append(b, byte(0))
		buf = unsafe.Pointer(&b[0])
		buflen = api.SQLLEN(len(b))
		plen = p.StoreStrLen_or_IndPtr(buflen)
		size = api.SQLULEN(len(b))
		if size < 1 {
			// size cannot be less then 1 even for empty fields
			size = 1
		}

大胆点：因为使用了 append 所以不用再判断 size < 1 了，我不太明白为什么 buflen 不能包含 \0 的长度，所以参照原来减 1 了

https://learn.microsoft.com/zh-Cn/sql/odbc/reference/syntax/sqlbindparameter-function?view=sql-server-ver16

		b := append([]byte(d), byte(0))
		p.Data = b
		buf = unsafe.Pointer(&b[0])
		buflen = api.SQLLEN(len(b) - 1)
		plen = p.StoreStrLen_or_IndPtr(buflen)
		size = api.SQLULEN(len(b))

dotqi · 2022-12-12T08:22:21Z

谢谢，回头按你的方式测试一下。

zhangyongding · 2024-04-02T05:35:36Z

增加一个PR：#195

zhangyongding · 2024-08-16T02:04:51Z

可以合并吗？

alexbrainman · 2024-08-17T04:31:55Z

I don't understand Chinese. If you want help, you should explain in English.

Alex

zhangyongding · 2024-08-30T02:46:50Z

When saving non ASCII characters to varchar fields, they will be truncated. The reason is that the length calculation is based on the length of the UTF-16 encoding, and if the actual data is in UTF-8, the calculated length will be smaller than the actual length, resulting in the actual data stored in the database being truncated.

alexbrainman · 2024-08-31T01:59:28Z

@zhangyongding thanks for explaining.

I assume you are suggesting that your PR #195 is fixing the issue.

If you want me to review your PR, you should add a test that demonstrate the problem, test that demonstrate the problem is fixed by your PR.

Alex

alexbrainman mentioned this issue Oct 3, 2024

Resolve the issue of Chinese truncation during varchar insertion #195

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

向 varchar字段保存中文中会被截断 #182

向 varchar字段保存中文中会被截断 #182

dotqi commented Aug 2, 2022

Salmon-x commented Nov 2, 2022

dotqi commented Nov 3, 2022 •

edited

Loading

fanybook commented Dec 9, 2022

dotqi commented Dec 10, 2022

fanybook commented Dec 12, 2022 •

edited

Loading

fanybook commented Dec 12, 2022 •

edited

Loading

dotqi commented Dec 12, 2022

zhangyongding commented Apr 2, 2024

zhangyongding commented Aug 16, 2024

alexbrainman commented Aug 17, 2024

zhangyongding commented Aug 30, 2024

alexbrainman commented Aug 31, 2024

向 varchar字段保存中文中会被截断 #182

向 varchar字段保存中文中会被截断 #182

Comments

dotqi commented Aug 2, 2022

Salmon-x commented Nov 2, 2022

dotqi commented Nov 3, 2022 • edited Loading

fanybook commented Dec 9, 2022

dotqi commented Dec 10, 2022

fanybook commented Dec 12, 2022 • edited Loading

fanybook commented Dec 12, 2022 • edited Loading

dotqi commented Dec 12, 2022

zhangyongding commented Apr 2, 2024

zhangyongding commented Aug 16, 2024

alexbrainman commented Aug 17, 2024

zhangyongding commented Aug 30, 2024

alexbrainman commented Aug 31, 2024

dotqi commented Nov 3, 2022 •

edited

Loading

fanybook commented Dec 12, 2022 •

edited

Loading

fanybook commented Dec 12, 2022 •

edited

Loading