[Python] Truncating UTF-8 Korean Strings :: nlogn's log :: 천천히 달리기


nlogn 2011. 3. 18. 14:37


[Source: http://www.coolsms.co.kr/?document_srl=62073]

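This time we cover the same topic as the PHP UTF-8 Korean string truncation post, but in Python.

As in the PHP version, the string is cut by size as measured in the Wansung (완성형, precomposed) Korean encoding: each Hangul character counts as 2 bytes and each English character as 1 byte, even though the input itself is UTF-8.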
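To make the counting rule concrete, the sketch below compares how many bytes a character occupies in UTF-8 and in EUC-KR, a common Wansung encoding. It assumes Python 3 and its built-in `euc_kr` codec; `encoded_sizes` is a hypothetical helper name used only for illustration.

```python
def encoded_sizes(ch):
    """Return (UTF-8 size, EUC-KR size) in bytes for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("euc_kr"))


for ch in "가a":
    utf8_size, euckr_size = encoded_sizes(ch)
    print(ch, utf8_size, euckr_size)
# 가 3 2
# a 1 1
```

A Hangul syllable takes 3 bytes in UTF-8 but 2 bytes in EUC-KR, which is why the functions in this post count it as 2.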

- Truncating UTF-8 Korean strings

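# -*- coding: utf-8 -*-
import re

# One UTF-8 character: a lead byte plus its continuation bytes, or one ASCII byte.
UTF8_CHAR = re.compile(br"[\xC0-\xFF][\x80-\xBF]+|[\x00-\x7F]")

def strcut_utf8(text, destlen, checkmb=True, tail=""):
    """
    UTF-8 Format
    0xxxxxxx = ASCII, 110xxxxx 10xxxxxx or 1110xxxx 10xxxxxx 10xxxxxx
    Non-ASCII Latin, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic letters take 2 bytes
    The rest of the BMP (Basic Multilingual Plane) takes 3 bytes (including Hangul and Japanese)
    """
    # Work on UTF-8 bytes; accept both byte strings and Unicode text.
    data = text if isinstance(text, bytes) else text.encode("utf-8")
    tail_bytes = tail if isinstance(tail, bytes) else tail.encode("utf-8")

    def width(ch):
        # A multibyte character counts as 2 (Wansung size) when checkmb is set.
        return 2 if checkmb and len(ch) > 1 else 1

    chars = UTF8_CHAR.findall(data)
    if sum(width(ch) for ch in chars) <= destlen:
        return text  # already fits, nothing to cut

    tlen = sum(width(ch) for ch in UTF8_CHAR.findall(tail_bytes))
    count = 0
    kept = []
    for ch in chars:
        count += width(ch)
        # Stop before the character that would push the result past destlen.
        if count + tlen > destlen:
            break
        kept.append(ch)
    result = b"".join(kept) + tail_bytes
    # Return the same type that was passed in.
    return result if isinstance(text, bytes) else result.decode("utf-8")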
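The byte layouts listed in the docstring can be checked against real encoded text. The sketch below assumes Python 3; `utf8_sequence_length` is a hypothetical helper that reads the leading bits of the first byte to tell how long the sequence is. The 4-byte case is not mentioned in the docstring and is included only for completeness.

```python
def utf8_sequence_length(lead):
    """Return the UTF-8 sequence length announced by a lead byte (an int 0-255)."""
    if lead < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence (includes Hangul)
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: 4-byte sequence (outside the BMP)
        return 4
    raise ValueError("not a lead byte: 0x%02X" % lead)


# The announced length should match the real encoded length.
for ch in ("a", "é", "가"):
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
```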

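In the call below we ask for 5 bytes. Measured in Wansung bytes, however, the 5th byte is the first half of the Hangul character '다', so to avoid breaking that character the function cuts at 4 bytes and returns the string "가나".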

print(strcut_utf8("가나다라마바사아자차카타파하", 5, True, ""))

가나
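On Python 3, where `str` is already Unicode text, the same rule can be applied without a byte-level regular expression. This is a minimal sketch and not the original author's code: `char_width` and `cut_by_width` are hypothetical names, and it assumes every non-ASCII character counts as 2.

```python
def char_width(ch, checkmb=True):
    """Counted width: 2 for non-ASCII characters when checkmb is set, else 1."""
    return 2 if checkmb and ord(ch) > 0x7F else 1


def cut_by_width(text, destlen, checkmb=True, tail=""):
    """Cut text so that its counted width, including tail, fits in destlen."""
    if sum(char_width(ch, checkmb) for ch in text) <= destlen:
        return text  # already fits, nothing to cut
    # Reserve room for the tail, then spend the rest one character at a time.
    budget = destlen - sum(char_width(ch, checkmb) for ch in tail)
    kept = []
    for ch in text:
        budget -= char_width(ch, checkmb)
        if budget < 0:
            break
        kept.append(ch)
    return "".join(kept) + tail


print(cut_by_width("가나다라마바사아자차카타파하", 5))  # 가나
print(cut_by_width("abc가나다", 6, tail=".."))  # abc..
```

Working on characters instead of bytes means a character can never be split, so the only decision left is how much each one costs.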

A strlen function can be written in the same way, as shown below.

- Getting the size on a Wansung basis

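# -*- coding: utf-8 -*-
import re

def strlen_utf8(text):
    """
    UTF-8 Format
    0xxxxxxx = ASCII, 110xxxxx 10xxxxxx or 1110xxxx 10xxxxxx 10xxxxxx
    Non-ASCII Latin, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic letters take 2 bytes
    The rest of the BMP (Basic Multilingual Plane) takes 3 bytes (including Hangul and Japanese)
    """
    # Work on UTF-8 bytes; accept both byte strings and Unicode text.
    data = text if isinstance(text, bytes) else text.encode("utf-8")
    # One UTF-8 character: a lead byte plus its continuation bytes, or one ASCII byte.
    pattern = br"[\xC0-\xFF][\x80-\xBF]+|[\x00-\x7F]"
    count = 0
    for match in re.finditer(pattern, data):
        # A multibyte character counts as 2 (Wansung size), ASCII as 1.
        count += 2 if len(match.group(0)) > 1 else 1
    return count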

