Tuesday, October 30, 2012

Migrating subset of database from singlebyte WE8MSWIN1252 to multibyte UTF8


This is a small showcase for exporting a small subset of a single byte character set database and importing to multibyte character set.

This showcase does not make use of Csscan tool.
This tool analyzes the code points in the database and verifies that they are known in the source the database character set and known in the target database character set.
The output of the Csscan utility is vital to determining the health of the data in the database, and this becomes especially crucial when changing character sets, as well as during import and export operations or database link transfers when source and target database differ in character set.


We will achieve this by setting the environment variable NLS_LANG to our source character set WE8MSWIN1252, when doing both the import and the export.

On the source db we create a demo table and populate it with 100,000 rows:

C:\Users\BERCA>set ORACLE_HOME=Q:\app\product\11.2.0\dbhome_1

C:\Users\BERCA>set ORACLE_SID=HORIA2

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for 64-bit Windows: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

SQL> set linesize 300
SQL> set pagesize 80
SQL> select * from nls_database_parameters
  2  where parameter='NLS_CHARACTERSET';

PARAMETER                      VALUE
------------------------------ ----------------------------------------
NLS_CHARACTERSET               WE8MSWIN1252


SQL> drop table v1;

Table dropped.

SQL> drop sequence seq;

Sequence dropped.

SQL> create table v1 (id number(9), col varchar2(144));

Table created.

SQL> create sequence seq start with 100 maxvalue 1000000000;

Sequence created.

SQL> set timing on;
SQL> begin
  2  for i in 1..100000 loop
  3   execute immediate 'insert into v1 (id,col) values
  4  (seq.nextval,''1234567890abcdefghijklmnopqrstuvwxyz1234567890abcdefghijklmnopqrstuvwxyz1234567890abcdefgh
ijklmnopqrstuvwxyz1234567890abcdefghijklmnopqrstuvwxyz'')';
  5  end loop
  6  commit;
  7  end;
  8  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:04.76
SQL> select count(*) from v1;

  COUNT(*)
----------
    100000

Elapsed: 00:00:00.14
SQL> set timing off;
SQL> select sum(bytes) from user_segments where segment_name='V1';

SUM(BYTES)
----------
  18874368

SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

C:\Users\BERCA>set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252

C:\Users\BERCA>expdp horia/horia DIRECTORY=temp DUMPFILE=v1_exp.dmp TABLES=v1

Export: Release 11.2.0.3.0 - Production on Tue Oct 30 14:52:31 2012

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "HORIA"."SYS_EXPORT_TABLE_01":  horia/******** DIRECTORY=temp DUMPFILE=v1_exp.dmp TABLES=v1
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 18 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
. . exported "HORIA"."V1"                                14.68 MB  100000 rows
Master table "HORIA"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for HORIA.SYS_EXPORT_TABLE_01 is:
  Q:\APP\TEMP\V1_EXP.DMP
Job "HORIA"."SYS_EXPORT_TABLE_01" successfully completed at 14:52:41


On the target, where we have NLS_CHARACTERSET = AL32UTF8, we alse set the variable NLS_LANG to WE8MSWIN1252:

C:\Users\BERCA>set ORACLE_HOME=Q:\app\product\11.2.0\dbhome_1

C:\Users\BERCA>set ORACLE_SID=HORIA

C:\Users\BERCA>set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252

SQL> select banner from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for 64-bit Windows: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production


SQL> set linesize 300
SQL> set pagesize 80
SQL> select * from nls_database_parameters
  2  where parameter='NLS_CHARACTERSET';

PARAMETER                      VALUE
------------------------------ ----------------------------------------------------------------------------
---------------------------------------------------------------------------------
NLS_CHARACTERSET               AL32UTF8


C:\Users\BERCA>impdp horia/horia DIRECTORY=temp DUMPFILE=v1_exp.dmp

Import: Release 11.2.0.3.0 - Production on Tue Oct 30 14:55:35 2012

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Master table "HORIA"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
Starting "HORIA"."SYS_IMPORT_FULL_01":  horia/******** DIRECTORY=temp DUMPFILE=v1_exp.dmp
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "HORIA"."V1"                                14.68 MB  100000 rows
Job "HORIA"."SYS_IMPORT_FULL_01" successfully completed at 14:55:39


SQL> select sum(bytes) from user_segments where segment_name='V1';

SUM(BYTES)
----------
  18874368

An observation we cam make is that the size of the table remains the same.
How is this possible?

US7ASCII characters (A-Z,a-Z,0-1 and ./?,*# etc..) are in AL32UTF8 always 1 byte, so for most West European languages the impact is rather limited as only "special" characters like accented e a etc will use more bytes than in an 8 bit character set.
When converting a Cyrillic or Arabic system to AL32UTF8 then all the Cyrillic or Arabian data will take considerable more bytes to store.

You may find useful to research these notes as a reference for your projects:

    AL32UTF8 / UTF8 (Unicode) Database Character Set Implications [ID 788156.1]
    B.2) How much will my database grow when going to AL32UTF8?
    B.5) Going to AL32UTF8 from another characterset.
  Changing the NLS_CHARACTERSET to AL32UTF8 / UTF8 (Unicode) [ID 260192.1]
  Csscan output explained [ID 444701.1]
  Storing and Checking Character Codepoints in a UTF8/AL32UTF8 (Unicode) database [69518.1]
  ORA-01401 / ORA-12899 While Importing Data In An AL32UTF8 / UTF8 (Unicode) Or Other Multibyte NLS_CHARACTERSET Database. [ID 1297961.1]