Wednesday, April 26, 2006

Implementation Inheritance

Inheritance Series Continued Part 2

Continuing from the previous entry, another interesting problem that can be solved by the use of multiple inheritance is streams. When dealing with streams, we need to handle particular types of destinations, and then we need to format the stream into data types so that higher-level classes can work in terms of specific data types.

To solve this problem using multiple inheritance, we could have
  1. Classes implementing a Stream interface on specific low-level destinations such as file systems, network connections etc.
  2. Classes implementing formatters which know how to format higher-level data types such as integers, strings etc. These classes could inherit from the low-level stream classes to provide specialized classes that can stream data at higher levels of abstraction. For example, we could stream strings, integers etc. into file system streams, network connections etc.

In the above usage scenario of multiple inheritance, if we had n stream types and m data types, we could have up to n*m inheritance combinations. This gives us great flexibility in implementation.
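
The combination idea can be sketched in C++ as below. This is only a minimal illustration under stated assumptions: Stream, MemoryStream, IntFormatter, StringFormatter and TextMemoryStream are hypothetical names, and an in-memory buffer stands in for a real file or network destination.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical low-level interface: a destination that accepts raw bytes.
class Stream {
public:
    virtual ~Stream() {}
    virtual void write(const char* data, std::size_t len) = 0;
};

// One of the "n" destinations (an in-memory buffer, for illustration).
class MemoryStream : public virtual Stream {
public:
    void write(const char* data, std::size_t len) override { buf_.append(data, len); }
    const std::string& contents() const { return buf_; }
private:
    std::string buf_;
};

// Two of the "m" formatters. They only know how to turn a data type
// into bytes; where the bytes go is left to whichever Stream is mixed in.
class IntFormatter : public virtual Stream {
public:
    void putInt(int v) {
        const std::string s = std::to_string(v);
        write(s.data(), s.size());
    }
};

class StringFormatter : public virtual Stream {
public:
    void putString(const std::string& s) { write(s.data(), s.size()); }
};

// One combination: a destination mixed with the formatters it needs.
// The virtual base ensures there is a single Stream subobject, and
// MemoryStream::write is its only implementation.
class TextMemoryStream : public MemoryStream,
                         public IntFormatter,
                         public StringFormatter {};

// Usage sketch.
inline std::string demoStreams() {
    TextMemoryStream out;
    out.putString("id=");
    out.putInt(42);
    assert(!out.contents().empty());
    return out.contents();
}
```

Adding a new destination or a new formatter then only requires one new class, and the combinations are declared as needed.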

To be continued ...

Saturday, April 22, 2006

Interface and Implementation Inheritance

Inheritance Series Part 1

Recently while working on a problem, I ran into an issue with multiple-inheritance. While the problem itself was due to a compiler bug, it made me think about the virtues and vices of inheritance. Here, I am logging some of my current thoughts.

It seems that in the object-oriented paradigm, inheritance really has two broad uses

  1. Polymorphism
  2. Code Re-use

Polymorphism

Inheritance is at the heart of polymorphism, which is one of the stated missions of object orientation. Though inheritance is not the only way of achieving polymorphism (as in Smalltalk), its use along with late binding provides a very elegant solution. In this use case there are basically two points to be noted -

  1. Means to specify the contract which the outside world will use - Interface Specification
  2. Means to implement and stagger the implementation - which is obviously supported by inheritance. A single inheritance hierarchy is more than sufficient to stagger the implementation from a generalized version to more specific versions.

In other words, what we are really doing is inheriting the interface for the purpose of implementation and specialization, and this concept is called Interface Inheritance. Specialization is not implementation inheritance, as we are not inheriting the implementation but overriding it.

Code Re-use

Another use of inheritance, and my current object of curiosity, is Implementation Inheritance. This concept is generally considered taboo, and languages such as Java have banished its multiple-inheritance form altogether. C++, on the other hand, even provides the means to achieve it.

The idea of implementation inheritance is to use mix-in classes so that the derived class inherits the functionality provided by the mixed-in class to achieve its implementation, without exposing the interface of the mixed-in class. This last point is pretty important.

Let's consider a simple situation where multiple inheritance could be useful. Consider the code snippet shown below -


class Element; // element type, defined elsewhere

class ListImpl
{
public:
    void add(const Element&);
    void add(const Element&, int pos);
    Element& remove(int pos);
    Element& operator[](int pos);
};

class Stack
{
public:
    virtual ~Stack() {}
    virtual void push(const Element&) = 0;
    virtual Element& pop() = 0;
};

class StackImpl
    : public Stack          // inherit the interface
    , protected ListImpl    // part of implementation
{
public:
    void push(const Element&);
    Element& pop();
};

In the above example, StackImpl inherits from the Stack ABC to inherit its interface, which will be used by its clients. This is typical and is what we normally understand by inheritance. However, the inheritance of ListImpl is part of the stack implementation and is not part of its supported and exposed interface, and hence is Implementation Inheritance. C++ provides a nice way to hide the implementation inheritance from users by letting the subclass use private or protected inheritance. Had it inherited publicly, the List interface would also have been part of the exposed interface of the StackImpl class, and this is not what we want.

This is not the only design choice available to Stack implementors. We could also have used a Composition relationship between the StackImpl and ListImpl classes. This would have been a great way, and possibly the best way. But if StackImpl needed to access something internal to ListImpl (perhaps StackImpl and ListImpl are implemented by the same developer and StackImpl needs that access for the sake of performance), then the only elegant way would have been to inherit from it. Otherwise, we would have had to expose the ListImpl internals.
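
For comparison, the composition alternative might look like the sketch below. It rests on several assumptions that are not in the original snippet: Element is typedef'd to int, ListImpl is backed by std::vector, pop() returns by value, and ComposedStack is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef int Element;  // assumption: the original leaves Element unspecified

// Simplified list, backed by std::vector for illustration only.
class ListImpl {
public:
    void add(const Element& e) { items_.push_back(e); }
    Element remove(std::size_t pos) {
        Element e = items_.at(pos);  // throws std::out_of_range on a bad index
        items_.erase(items_.begin() + static_cast<std::ptrdiff_t>(pos));
        return e;
    }
    std::size_t size() const { return items_.size(); }
private:
    std::vector<Element> items_;
};

class Stack {
public:
    virtual ~Stack() {}
    virtual void push(const Element&) = 0;
    virtual Element pop() = 0;
};

// Composition: the list is a private member, so only its public
// interface is reachable from the stack.
class ComposedStack : public Stack {
public:
    void push(const Element& e) override { list_.add(e); }
    Element pop() override { return list_.remove(list_.size() - 1); }
private:
    ListImpl list_;
};

// Usage sketch.
inline Element demoComposedStack() {
    ComposedStack s;
    s.push(1);
    s.push(2);
    assert(s.pop() == 2);  // last in, first out
    return s.pop();
}
```

Note that ComposedStack can only use what ListImpl makes public, which is exactly the limitation discussed above.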

Friday, April 21, 2006

Cost of Exception handling

Exception Series Part 2

There are obviously a lot of apparent advantages to using exceptions over returning error values. However, I wanted to find out the cost of exception handling as compared to checking a return value. So, I wrote a simple Java program and timed the "throughput".


public class Test {
static final int ITER = 10000000;
void func1() throws Exception {
throw new Exception();
}

int func2() {
return 1;
}

public static void main(String[] s) {
Test t = new Test();

// some dumb operation to be done
// in the exc handler and if statement
long someJob1, someJob2;
someJob1 = someJob2 = 0;

// compute the time taken for exceptions
// catch
long start = System.currentTimeMillis();
for (int i = 0; i < ITER; i++) {
try {
t.func1();
}
catch(Exception e) {
// some dumb operation
someJob1++;
}
}
long end = System.currentTimeMillis();
long milli1 = end - start;

// compute time taken for return value
// check
start = System.currentTimeMillis();
for (int i = 0; i < ITER; i++) {
if (t.func2() == 1) {
// some dumb operation
someJob2++;
}
}
end = System.currentTimeMillis();
long milli2 = end - start;

System.out.println("Time spent on exc handling = " + milli1);
System.out.println("Time spent on error checking = " + milli2);
}
};


The outcome on a Compaq nx7010 laptop running Windows XP Professional with 1 GB RAM, using JDK 1.4.2, is -

Time spent on exc handling = 27297
Time spent on error checking = 50


This is amazing. However, one thing to note is that these figures are a bit misleading. Of course exception handling is expensive, but it will not be part of the core path. Remember, an exception is thrown and handled only in "exceptional" situations.

Returning Error values and Exception handling

Exception Series Part 1

Returning and handling error values was the only option available to programmers using procedural languages such as C. With the introduction of technologies such as SEH on Windows and concepts such as OO, object-oriented languages such as C++ gave programmers another approach to handling errors: C++ exceptions. However, since C++ is backward compatible with C, the error value returning approach is still widely used. Newer languages such as Java have gone one step further, supporting newer concepts such as checked and unchecked exceptions.

In this article, I try to document some of the issues I have encountered when programming using error values and some comparison between the two approaches.

Consistency issues when returning error values

Returning error values from methods is very context sensitive, and this leads to a lot of inconsistencies in the design. For example, one method could return a status value to indicate failure or success, while another could return a null address to indicate failure, thus fracturing the design. Moreover, if a method by its nature has to return an integer, then the error value needs to be a “special value” in the range of valid values, which needs to be handled specially. For example, consider a method which returns an employee ID. We could use a negative value to indicate an error and a positive number for a legal employee ID. However, we could then have another method which needs to return a tri-state, such as positive if greater, negative if lesser and 0 if equal. How do we qualify a return value as an error here? Over a considerable period, such issues lead to a lot of design-level inconsistencies in the project.

Strange signatures

Many C libraries, to be consistent in their API structure, take the general approach of always returning a status about the outcome. Such a design, though it provides a consistent interface, unnecessarily complicates the signatures. For example, an API to return an employee ID would then return some positive value on success and 0 on failure, and return the employee ID itself as an "out" parameter, leading to a very clumsy API structure. Where you would have expected the employee ID to be returned by the method, you now have to pass in a reference or pointer to a variable that will hold the employee ID.
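
A minimal sketch of such a signature is shown below. The names getEmployeeId, SGOOD and SFAIL, the hard-coded lookup and the ID value are all hypothetical; they only illustrate the shape of the API.

```cpp
#include <cassert>
#include <cstring>

typedef int STATUS;
const STATUS SFAIL = 0;  // failure
const STATUS SGOOD = 1;  // success (a positive value)

// The status occupies the return value, so the actual result
// has to travel through the "out" parameter idOut.
STATUS getEmployeeId(const char* name, int* idOut) {
    if (name == nullptr || idOut == nullptr) return SFAIL;
    if (std::strcmp(name, "alice") == 0) {  // hypothetical lookup
        *idOut = 1001;
        return SGOOD;
    }
    return SFAIL;  // idOut is left untouched on failure
}

// Usage sketch: the caller must declare a variable up front
// and check the status before trusting it.
inline int demoOutParam() {
    int id = -1;
    STATUS st = getEmployeeId("alice", &id);
    assert(st == SGOOD);
    return id;
}
```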

Complexity and Modularization

On Unix systems, “errno” is famous for its pitfalls. Many Unix system calls set this global variable and provide APIs to retrieve the error number, making such designs unnecessarily complex. Also, error checking needs to be done right at the point of invocation, thereby tightly coupling the business logic and the error handling. If the global error value is not read after one call, then a later call could overwrite it. This approach also forces an error value check after every call. For example -


STATUS doSomething(int a, int b) {
STATUS st;
st = doThing1(a);
if (st != SGOOD) return st;
st = doThing2(b);
if (st != SGOOD)
return st;
return SGOOD;
}


Here doSomething is an intermediate context and it unnecessarily has to carry the overhead of propagating the error context to the calling function.

Exceptions

With the introduction of the concept of exceptions, designers can cleanly separate the business logic from the error handling. The code is much cleaner, as we don't have to check for error values after each and every method invocation. Also, with the introduction of RTTI - Runtime Type Information - exception handling can be very sophisticated, in ways C programmers could only dream of. The doSomething method could be written simply as -

void doSomething(int a, int b) {
doThing1(a);
doThing2(b);
}

The caller of doSomething still has to deal with error handling, but the intermediate context - doSomething - does not have to bother about anything.
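
A self-contained C++ sketch of this propagation follows. The failure conditions inside doThing1 and doThing2 and the callDoSomething wrapper are assumptions made up for the illustration; the handlers are chosen by the dynamic type of the exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical leaf operations that report failure by throwing.
void doThing1(int a) {
    if (a < 0) throw std::invalid_argument("a must be non-negative");
}

void doThing2(int b) {
    if (b == 0) throw std::runtime_error("b must be non-zero");
}

// Intermediate context: no status checks, no propagation code.
void doSomething(int a, int b) {
    doThing1(a);
    doThing2(b);
}

// Only the outermost caller handles errors, and it can pick a
// handler based on the exception type.
std::string callDoSomething(int a, int b) {
    try {
        doSomething(a, b);
        return "ok";
    } catch (const std::invalid_argument& e) {
        return std::string("bad argument: ") + e.what();
    } catch (const std::exception& e) {
        return std::string("failed: ") + e.what();
    }
}
```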

Checked and Unchecked Exception

In C++, programmers have the flexibility to catch or ignore an exception thrown from a called method. Of course, if the exception is not handled at any point in the stack, then the thread or the process could be terminated. All exceptions are said to be unchecked.

However, in Java, any exception inheriting from java.lang.Exception (except for those inheriting from java.lang.RuntimeException) needs to be caught explicitly in the calling context, or else the method signature needs to state that it can be thrown. The idea here is that the language forces the programmer to acknowledge the exception and act. Whether this is good or bad is very debatable, and a lot of discussion can be found by searching Google, but one experience I have had is that a lot of us "lazy" programmers end up swallowing the exception unnecessarily.

Wednesday, April 19, 2006

Array Declaration in IDL and its C++ mapping

CORBA IDL Series Continued (Part 7)

IDL Declaration

<type_dcl> ::= “typedef” <type_spec> <array_declarator>

<array_declarator> ::= <identifier> <fixed_array_size>+

<fixed_array_size> ::= “[” <positive_int_const> “]”

Here the type could be any type.

Arrays could be multidimensional and their sizes are fixed at compile time. One difference between an array and a sequence is that during transmission, all the elements are sent across on the wire.

C++ Mapping

All IDL arrays are mapped to C++ arrays of the C++ mapping of the IDL element type, except for string, wstring, interface types and valuetypes. For these types, the array is mapped to an array of their var type.

An array slice is an array with all the dimensions of the original except the first one. The slice type is named after the array, suffixed by _slice. For example,

// IDL
typedef long LongArray[4][5];
// C++
typedef Long LongArray[4][5];
typedef Long LongArray_slice[5];

Other functions

Also for an array type Foo, the mapping provides functions
// C++
Foo_slice *Foo_alloc();
Foo_slice *Foo_dup(const Foo_slice*);
void Foo_copy(Foo_slice* to, const Foo_slice* from);
void Foo_free(Foo_slice *);
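
To make the slice functions concrete, here is a hand-written approximation for the LongArray example. It is not the code an IDL compiler generates: Long is a stand-in typedef for CORBA::Long, and the function bodies are assumptions about what such helpers plausibly do.

```cpp
#include <cassert>

typedef int Long;                  // stand-in for CORBA::Long
typedef Long LongArray[4][5];
typedef Long LongArray_slice[5];   // all dimensions except the first

// Allocates a zero-initialized 4x5 array and returns it as a slice pointer.
LongArray_slice* LongArray_alloc() {
    return new LongArray_slice[4]();
}

// Copies every element of one array into another.
void LongArray_copy(LongArray_slice* to, const LongArray_slice* from) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 5; ++j)
            to[i][j] = from[i][j];
}

// Allocates a new array holding a copy of the given one.
LongArray_slice* LongArray_dup(const LongArray_slice* from) {
    LongArray_slice* to = LongArray_alloc();
    LongArray_copy(to, from);
    return to;
}

void LongArray_free(LongArray_slice* p) {
    delete[] p;
}

// Usage sketch: the duplicate is independent of the original.
inline Long demoArraySlice() {
    LongArray_slice* a = LongArray_alloc();
    a[3][4] = 7;
    LongArray_slice* b = LongArray_dup(a);
    a[3][4] = 0;
    Long copied = b[3][4];
    LongArray_free(a);
    LongArray_free(b);
    return copied;
}
```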

Var Variables

The var variable has a default constructor and another taking the slice pointer, apart from the copy constructor.

Also, instead of overloading -> operator, the subscript operator is overloaded in the var variable.

Parameter Passing

For an array of type T, the parameter passing rules are as follows -

// IDL
typedef Foo FooArray[xx]; // array of fixed type
typedef Bar BarArray[yy]; // array of variable type
interface I {
FooArray op1(in FooArray, inout FooArray, out FooArray);
BarArray op2(in BarArray, inout BarArray, out BarArray);
};

// C++
class I ... {
..
virtual FooArray_slice* op1(const FooArray, FooArray, FooArray);
virtual BarArray_slice* op2(const BarArray, BarArray, BarArray_slice*&);
};

class I ... {
..
virtual FooArray_var op1(const FooArray_var&, FooArray_var&, FooArray_var&);
virtual BarArray_var op2(const BarArray_var&, BarArray_var&, BarArray_var&);
};

Sequence Declaration in IDL and its C++ mapping

CORBA IDL Series Continued (Part 6)

Sequences are also referred to as template types. A sequence is basically a one-dimensional array with two extra characteristics -
  1. Maximum size, fixed at compile time (for bounded sequences)
  2. Length, which is determined at runtime
Sequences could be bounded or unbounded. For bounded sequences, the runtime length cannot be greater than the declared maximum length. If this is attempted in VisiBroker, a BAD_PARAM exception is thrown.

Declaration in IDL

<sequence_type> ::= “sequence” “<” <simple_type_spec> “,”
<positive_int_const> “>”
| “sequence” “<” <simple_type_spec> “>”

C++ Mapping

Sequences map to a class in C++.

There are four types of constructors -
  1. Default Constructor
  2. Max Length Constructor
  3. Data Constructor
  4. Copy Constructor
Default constructor initializes length to 0 for both types of sequences. For unbounded sequences, max length is further initialized to 0. The release flag is set to true, meaning that the memory is owned by the sequence.

The max length constructor is provided only for unbounded sequences and initializes the max length to the input parameter. If the length is later set beyond this value, the sequence is resized and the initial max length no longer applies. This constructor is not provided for bounded sequences. The release flag is also true here.

Data constructor allows external data to be initialized into the sequence. This is called T* constructor. This constructor takes a pointer to an array of data, length of the input array, a max length for unbounded sequences and a release flag which suggests whether the sequence is to own the array or not. If the release flag is true, then the array should have been allocated using allocbuf() function call. The Sequence will then use the freebuf call to release the array.

The copy constructor initializes the new sequence with the same max length and length as the other sequence, then copies each element one by one and sets the release flag to true.

The assignment operator similarly first deallocates the current contents one by one, then makes a copy of the other sequence and sets the release flag to true.

length() accessor and modifiers allow resizing the sequence.

maximum() returns the max length

The subscript operator, operator[], is overloaded so that it can be used as both an lvalue and an rvalue. When used as an lvalue, the element at the index is written into.
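
The constructors, the release flag and the length()/maximum() pair described above can be modelled with a small hand-written class. This is only a sketch of the semantics for an unbounded sequence of long, not the class an IDL compiler generates; LongSeq and its internals are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

class LongSeq {
public:
    // Default constructor: empty, owns its (non-existent) buffer.
    LongSeq() : max_(0), len_(0), buf_(nullptr), release_(true) {}
    // Max length constructor: pre-allocates, length stays 0.
    explicit LongSeq(std::size_t max)
        : max_(max), len_(0), buf_(allocbuf(max)), release_(true) {}
    // Data (T*) constructor: adopts external data; owns it only if release is true.
    LongSeq(std::size_t max, std::size_t len, long* data, bool release = false)
        : max_(max), len_(len), buf_(data), release_(release) {}
    // Copy constructor: deep copy, the copy always owns its buffer.
    LongSeq(const LongSeq& o)
        : max_(o.max_), len_(o.len_), buf_(allocbuf(o.max_)), release_(true) {
        for (std::size_t i = 0; i < len_; ++i) buf_[i] = o.buf_[i];
    }
    // Assignment via copy-and-swap: old contents are released with the temporary.
    LongSeq& operator=(LongSeq o) { swap(o); return *this; }
    ~LongSeq() { if (release_) freebuf(buf_); }

    static long* allocbuf(std::size_t n) { return n ? new long[n]() : nullptr; }
    static void freebuf(long* p) { delete[] p; }

    std::size_t maximum() const { return max_; }
    std::size_t length() const { return len_; }
    // Length modifier: grows the buffer when the new length exceeds the maximum.
    void length(std::size_t n) {
        if (n > max_) {
            long* bigger = allocbuf(n);
            for (std::size_t i = 0; i < len_; ++i) bigger[i] = buf_[i];
            if (release_) freebuf(buf_);
            buf_ = bigger;
            max_ = n;
            release_ = true;  // the new buffer is owned by the sequence
        }
        len_ = n;
    }
    long& operator[](std::size_t i) { return buf_[i]; }
    const long& operator[](std::size_t i) const { return buf_[i]; }

private:
    void swap(LongSeq& o) {
        std::swap(max_, o.max_);
        std::swap(len_, o.len_);
        std::swap(buf_, o.buf_);
        std::swap(release_, o.release_);
    }
    std::size_t max_;
    std::size_t len_;
    long* buf_;
    bool release_;
};

// Usage sketch.
inline long demoLongSeq() {
    LongSeq s(2);
    s.length(2);
    s[0] = 1;
    s[1] = 2;
    s.length(4);               // grows past the initial maximum
    assert(s.maximum() == 4);
    LongSeq copy(s);
    copy[0] = 9;               // does not affect s
    return s[0] * 10 + s[1];
}
```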

There is an interesting case when the element type of the sequence is a string, wstring, interface or valuetype. For these types, the allocbuf() function of the sequence is mapped to return T** instead of T*, that is, an array of pointers instead of an array of objects. When an element is then assigned through the lvalue subscript operator, the mapping cannot blindly deallocate the old contents, because the sequence may not own them. The mapping takes care of this for the user: if the release flag is true, the old content is freed automatically on assignment; otherwise it is left alone. This is illustrated in the following example -
// IDL
typedef sequence<string, 3> StringSeq;

// C++
char *static_arr[] = {"one", "two", "three"};
char **dyn_arr = StringSeq::allocbuf();
dyn_arr[0] = string_dup("one");
dyn_arr[1] = string_dup("two");
dyn_arr[2] = string_dup("three");
StringSeq seq1(3, static_arr);
StringSeq seq2(3, dyn_arr, TRUE);
seq1[1] = "2"; // no free, no copy
char *str = string_dup("2");
seq2[1] = str; // free old storage, no copy

get_buffer() and replace() provide the further capability to get at the underlying contents of the sequence and to replace the contents with new data.

Sequence_var

When a sequence var is assigned a constant pointer to a sequence, it does not duplicate the contents of the sequence but only increments the reference count.

Union Declaration in IDL and its C++ mapping

CORBA IDL Series Continued (Part 5)

Unions are like C++ unions: the value could be of any one of the types represented in the union. In other words, a union is an ordered list of members, and the value of the union itself is the value of the member chosen by the discriminator. If not all values of the discriminator are represented, then a notion of default exists.

Construct-wise, a union has
  1. Discriminator - which defines the type of the case label
  2. Case statements - which consist of a case label and a case value; the case label is a constant value of the discriminator type, and the case value is the union value when that case label is selected
  3. Default value - used when the discriminator selects none of the case labels

Declaration in IDL

<union_type> ::= “union” <identifier> “switch”
“(” <switch_type_spec> “)” “{” <switch_body> “}”

<switch_type_spec> ::=
<integer_type>
| <char_type>
| <boolean_type>
| <enum_type>

<switch_body> ::= (<case_label>+ <element_spec> “;”)+

<case_label> ::= “case” <const_exp> “:”
| “default” “:”

<element_spec> ::= <type_spec> <declarator>

Mapping in C++

Unions map to C++ classes with the following member functions
  1. accessors: which provide a read-only value
  2. mutators: which change the value; parameters are passed by value for small types and by constant reference for larger data types; they perform the equivalent of a deep copy and return nothing
  3. referents: read-write methods; available only for struct, union, sequence and any members
  4. union discriminator accessor and modifier - always named _d()
  5. Default modifier: if the union does not have a default case, then _default() is automatically generated
Like the structure mapping, this mapping fully manages the memory of its members. The copy constructor, assignment operator and destructor take care of this.

Salient notes about this mapping

Union instances should not be used without initialization, and no default initialization should be assumed. VisiBroker, however, sets the state to a legal default value.

When a union object is created, it needs to be initialized by calling one of the modifier methods. This causes the discriminator to be set to the discriminator value corresponding to that particular member.

Setting the discriminator to a value that selects a different member than the current one is illegal. In VisiBroker, a BAD_OPERATION exception is thrown.

When a different member modifier is called, the previous member is cleaned up.

union Foo switch (long) {
case 1: string str;
case 2:
case 3: long l;
default: char c;
};

C++ code
Foo f; // union not initialized
f._d(2); // illegal; throws BAD_OPERATION
f.str("Sandesh"); // initialized string member
f.l(20); // string member is cleaned up
f._d(2); // legal
f._d(3); // legal
f.c('A'); // default value chosen
f._d(4); // legal: 4 matches no case label, so it still selects the default member
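
The discriminator rules can be imitated without an ORB. The class below is a hand-written model of the Foo union above, not generated mapping code: std::logic_error stands in for BAD_OPERATION, the default member is given discriminator 0 as an arbitrary choice, and rejectsDiscriminator is a hypothetical helper.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

class Foo {
public:
    Foo() : init_(false), d_(0), l_(0), c_('\0') {}

    // Modifiers: clean up the previous member and set the discriminator.
    void str(const std::string& v) { reset(); str_ = v; d_ = 1; init_ = true; }
    void l(long v)                 { reset(); l_ = v;   d_ = 2; init_ = true; }
    void c(char v)                 { reset(); c_ = v;   d_ = 0; init_ = true; }

    // Accessors: only valid for the currently active member.
    const std::string& str() const { check(kStr); return str_; }
    long l() const { check(kLong); return l_; }
    char c() const { check(kChar); return c_; }

    long _d() const { return d_; }
    // The discriminator may only move between labels of the active member.
    void _d(long v) {
        if (!init_ || memberFor(v) != memberFor(d_))
            throw std::logic_error("BAD_OPERATION");
        d_ = v;
    }

private:
    enum Member { kStr, kLong, kChar };
    static Member memberFor(long d) {
        if (d == 1) return kStr;
        if (d == 2 || d == 3) return kLong;
        return kChar;  // every other value selects the default member
    }
    void check(Member m) const {
        if (!init_ || memberFor(d_) != m) throw std::logic_error("BAD_OPERATION");
    }
    void reset() { str_.clear(); l_ = 0; c_ = '\0'; }

    bool init_;
    long d_;
    std::string str_;
    long l_;
    char c_;
};

// Returns true if setting the discriminator to d is rejected.
inline bool rejectsDiscriminator(Foo& f, long d) {
    try {
        f._d(d);
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```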

Recursion

Recursion rules are similar to those of structures. Recursion is done through either anonymous sequences or sequences of incomplete types using forward declaration. If forward declarations are used, the incomplete constructs can only be used inside the union which defines the incomplete type. For more information see the post "Structure Declaration in IDL and its C++ mapping".

Memory Management of unions

Memory management rules are very similar to those of structures. As with structures, if a union has only fixed length members it is a fixed length union; otherwise it is a variable length union. For more information see the post "Structure Declaration in IDL and its C++ mapping".

Structure Declaration in IDL and its C++ mapping

CORBA IDL Series Continued (Part 4)

Declaration in IDL - Duplicated here for readability.

<struct_type> ::= “struct” <identifier> “{” <member_list> “}”

<member_list> ::= <member>+

<member> ::= <type_spec> <declarators> “;”

Recursion

IDL allows recursion of structures by having members which are sequences of the structure's own incomplete type. Note that a structure (or, for that matter, a union) is termed incomplete until its definition is completed with a closing "}". Recursion is allowed using the following techniques
  1. Anonymous sequence member (deprecated usage)
  2. Sequence using a forward declared struct type

The examples below illustrate the two -

struct Foo {
long value;
sequence<Foo> chain; // anonymous sequence member
};

struct Bar; // Forward declaration
typedef sequence<Bar> BarSequence; // sequence of incomplete type
struct Bar {
long value;
BarSequence chain; // recursion
};

Some rules regarding the above -
  • As said earlier, when structures are forward declared, they are termed incomplete as their definition has not yet been seen. It is a rule that the structure should be completed in the same IDL file. So, forward declaration of structures defined in other IDL files is not permitted.
  • When sequences are used for recursion in this way, the sequence should be a member of the structure definition that completes the incomplete struct contained in the recursion.
  • Incomplete types can only appear as the element type of a sequence definition, and such a sequence is called an Incomplete Sequence Type.
  • Incomplete Sequence Types can appear only as element types of another sequence or as a member of a structure or union.
Mutual recursion is not possible because of the above rules. For example, the snippet below is illegal -

struct Foo;
typedef sequence<Foo> FooSeq;
struct Bar {
FooSeq chain; // illegal as Foo is incomplete
};
struct Foo {
long l;
};

The same rules apply to unions as well.

Please note that incomplete structures and unions become complete once they have been defined. Hence any such sequence can also be used freely after the structure is defined.

Structure C++ Mapping

A structure is mapped to a C++ struct or class with a default constructor, copy constructor, assignment operator and destructor. All the members are publicly accessible. The memory management of the members is internal to the structure, and users of the structure don't have to worry about it. What this means is that the default constructor initializes all the data members appropriately, the copy constructor does a deep copy of the members (such as duplicating object references, copying string data etc.), the assignment operator first releases the current member memory and then does a deep copy, and the destructor deletes all member memory.

Fixed and Variable length structures

If a structure contains any variable length members such as an object reference or a string, then the structure is called a Variable Length Structure. This has implications when passing the structure through an operation.

For example consider the following IDL

struct Foo {
long l; // Fixed len
};
struct Bar {
string s; // Variable len
};
interface I {
Bar op1(in Bar, inout Bar, out Bar);
Foo op2(in Foo, inout Foo, out Foo);
};

When passing fixed length structures in and out, the memory needed is known in advance, so the caller is responsible for allocating it. For variable length structures, however, the caller allocates only the in and inout parameters, and the callee allocates the out parameters and the return value. So, the mapping for the two operations becomes -

class I ... {
...
virtual Bar* op1(const Bar&, Bar&, Bar*&); // Bar is variable length
virtual Foo op2(const Foo&, Foo&, Foo&); // Foo is fixed length
};

Var types

To hide the above complication, the mapping provides a helper class called the var class. For every structure T, a helper class T_var is also generated.

Some of the salient features of these var types are


  • pointer constructor owns the pointer
  • copy constructor deep copies the structure
  • assignment operator releases old and then deep copies
  • destructor deletes the contained structure.
  • Also hides the parameter passing complications.

With the _var type, the above operations can be declared consistently as follows -

virtual Bar_var op1(const Bar_var&, Bar_var&, Bar_var&); // Bar is variable length
virtual Foo_var op2(const Foo_var&, Foo_var&, Foo_var&); // Foo is fixed length

The T_var type also has in(), inout(), out() and _retn() functions. To work around C++ compilers which do not apply the implicit conversions correctly, we could use these functions explicitly when making calls. For example,

Foo_var f = ...;
f = obj->op2(f.in(), f.inout(), f.out());
return f._retn();
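
The ownership rules of a var type can be sketched with a small template. This is a simplified, hand-written model and not the generated T_var class: Var<T> is a hypothetical name, Bar is a plain C++ struct, and only the behaviours listed above are modelled.

```cpp
#include <cassert>
#include <string>

struct Bar {
    std::string s;
};

template <class T>
class Var {
public:
    Var() : p_(nullptr) {}
    Var(T* p) : p_(p) {}                                   // pointer constructor owns the pointer
    Var(const Var& o) : p_(o.p_ ? new T(*o.p_) : nullptr) {}  // deep copy
    Var& operator=(const Var& o) {                         // release old, then deep copy
        if (this != &o) {
            T* copy = o.p_ ? new T(*o.p_) : nullptr;
            delete p_;
            p_ = copy;
        }
        return *this;
    }
    ~Var() { delete p_; }                                  // deletes the contained structure

    T* operator->() { return p_; }
    const T& in() const { return *p_; }                    // read-only view for in parameters
    T& inout() { return *p_; }                             // modifiable view for inout parameters
    T*& out() { delete p_; p_ = nullptr; return p_; }      // frees old value, exposes the slot
    T* _retn() { T* t = p_; p_ = nullptr; return t; }      // gives up ownership

private:
    T* p_;
};

// Usage sketch.
inline std::string demoVar() {
    Var<Bar> a(new Bar());
    a->s = "original";
    Var<Bar> b(a);          // deep copy
    b->s = "copy";          // does not affect a
    Bar* raw = a._retn();   // a no longer owns the structure
    std::string result = raw->s + "/" + b.in().s;
    delete raw;
    return result;
}
```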

Type Declaration in IDL

CORBA IDL Series Continued (Part 3)

Apart from the basic types, interfaces and valuetypes, other customized types can also be created. This can be done either by

  1. Creating New Types
  2. Typedefining existing types to new types

Creating New Types

New types can be created either by using

  1. Structures
  2. Discriminated Unions
  3. Enumerations

Typedefining existing types to new types

By using the typedef keyword, existing types can be given new type names. This can be done by

  1. Simple type definition using existing base types to create new names
  2. Using templatized types such as sequence, string and wstring
  3. Arrays

General syntax


<type_dcl> ::= “typedef” <type_declarator>
| <struct_type>
| <union_type>
| <enum_type>
| <constr_forward_decl>


Structures, Unions and Enums:

<struct_type> ::= “struct” <identifier> “{” <member_list> “}”

<member_list> ::= <member>+

<member> ::= <type_spec> <declarators> “;”

<union_type> ::= “union” <identifier> “switch”
“(” <switch_type_spec> “)”
“{” <switch_body> “}”

<switch_type_spec> ::= <integer_type>
| <char_type>
| <boolean_type>
| <enum_type>

<switch_body> ::= <case>+

<case> ::= <case_label>+ <element_spec> “;”

<case_label> ::= “case” <const_exp> “:”
| “default” “:”

<element_spec> ::= <type_spec> <declarator>

<enum_type> ::= “enum” <identifier>
“{” <enumerator> { “,” <enumerator> }* “}”

<enumerator> ::= <identifier>

Using typedef keyword

<type_declarator> ::= <type_spec> <declarators>

<type_spec> ::= <simple_type_spec>
| <constr_type_spec>

<simple_type_spec> ::= <base_type_spec>
| <template_type_spec>

<base_type_spec> ::= <floating_pt_type>
| <integer_type>
| <char_type>
| <wide_char_type>
| <boolean_type>
| <octet_type>
| <any_type>
| <object_type>
| <value_base_type>

<template_type_spec> ::= <sequence_type>
| <string_type>
| <wide_string_type>

<constr_type_spec> ::= <struct_type>
| <union_type>
| <enum_type>

<declarators> ::= <declarator> { “,” <declarator> }*

<declarator> ::= <simple_declarator>
| <complex_declarator>

<simple_declarator> ::= <identifier>

<complex_declarator> ::= <array_declarator>

Tuesday, April 18, 2006

Interfaces, Local Interfaces, ValueTypes and Abstract Interfaces

CORBA IDL Series Continued (Part 2)

OMG IDL Interface:

Interfaces form the heart of the client-server contract. In IDL it is represented using the following EBNF -

<interface> ::= <interface_dcl> | <forward_dcl>

<interface_dcl> ::= <interface_header> “{” <interface_body> “}”

<forward_dcl> ::= [ “abstract” | “local” ] “interface” <identifier>

<interface_header> ::= [ “abstract” | “local” ] “interface”
<identifier> [ <interface_inheritance_spec> ]

<interface_body> ::= <export>*

<export> ::= <type_dcl> “;” | <const_dcl> “;” | <except_dcl> “;” | <attr_dcl> “;” | <op_dcl> “;”

An interface also forms a namespace for the identifiers scoped inside it. Apart from attribute and operation declarations, types, exceptions and constants can also be defined inside an interface namespace.

Interfaces support multiple inheritance.

Salient notes about Interfaces

Derived interfaces can redefine any identifier previously defined in the base interfaces, except for attributes and operations.

An interface cannot be a direct base of another interface more than once; however, it can be an indirect base multiple times. For example

interface A {...};
interface B : A {...};
interface C : A, B {...}; // okay
interface D : A, B, A {...}; // error

Derived interfaces can refer to identifiers defined in base interfaces by using the complete scope. The references obviously need to be unambiguous. Also, references to identifiers get bound to an interface when it is defined. The binding does not change if the identifier is later redefined. For example

const long L = 3;
interface A {
typedef float coord[L];
void f (in coord s); // s has three floats
};
interface B {
const long L = 4;
};
interface C: B, A { }; // f still takes float coord[3]

Local Interfaces:

These interfaces cannot be marshalled, that is, sent in a remote operation; hence they are local. Any type containing them becomes local as well. They can, however, be used in valuetype operations. Also, valuetypes can support a single local interface, as supporting an interface does not make the valuetype a subtype of the interface. Such valuetypes can then be marshalled.

The table below shows inheritance structures


---------------------------------------
       I    LI   AI   VT   BVT  AVT
---------------------------------------
I      M    -    M    -    -    -
LI     M    M    M    -    -    -
AI     -    -    M    -    -    -
VT     SS   SS   SM   S    -    M
AVT    SS   SS   SM   -    -    M
BVT    -    -    -    -    -    -
---------------------------------------
Rows are the inheriting type; columns are the type being inherited or supported.
I - Remote Interface
LI - Local Interface
AI - Abstract Interface
VT - Value Type
BVT - Boxed Value Type
AVT - Abstract Value Type
M - Can inherit multiple
S - Can inherit single
- - Cannot inherit or support
SS - Can support single
SM - Can support multiple

ValueTypes

In CORBA, objects are passed by reference. When an IDL interface is passed as a parameter, it is converted to a reference and passed to the peer. Any method call on the reference affects the original object. Sometimes, though, an object merely encapsulates data, and its methods just manipulate that data. When such objects are passed, they need to be copied so that peers can work independently on the encapsulated data. IDL interfaces do not cater for this requirement. IDL structs do allow the data to be copied between peers, but they expose the data directly and cannot encapsulate it behind operations.

To answer the above need, ValueTypes were introduced. ValueTypes are in a way a cross between interfaces and structs. While a valuetype allows method calls and inheritance, its value is copied over instead of a reference being passed. The peer end constructs the valuetype instance with the data that has come over the wire.
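
As a purely local analogy (no ORB is involved, and Counter, useByReference and useByValue are hypothetical names), the difference between the two passing styles resembles passing a C++ object by reference versus by value:

```cpp
#include <cassert>

// Data encapsulated behind operations, like a valuetype.
class Counter {
public:
    Counter() : n_(0) {}
    void increment() { ++n_; }
    int value() const { return n_; }
private:
    int n_;
};

// Interface-like semantics: the callee works on the caller's object.
void useByReference(Counter& c) { c.increment(); }

// Valuetype-like semantics: the callee works on its own copy.
void useByValue(Counter c) { c.increment(); }

// Usage sketch.
inline int demoSemantics() {
    Counter c;
    useByReference(c);  // the original is now 1
    useByValue(c);      // only the copy is incremented
    return c.value();
}
```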

Types of ValueTypes

ValueTypes could be either
  • concrete value types
  • abstract value types.
Concrete valuetypes are those that encapsulate data. Such valuetypes can obviously be instantiated. On the other hand, when a valuetype is a pure bundle of operations and contains no state, it is said to be abstract and cannot be instantiated.

A boxed valuetype is a special kind of concrete valuetype which does not have any operations and has only a single data member. Typically this is used if data structures such as strings or sequences need to be passed as null.

EBNF of ValueType

<value> ::= ( <value_dcl> | <value_abs_dcl> | <value_box_dcl> | <value_forward_dcl> )

Regular Value Type

<value_dcl> ::= <value_header> "{" <value_element>* "}"

<value_header> ::= [ "custom" ] "valuetype" <identifier> [ <value_inheritance_spec> ]

<value_element> ::= <export> | <state_member> | <init_dcl>

<value_inheritance_spec> ::= [ ":" [ "truncatable" ] <value_name> { "," <value_name> }* ] [ "supports" <interface_name> { "," <interface_name> }* ]

<value_name> ::= <scoped_name>

State declaration

<state_member> ::= ( "public" | "private" ) <type_spec> <declarators> ";"

Important characteristics of ValueTypes
  • ValueTypes are local. They are implemented and reside locally. When a valuetype is sent across the wire, only its state is really sent and the object is recreated at the peer end using the state marshalled across.
  • Valuetypes can singly inherit from a concrete valuetype and multiply inherit from abstract valuetypes. I presume the restriction of single inheritance from a concrete valuetype is there to avoid the issues caused by multiple implementation inheritance. Valuetypes can also support a single IDL interface or a local interface. Furthermore, they can support multiple abstract interfaces.
  • Sharing semantics - Valuetype instances can be shared between other valuetype instances. This preserves the relationships between instances when these objects are marshalled over.
  • Null semantics - strings, structs, sequences and other data types in IDL do not support passing a null value. However, this can be done for a ValueType object.
  • Copy semantics of the valuetype are only guaranteed when the valuetype instances are part of an IDL interface method signature. If a valuetype instance is used in a language specific function call or as part of a valuetype operation, then the instance is passed using the reference semantics of the programming language.
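Sharing and null semantics are easier to see with a small object graph. The following is a Python analogy only: `Node` is a hypothetical class, and a single `copy.deepcopy` call plays the role of marshalling one set of parameters across to the peer.

```python
import copy


class Node:
    """Hypothetical valuetype-like object that may refer to another one."""

    def __init__(self, name, target=None):
        self.name = name
        self.target = target  # None plays the role of a null valuetype


shared = Node("shared")
left = Node("left", shared)
right = Node("right", shared)

# Copy both objects in one operation, as one marshalled call would.
left_copy, right_copy = copy.deepcopy((left, right))

# The copies still point at one common object, but not at the original.
still_shared = left_copy.target is right_copy.target
detached = left_copy.target is not shared
```

Had the two objects been copied in two separate operations, each copy would have received its own private `shared` instance and the relationship would have been lost.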

Inheritance in ValueTypes

  • Abstract valuetypes can inherit from any number of abstract valuetypes and support any number of abstract interfaces. They can additionally support a single interface. Abstract valuetypes obviously cannot inherit from concrete valuetypes.
  • Concrete valuetypes can only inherit from a single concrete valuetype. They can, however, inherit from multiple abstract valuetypes and support multiple abstract interfaces. They can additionally support a single interface. An interesting problem arises when concrete valuetypes inherit from other concrete valuetypes. When such a valuetype is marshalled over, it could so happen that the receiving peer only has information up to one of the parent valuetypes. Declaring the valuetype truncatable allows the receiving side to truncate the value to that parent valuetype.

Substitutability

When a valuetype instance is to be passed into a CORBA operation that takes -

1) Interface reference - When a valuetype supports an interface, it is not naturally a subtype of the interface. It cannot be passed in a situation where the parameter type is the base interface. In this case, the valuetype needs to be converted to the interface reference type through special language mapping mechanisms before it can be passed.

2) Abstract Interface - The derived valuetype instance is a subtype of the abstract interface and hence can be passed.

3) ValueType - Yes, it can be passed. If the receiving peer has the same version of the valuetype implementation, there is no problem. If not, it tries to (1) load the implementation if possible, (2) truncate to a base type if the valuetype is truncatable, or (3) throw NO_IMPLEMENT.
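The order of those three fallbacks can be sketched as a small decision function. This is a simplified model in Python and not an ORB API: `resolve_valuetype`, its parameters and the `NoImplement` exception are all hypothetical names, and a real ORB works with repository ids and value factories rather than plain strings.

```python
class NoImplement(Exception):
    """Stands in for the CORBA NO_IMPLEMENT system exception."""


def resolve_valuetype(type_chain, known, loadable, truncatable):
    """Pick the type a receiving peer would end up instantiating.

    type_chain  -- type names from most derived to base, as sent by the peer
    known       -- names the receiver already has implementations for
    loadable    -- names the receiver could load on demand
    truncatable -- whether the sender declared the valuetype truncatable
    """
    most_derived = type_chain[0]
    if most_derived in known:
        return most_derived            # same implementation is available
    if most_derived in loadable:
        return most_derived            # (1) load the implementation
    if truncatable:
        for base in type_chain[1:]:    # (2) truncate to the nearest known base
            if base in known:
                return base
    raise NoImplement(most_derived)    # (3) nothing usable was found


chain = ["Derived", "Middle", "Base"]
truncated_to = resolve_valuetype(chain, {"Middle", "Base"}, set(), True)
```

With the same inputs but `truncatable` set to `False`, the function raises `NoImplement` instead of returning a base type.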

CORBA IDL Overview

CORBA IDL Series (Part 1)

CORBA IDL is the declarative language in which interface constructs for CORBA are specified. This series of notes documents the basic IDL constructs and their C++ mappings.

EBNF

Note on EBNF: The full grammar is documented at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/Axapta/Appendix_about_EBNF/LANG_EBNF_grammar.asp.

In short the notations mean the following
* -> 0 or more
+ -> 1 or more
() -> Parentheses. Hold the symbols (terminals and non-terminals) in the parentheses brackets together. Can be placed anywhere at the right hand side in a production rule.
[] -> 0 or 1; Optional. The items between [ and ] are optional. All or none of the items in the brackets are included. This can be expressed as none or one instance.
{} -> group of one syntactic unit; Repeat. The items between { and } are optional, but can be repeated as many times as necessary. This can be expressed as none or more instances.
"test" -> literal
<> -> non terminal: can be expanded to one or more terminal symbols
| -> alternatively; Or. Either all items on one side of the | or all items on the other side
::= -> Is defined to be

IDL consists of the following definitions -

  1. module
  2. interfaces
  3. valuetypes
  4. exceptions
  5. constants
  6. types

module

The module construct basically provides namespace semantics and helps in giving scope to identifiers. It is represented in EBNF as follows -

<module> ::= "module" <identifier> "{" <definition>+ "}"
<definition> ::= <type_dcl> ";"
    | <const_dcl> ";"
    | <except_dcl> ";"
    | <interface> ";"
    | <module> ";"
    | <value> ";"


A module is mapped in C++ to a namespace, or to a class on compilers that do not support namespaces.

Sunday, April 16, 2006

Unicode Character Encoding Model

Character Encoding Series (Part2)

Unicode is an Open Character Repertoire. Until 2000, it had fewer than 65,000 characters in its repertoire. But with the inclusion of characters from China, the number of characters is now more than 90,000.

The Universal Character Set (UCS) that Unicode uses can represent hundreds of thousands of abstract characters. The numerical value assigned to a character is called its CODE POINT. With the inclusion of the Chinese characters, it takes 21 bits to represent all the Code Points.
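The 21 bit figure can be checked with a little arithmetic. A minimal sketch, assuming the upper limit of the code space is 0x10FFFF (a figure taken from the Unicode standard, not from the text above):

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # assumed upper limit of the Unicode code space

# Number of bits needed to write the largest code point in binary.
bits_needed = MAX_CODE_POINT.bit_length()

# Python 3 strings use the same limit for their own characters.
python_limit = sys.maxunicode
```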

Unicode has two types of Code Set Encoding - (1) UCS Encoding (2) UTF Encoding. Some of the encodings in use are UCS2, UTF16, UTF8, UCS4 and UTF32.

UCS2 - With the original Unicode Repertoire, when the character set had fewer than 65,000 characters, all the Code Points could be mapped to 16 bit values. UCS2 mapped all the characters to fixed 16 bit values ranging from 0x0000 to 0xFFFF. A part of this range, the 2048 values from 0xD800 to 0xDBFF and 0xDC00 to 0xDFFF, was reserved for future expansion. This 16 bit value is known as a CODE UNIT in the Unicode terminology.

UTF16 - With the inclusion of the Chinese characters, UCS2 no longer sufficed to encode all the Code Points. The encoding was extended by adding CODE PLANES; in total, 17 planes make up the code space that UTF16 covers. The original UCS2 range forms what is called the BASIC MULTILINGUAL PLANE (BMP), and every character in it is encoded as a single 16 bit value. The further 16 planes are called SUPPLEMENTARY PLANES. Their characters are encoded as pairs of Code Units called SURROGATE PAIRS, with the first unit in the range 0xD800 to 0xDBFF and the second unit in the range 0xDC00 to 0xDFFF. So a character in UTF16 is either 16 bits or 32 bits wide. For transmission and storage of UTF16 text, the BOM (Byte Order Marker) Code Point is used. This is a special character with value 0xFEFF (also known as zero width no-break space) placed at the start of the data. If its bytes arrive as FE FF, the 16 bit values are BIG ENDIAN; if they arrive as FF FE, the values are LITTLE ENDIAN.
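The way a Code Point outside the BMP turns into two Code Units is simple bit arithmetic. Here is a small sketch in Python; the function name `to_surrogate_pair` is my own, and U+1D11E (the musical G clef) is just a convenient example character.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into two 16 bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # leaves a 20 bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the second unit
    return high, low


high, low = to_surrogate_pair(0x1D11E)

# Cross-check against Python's own UTF-16 encoder (big endian, no BOM).
encoded = "\U0001D11E".encode("utf-16-be")
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

The two ranges hold 1024 values each, so a pair can address 1024 * 1024 Code Points, which is exactly the size of 16 planes.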

UTF8 - In this encoding scheme, a Unicode Code Point is encoded in 8 to 32 bits, as a sequence of one to four octets. The advantages are that it is relatively compact and ASCII compatible. UTF-8 text files can also begin with a BOM to indicate that the contents are Unicode text.
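The variable length is easy to observe with Python's built-in codecs. The sample characters below are my own choice, picked so that each one needs a different number of octets.

```python
import codecs

samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]  # U+0041, U+00E9, U+20AC, U+1D11E

# Number of octets each character occupies in UTF-8.
octets = [len(ch.encode("utf-8")) for ch in samples]

# ASCII text is byte for byte identical in UTF-8.
ascii_compatible = "IDL".encode("utf-8") == "IDL".encode("ascii")

# The "utf-8-sig" codec writes the UTF-8 form of the BOM in front of the text.
starts_with_bom = "A".encode("utf-8-sig").startswith(codecs.BOM_UTF8)
```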

UCS4/UTF32 - Here every character is represented as a fixed 32 bit value. This is not very frequently used.

Character Encoding Model

Character Encoding Series (Part1)

Character Encoding Model - The Encoding Model has 4 constituents - (1) Character Repertoire (2) Coded Character Set (3) Character Encoding Form (4) Character Encoding Scheme

Character Repertoire - This represents the characters that are available in the model. There are two types of repertoires: (1) Closed Repertoires, where the characters are fixed and the repertoire can't be added to (for example, ASCII), and (2) Open Repertoires, where the repertoire is extensible (for example, Unicode and Windows Code Pages).

Coded Character Set - This assigns a numerical value to each character in the character repertoire. The same character can have different numerical values in different models. In Unicode terminology, this value is called a CODE POINT and is written with the prefix U+. For example, U+0041 is the character 'A'.
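Python exposes Code Points directly through `ord` and `chr`, which makes for a quick illustration. The helper `u_notation` is a hypothetical name of my own, and Latin-1 and code page 437 are simply two character sets picked to show that the same character can get different numbers.

```python
def u_notation(ch):
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


label = u_notation("A")
round_trip = chr(0x0041)

# The same character, numbered differently by two other coded character sets.
in_latin1 = "\u00e9".encode("latin-1")[0]
in_cp437 = "\u00e9".encode("cp437")[0]
```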

Character Encoding Form - This specifies how the numerical value of a character is converted to fixed bit width values called CODE VALUES (CODE UNITS in Unicode terminology) for the purpose of manipulation by computers. This conversion can be as simple as in the case of ASCII, where the ASCII codes are mapped directly to 8 bit values, or as complex as in the case of Unicode, where multiple conversions are possible, such as UCS2, UTF16, UTF8, UCS4 and UTF32.
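To make the idea of an encoding form concrete, the sketch below counts how many fixed width values one character needs in three Unicode forms. `code_units` is a hypothetical helper of my own, and the big endian codecs are used only so that no BOM gets counted.

```python
def code_units(ch, codec, unit_bytes):
    """Count the fixed width values that a codec produces for one character."""
    return len(ch.encode(codec)) // unit_bytes


ch = "\U0001D11E"  # one code point outside the BMP

units = {
    "UTF8": code_units(ch, "utf-8", 1),       # 8 bit values
    "UTF16": code_units(ch, "utf-16-be", 2),  # 16 bit values
    "UTF32": code_units(ch, "utf-32-be", 4),  # 32 bit values
}
```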

Character Encoding Scheme - This specifies how the code values are serialized into bytes for the purpose of storage and transmission. It covers details such as the BOM (Byte Order Marker) for UTF-16.
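The effect of the encoding scheme shows up as soon as a 16 bit value has to be written out as bytes. A short Python sketch using the standard `codecs` constants; the variable names are my own.

```python
import codecs

text = "A"  # code point U+0041, a single 16 bit value 0x0041

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first

# Put the BOM in front so that a reader can tell the two layouts apart.
with_bom_be = codecs.BOM_UTF16_BE + big
with_bom_le = codecs.BOM_UTF16_LE + little

# The generic "utf-16" decoder inspects the BOM and removes it.
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")
```

Both byte sequences decode to the same text, even though every pair of bytes is swapped between them.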