Project Description

Generic Data Generator is a highly configurable and extensible data generator. It allows the user to fully customize the generated data to suit the specific domain while at the same time supporting multiple data stores.

During software development, it is common place to generate test data which can be utilised to validate the correct operation of the specific software. Such test data may be required for load tests, integration tests or even user acceptance tests. As such, the content of the test data may vary depending on the intent of the test. One common characteristic of such test data is that it must satisfy the constraints defined on the data store. This therefore implies that the scripts used for the test data generation must be continuously updated when the data store changes. In addition, it is often not trivia to utilise test data generated for one system in another system. Combining this with the fact that developing the test data generation script often require some knowledge of writing Structured Query Language (SQL), the enormity of this task becomes obvious.

To this end, we decided to develop a generic solution that will abstract the complexity of the underlying data store in such a manner that the user defines the test data in a pseudo-business language without having to know anything about SQL or any data access technology. Through this approach, the tool can be utilised within any software project and against any data store.

Configuration

In order for the tool to be reused in any software project and against and database, the user must be able to instruct the tool exactly what sort of data is required and which database to persist the generated data. To capture this information, the tool provides two interfaces through which the user can interact with the tool. The first is via a configuration file identifying environmental settings such as database connection strings, the type of database (e.g. oracle or MS SQL Server.
The second configuration interface affords the user the opportunity to specify exactly the structure and the content of data to be generated. In order to maintain the database schema agnostic nature of the tool, domain specific language (DSL) was developed. This DSL is based on the relational model and utilises the concept of Entity to represent a database table and Property to represent a column within a database table. Each table specified to the tool must contain at least one Property. In turn, the each property will specify a number of attributes through which the tool can identify how to generate the data for that property. To the user, this DSL is exposed as an XML file which must satisfy the following schema:

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="DataLoad">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" minOccurs="1" name="Pattern">
          <xs:complexType>
            <xs:sequence>
              <xs:element maxOccurs="unbounded" minOccurs="1" name="Entity">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="Property" maxOccurs="unbounded" minOccurs="1">
                      <xs:complexType>
                        <xs:attribute name="propertyName" type="xs:string" use="required" />
                        <xs:attribute name="type" type="xs:string" use="required" />
                        <xs:attribute name="key" type="xs:boolean" use="optional" />
                        <xs:attribute name="references" type="xs:string" use="optional" />
                        <xs:attribute name="hint" type="xs:string" use="optional" />
                        <xs:attribute name="nullable" type="xs:boolean" use="optional" />
                        <xs:attribute name="default" type="xs:string" use="optional" />
                        <xs:attribute name="actualValue" type="xs:string" use="optional" />
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute name="entityName" type="xs:string" use="required" />
                  <xs:attribute name="tableName" type="xs:string" use="required" />
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="patternName" type="xs:string" use="required" />
            <xs:attribute name="quantity" type="xs:unsignedShort" use="required" />
            <xs:attribute name="insertIntoDb" type="xs:boolean" use="required" />
            <xs:attribute name="generateCSV" type="xs:boolean" use="required" />
            <xs:attribute name="persistSQL" type="xs:boolean" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="cleanDB" type="xs:boolean" use="required" />
    </xs:complexType>
  </xs:element>
</xs:schema>

The XML must contain a root node “DataLoad” with an attribute cleanDB=”false”. The DataLoad node must contain 1 to many “Pattern”. A Pattern node must have the following attributes:
  • patternName: The unique name of the pattern.
  • quantity: The number of times the content of this pattern will be processed. This value must be an integer
  • insertIntoDb: Indicates if the generated data should be inserted in the specified database.
  • generateCSV: Indicates if the tool should generate a CSV version of the generated data.
  • persistSQL: Indicates if the tool should save the SQL derived from the data generation in the file system. The user is required to specify the folder to save these files to.
Each pattern node must contain 1 to many “Entity” nodes. An Entity node corresponds to a database table and the user must indicate the unique name of the entity within the pattern and the table which the Entity points to. As shown below:

<Entity  entityName="PARTIES"  tableName="PARTIES"  identityInsert="false">

It should be noted here that multiple Entity nodes can point to the same table within the same Pattern node provided the entityName attributes are unique. The last attribute of the entity node is identityInsert. This attribute should only be set to true if the database is Microsoft SQL Server and the table which the entity node points to has an identity key. Otherwise, this attribute should be left as false.

Each Entity node contains 1 to many Property nodes. The Property node corresponds to a column within a database table and contains the following attributes:
  • propertyName: This identifies the column within the Entity specified database table that this property refers to.
  • type: This indicate the type of the property. Valid values are integer and float for numbers; DateTime for dates and timestamp; string for characters. This property is mandatory.
  • key: This indicates whether this property is the key property for this table. Key properties are unique. For the current release, only numeric keys are recommended. This attribute can either be true or false.
  • references: This attributes instructs the tool that the property value should be obtained from another Entity within this pattern. The corresponding Entity’s entityName must exactly match this attribute value. This attribute is optional. It will only be processed if the value is not null or empty string.
  • hint: This hint attribute allows the user to control how the data is generated for this Property. The user has a significant amount of leverage to shape the output data through this attribute. Current implementation supports the following hints: telephone (generates valid telephone numbers), emailaddress (generates valid email addresses), surname (generates real human names), firstname (generates real human names), max-500 (generates a string not longer that the value specified in this case 500), unique-2(generates a unique string not longer that the value specified in this case 2). This is also an extensibility point for the tool through which custom processing can be added to the tool. A regex hint is currently been developed as well.
  • nullable: This attribute is currently not processed.
  • default : This attribute is currently not processed.
  • actualValue: This attribute provides an avenue for the user to override completely any processing for this Property node. The tool will utilise the specified value literally. So if the value is a string, please enclose it with single quotes as shown below: actualValue="'SomeValue'"

As part of the environment configuration, the user is required to specify the location of this XML file.
A sample XML (data load) is displayed below:

<?xml version="1.0" encoding="utf-8" ?>
<DataLoad cleanDB="false">
  <Pattern patternName="Pattern1" quantity="10" insertIntoDb="true"  
           generateCSV="false"  persistSQL="true">
    <Entity  entityName="PARTIES"  tableName="PARTIES">
      <Property propertyName="PARTY_KEY"  type="integer" key="true"  
               references="" hint="" nullable="false" default="" actualValue="" />
      <Property propertyName="PARTIES2_KEY"  type="integer" key="false"  
                references="PARTIES2" hint="" nullable="false" default="" actualValue="" />
      <Property propertyName="DATE_CREATED"  type="DateTime" key="false"  
                references="" hint="" nullable="false" default=" " actualValue="" />
    </Entity>   
    <Entity  entityName="PARTIES2"  tableName="PARTIES2">
      <Property propertyName="PARTY_KEY"  type="integer" key="true"  
                references="" hint="" nullable="false" default="" actualValue="" />
    </Entity>   
  </Pattern>
</DataLoad>

Note: XML is case sensitive.

Operation

The tool reads and validates the data load. On successful validation, the tool processes each Pattern node defined within the DataLoad node. The references attribute of the Property node is utilised to work out the dependency hierarchy for all entities within the Pattern. Based on this, data for parent entities are generated first and then that of their dependents. Each Pattern processing is accomplished within a transaction and if an error occurs, the entire pattern data is not saved to the database. The processing applied to the pattern is repeated for the number specified in the quantity attribute of the Pattern node. For each time the Pattern processing is repeated, it should be noted that the data will be different and the database constraints within the database identified via the DataLoad will be maintained.
The tool then wraps the generated data within relevant SQL instructions adapted to the specified database. This SQL instruction is then executed on database.

Last edited Aug 30, 2011 at 7:54 PM by ekhor, version 15