What is Test Data? Test Data Preparation Techniques with Example

Learn what is Test Data and How to Prepare Test Data for Testing:

In the current era of rapid growth in information technology, testers commonly consume large amounts of test data throughout the software testing life cycle.

Testers not only collect and maintain data from existing sources, but also generate huge volumes of test data to ensure the quality of the product delivered for real-world use.

Therefore, we as testers must continuously explore, learn, and apply the most efficient approaches to data collection, generation, maintenance, automation, and comprehensive data management for any type of functional and non-functional testing.

What is Test data - Test Data Preparation

In this tutorial, I will provide tips on how to prepare test data so that no important test case is missed because of improper data or an incomplete test environment setup.

What is Test Data and Why It’s Important

According to a study conducted by IBM in 2016, searching for, managing, maintaining, and generating test data takes up 30%-60% of testers’ time. This suggests that data preparation is a time-consuming phase of software testing.

Figure 1: Testers Average Time Spent on TDM

Similarly, in many other disciplines, most data scientists reportedly spend 50%-80% of their model development time organizing data. Add legislation and the handling of Personally Identifiable Information (PII), and the testers’ involvement in data preparation becomes even more demanding.

Today, the credibility and reliability of test data are non-negotiable for business owners. Product owners see ghost copies of test data as the biggest challenge, because they reduce the reliability of an application at a time when clients demand ever more quality assurance.

Given the significance of test data, the vast majority of software owners do not accept applications that were tested with fake data or with weak security measures.

At this point, let us recall what test data is. When we write test cases to verify and validate the features and scenarios of the application under test, we need information to use as input so that the tests can identify and locate defects.

This information needs to be precise and complete to expose the bugs. It is what we call test data. Some of it, such as names and countries, is not sensitive, whereas contact information, SSNs, medical history, and credit card information are sensitive in nature.

The data may be in any form like:

  • System test data
  • SQL test data
  • Performance test data
  • XML test data

If you are writing test cases, you need input data for every kind of test. The tester may provide this input data when executing the test cases, or the application may pick the required input data from predefined data locations.

The data may be any kind of input to the application, any kind of file that is loaded by the application or entries read from the database tables.

Preparing proper input data is part of a test setup. Testers generally call this testbed preparation. In a testbed, all software and hardware requirements are set using predefined data values.

If you don’t have a systematic approach to building data while writing and executing test cases, you are likely to miss some important test cases. Testers can create their own data according to their testing needs.

Don’t rely on the data created by other testers or standard production data. Always create a fresh set of data according to your requirements.

Sometimes it’s not possible to create a completely new set of data for each and every build. In such cases, you can use standard production data. But remember to add your own data sets to this existing database. One good way to create data is to take the existing sample data or testbed and append your new test case data each time you get the same module for testing. This way you can build a comprehensive data set over time.

Test Data Sourcing Challenges

One area that testers consider in test data generation is the sourcing requirement for a subset of data. For instance, you have over one million customers and need one thousand of them for testing. This sample data should be consistent and should statistically represent the distribution of the targeted group. In other words, we need to find the right people to test with, which is one of the most useful methods of testing the use cases.
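
As a rough illustration of drawing a representative subset, here is a minimal Python sketch of proportional (stratified) sampling. The `segment` field, the group sizes, and the smaller totals (10,000 records, a sample of 100) are assumptions made for this example only; a real project would sample from its own customer table.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, sample_size, seed=0):
    """Pick about sample_size records, keeping each group's share of the total."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    total = len(records)
    sample = []
    for members in groups.values():
        # Proportional allocation; every non-empty group gets at least one record.
        quota = max(1, round(sample_size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical customer base: 70% "retail", 20% "business", 10% "premium".
customers = (
    [{"id": i, "segment": "retail"} for i in range(7000)]
    + [{"id": i, "segment": "business"} for i in range(7000, 9000)]
    + [{"id": i, "segment": "premium"} for i in range(9000, 10000)]
)
subset = stratified_sample(customers, key="segment", sample_size=100)
```

Because of rounding, the sample can be slightly larger or smaller than requested when the group shares do not divide evenly.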

Additionally, there are some environmental constraints in the process. One of them is mapping PII policies. As privacy is a significant obstacle, the testers need to classify PII data.

Test Data Management tools are designed to address this issue. These tools suggest policies based on the standards or catalogs they have. This is not an entirely safe exercise, though it still offers the opportunity to audit what one is doing.

To keep up with current and future challenges, we should always ask questions such as: When and where should we start TDM? What should be automated? How much should companies invest in testing, both in ongoing skills development for staff and in newer TDM tools? Should we start with functional or with non-functional testing? And many more questions like these.

Some of the most common challenges of Test Data Sourcing are mentioned below:

  • The teams may not have adequate knowledge of, and skills in, test data generator tools
  • Test data coverage is often incomplete
  • Lack of clarity in data requirements, including volume specifications, during the gathering phase
  • Testing teams do not have access to the data sources
  • Delays by developers in giving testers access to production data
  • Production environment data may not be fully usable for testing based on the developed business scenarios
  • Large volumes of data may be needed in a short period of time
  • Data dependencies/combinations needed to test some of the business scenarios
  • The testers spend more time than required communicating with architects, database administrators and BAs to gather data
  • Mostly the data is created or prepared during the execution of the test
  • Multiple applications and data versions
  • Continuous release cycles across several applications
  • Legislation protecting Personally Identifiable Information (PII)

On the white box side of data testing, the developers prepare the production data. This is where QA needs to touch base with the developers to extend the testing coverage of the application under test (AUT). One of the biggest challenges is to incorporate all possible scenarios (100% test case coverage), including every possible negative case.

In this section, we talked about test data challenges. You can add more challenges that you have encountered and resolved. Next, let’s explore different approaches to test data design and management.

Strategies for Test Data Preparation

We know from everyday practice that players in the testing industry continuously try different ways and means of enhancing their testing efforts and, most importantly, their cost efficiency. In the short course of information technology evolution, we have seen that when tools are incorporated into production and testing environments, the level of output increases substantially.

The completeness and coverage of testing depend mainly on the quality of the data. As testing is the backbone of software quality, test data is the core element in the process of testing.

Figure 2: Strategies for Test Data Management (TDM)

One approach is the creation of flat files based on mapping rules. It is always practical to create a subset of the data you need from the production environment where developers designed and coded the application. This approach reduces the testers’ data preparation effort and makes maximum use of existing resources, avoiding further expenditure.

Typically, we need to create the data, or at least identify it, at the very beginning, based on the type of requirements each project has.

We can apply the following strategies to handle the process of TDM:

  1. Data from the production environment
  2. SQL queries that extract data from the client’s existing databases
  3. Automated Data Generation Tools

Testers should back up their testing with complete data by considering the elements shown in Figure 3. Testers in agile development teams generate the data necessary for executing their test cases. By test cases, we mean cases for various types of testing, such as white box, black box, performance, and security testing.

Data for performance testing should make it possible to determine how fast the system responds under a given workload, so it should be very close to real or live data: large in volume and with significant coverage.

For white box testing, the developers prepare the data they need to cover as many branches as possible, all paths in the program source code, and negative Application Program Interface (API) cases.

Figure 3: Test Data Generation Activities

Ultimately, everybody working in the software development life cycle (SDLC), such as BAs, developers, and product owners, should be engaged in the process of test data preparation. It can be a joint effort. Now let us turn to the issue of corrupted test data.

Corrupted Test Data

Before executing any test cases on existing data, we should make sure that the data is not corrupted or outdated and that the application under test can read the data source. Typically, when more than one tester works on different modules of an AUT in the same testing environment at the same time, the chance of data getting corrupted is high.

In the same environment, testers modify the existing data according to the needs of their test cases. Mostly, when testers are done with the data, they leave it as it is. When the next tester picks up the modified data and executes another test, that test may fail for reasons that have nothing to do with a code error or defect.

In most cases, this is how data becomes corrupted or outdated, which leads to failures. To avoid or minimize data discrepancies, we can apply the solutions below. And of course, you can add more solutions in the comments section at the end of this tutorial.

  1. Having the backup of your data
  2. Return your modified data to its original state
  3. Data division among the testers
  4. Keep the data warehouse administrator updated for any data change/modification
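
The first two solutions above (keep a backup, return modified data to its original state) can be sketched in a few lines of Python. This is only an illustration using an in-memory SQLite database; the `patients` table and the `restored_afterwards` helper are hypothetical names invented for this example, and a real shared environment would more likely rely on database-level backups or transactions.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def restored_afterwards(conn, table):
    """Snapshot a table, let the test modify it, then put the rows back."""
    # NOTE: the table name comes from trusted test code, never from user input.
    backup = conn.execute(f"SELECT * FROM {table}").fetchall()
    try:
        yield conn
    finally:
        conn.execute(f"DELETE FROM {table}")
        if backup:
            placeholders = ", ".join("?" * len(backup[0]))
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", backup)
        conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
conn.commit()

# The test is free to change shared data inside the block...
with restored_afterwards(conn, "patients") as db:
    db.execute("UPDATE patients SET name = 'Changed' WHERE id = 1")
    db.execute("DELETE FROM patients WHERE id = 2")

# ...and the next tester still finds the original rows.
rows = conn.execute("SELECT * FROM patients ORDER BY id").fetchall()
```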

How to keep your data intact in any test environment?

Most of the time, many testers are responsible for testing the same build. In this case, more than one tester has access to common data, and each will try to manipulate the common data set according to their own needs.

If you have prepared data for some specific modules, the best way to keep your data set intact is to keep backup copies of it.

Test Data for the Performance Test Case

Performance tests require a very large data set. Sometimes data created manually will not expose subtle bugs that can only be caught with actual data created by the application under test. If you want real-time data, which is impossible to create manually, ask your lead or manager to make it available from the live environment.

This data is useful for ensuring that the application functions smoothly for all valid inputs.

What is the ideal test data?

Test data can be called ideal if the smallest possible data set identifies all the application errors. Try to prepare data that covers all application functionality without exceeding the cost and time constraints for preparing data and running tests.

How to Prepare Data that will Ensure Maximum Test Coverage?

Design your data considering the following categories:

1) No data: Run your test cases on blank or default data. See if proper error messages are generated.

2) Valid data set: Create it to check whether the application functions as per requirements and whether valid input data is properly saved in the database or files.

3) Invalid data set: Prepare an invalid data set to check application behavior for negative values and alphanumeric string inputs.

4) Illegal data format: Make one data set with an illegal data format. The system should not accept data in an invalid or illegal format. Also, check that proper error messages are generated.

5) Boundary condition data set: A data set containing out-of-range data. Identify application boundary cases and prepare a data set that covers the lower as well as the upper boundary conditions.

6) Data set for performance, load and stress testing: This data set should be large in volume.

Creating separate data sets for each test condition in this way helps ensure complete test coverage.
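
To make the categories concrete, here is a small Python sketch that organizes test values for a single input field by category. The field (an integer "age" from 18 to 60) and the `is_accepted` reference validator are assumptions for illustration only; substitute the rules from your own requirements.

```python
# Hypothetical rule for illustration: "age" must be an integer from 18 to 60.
MIN_AGE, MAX_AGE = 18, 60

age_test_data = {
    "no_data": [None, ""],
    "valid": [18, 35, 60],
    "invalid": [-5, 0, "abc123"],
    "illegal_format": ["18.5.1", "twenty", "1 8"],
    "boundary": [MIN_AGE - 1, MIN_AGE, MIN_AGE + 1, MAX_AGE - 1, MAX_AGE, MAX_AGE + 1],
}


def is_accepted(value):
    """Reference validator used to label which inputs the form should accept."""
    return (
        isinstance(value, int)
        and not isinstance(value, bool)
        and MIN_AGE <= value <= MAX_AGE
    )


# List every value with the result the application is expected to produce.
for category, values in age_test_data.items():
    for value in values:
        expected = "accept" if is_accepted(value) else "reject with error message"
        print(f"{category:15} {value!r:10} -> {expected}")
```

The large-volume data set for performance, load and stress testing is left out here because it is usually generated rather than listed by hand.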

Data for Black Box Testing

Quality assurance testers perform integration testing, system testing, and acceptance testing, which are known as black box testing. In this method of testing, the testers do not work with the internal structure, design, or code of the application under test.

The testers’ primary purpose is to identify and locate errors. To do so, we apply either functional or non-functional testing using different techniques of black box testing.

Figure 4: Black Box Data Design Methods

At this point, the testers need test data as input for applying the techniques of black box testing. The testers should prepare data that examines all application functionality without exceeding the given cost and time.

We can design the data for our test cases by considering data set categories such as no data, valid data, invalid data, illegal data format, boundary condition data, equivalence partition, decision table data, state transition data, and use case data. Before going into the data set categories, the testers gather and analyze data from the existing resources of the application under test (AUT).

In line with the earlier points about keeping your data warehouse up to date, you should document the data requirements at the test-case level and mark them reusable or non-reusable when you script your test cases. This ensures that the data required for testing is clear and documented from the very beginning, so that you can reference it later.

Test Data Example for Open EMR AUT

For our current tutorial, we have the Open EMR as the Application Under Test (AUT).

=> Please find the link for Open EMR application here for your reference/practice.

The table below illustrates a sample of the data requirement gathering that can be part of the test case documentation and is updated as you write the test cases for your test scenarios.

test data requirement gathering 1

Creation of manual data for testing Open EMR application

Let’s move on to creating manual data for testing the Open EMR application for the given data set categories.

1) No Data: The tester validates the Open EMR application URL and the “Search or Add Patient” function with no data.

2) Valid Data: The tester validates the Open EMR application URL and the “Search or Add Patient” function with valid data.

3) Invalid Data: The tester validates the Open EMR application URL and the “Search or Add Patient” function with invalid data.

4) Illegal Data Format: The tester validates the Open EMR application URL and the “Search or Add Patient” function with data in an illegal format.

Test Data for 1-4 data set categories:

Test data samples 2

5) Boundary Condition Data Set: It determines input values at the boundaries, either just inside or just outside the given range of values.

6) Equivalence Partition Data Set: It is the testing technique that divides your input data into partitions of valid and invalid input values.

Test data for the 5th and 6th data set categories, for the Open EMR username and password:

  Open EMR test data 3
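
For readers who prefer to derive such values programmatically, the sketch below shows boundary value and equivalence partition data for a password length rule. The limits (8 to 20 characters) are hypothetical and are not taken from Open EMR; the helper names are also made up for this example.

```python
# Assumption for illustration only: a password must be 8 to 20 characters long.
# These limits are hypothetical and are not taken from Open EMR.
MIN_LEN, MAX_LEN = 8, 20


def boundary_lengths(low, high):
    """Lengths just outside, on, and just inside each boundary."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]


def equivalence_partitions(low, high):
    """One representative length per partition: too short, valid, too long."""
    return {
        "invalid_too_short": low // 2,
        "valid": (low + high) // 2,
        "invalid_too_long": high + 10,
    }


def make_password(length):
    """Build a dummy password of the requested length."""
    return "x" * length


boundary_passwords = [make_password(n) for n in boundary_lengths(MIN_LEN, MAX_LEN)]
partition_passwords = {
    name: make_password(n)
    for name, n in equivalence_partitions(MIN_LEN, MAX_LEN).items()
}
```

Boundary values probe the edges of each partition, while equivalence partitioning needs only one representative per partition, so the two techniques complement each other.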

7) Decision Table Data Set: It is the technique for qualifying your data with combinations of inputs that produce various results. This method of black box testing helps you reduce your testing effort while verifying each combination of test data. Additionally, this technique helps ensure complete test coverage.

Please see below the decision table data set for the Open EMR application’s username and password.

Decision Table data 4

The calculation of the combinations in the table above is described below. You may need it when you have more than four combinations.

  • Number of combinations = Number of Condition 1 values * Number of Condition 2 values
  • Number of combinations = 2 ^ Number of True/False conditions
  • Example: Number of combinations = 2^2 = 4
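
The same calculation can be sketched in Python by enumerating every combination of the two True/False conditions. The condition names and the expected results are assumptions for illustration (login succeeds only when both inputs are valid).

```python
from itertools import product

# Two True/False conditions, as in the username/password example.
conditions = ["valid_username", "valid_password"]

# product() enumerates every combination: 2 ** len(conditions) rows.
decision_table = []
for values in product([True, False], repeat=len(conditions)):
    row = dict(zip(conditions, values))
    # Assumed expected result: login succeeds only when both inputs are valid.
    row["expected"] = "login succeeds" if all(values) else "login denied"
    decision_table.append(row)

for row in decision_table:
    print(row)
```

Adding a third True/False condition to the list would double the number of rows to 2^3 = 8.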

8) State Transition Test Data Set: It is the testing technique that helps you to validate the state transition of the Application Under Test (AUT) by providing the system with the input conditions.

For example, we log in to the Open EMR application by providing the correct username and password at the first attempt. The system gives us access, but if we enter incorrect login data, the system denies access. State transition testing validates how many login attempts you can make before Open EMR closes.

The table below indicates how the system responds to correct and incorrect login attempts:

Login Attempts test data 5
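
A small state model can help derive the expected result for each sequence of attempts. The Python sketch below is only a model for planning test data; the limit of three failed attempts and the state names are assumptions, so check the actual behavior of your AUT.

```python
MAX_FAILED_ATTEMPTS = 3  # assumed limit; check the real value in your AUT


class LoginStateMachine:
    """Tiny model of the login states used to derive expected results."""

    def __init__(self):
        self.state = "logged_out"
        self.failed = 0

    def attempt(self, credentials_ok):
        if self.state == "locked":
            return self.state  # no further attempts are accepted
        if credentials_ok:
            self.state, self.failed = "logged_in", 0
        else:
            self.failed += 1
            if self.failed >= MAX_FAILED_ATTEMPTS:
                self.state = "locked"
        return self.state


def run(attempts):
    """Return the state after each attempt in the sequence."""
    machine = LoginStateMachine()
    return [machine.attempt(ok) for ok in attempts]


print(run([True]))
print(run([False, False, False, True]))
```

Each list passed to `run` is one row of state transition test data: a sequence of correct (True) and incorrect (False) login attempts.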

9) Use Case Test Data: It is the testing method that identifies our test cases by capturing the end-to-end testing of a particular feature.

Example, Open EMR Login:

Success Failure Test Data 6

Properties of Good Test Data

As a tester, you have to test the ‘Examination Results’ module of the website of a university. Consider that the whole application has been integrated and it is in ‘Ready for Testing’ state. ‘Examination Module’ is linked with ‘Registration’, ‘Courses’ and ‘Finance’ modules.

Assume that you have adequate information about the application and you created a comprehensive list of test scenarios. Now you have to design, document and execute these test cases. In ‘Actions/Steps’ or ‘Test Inputs’ section of the test cases, you will have to mention the acceptable data as input for the test.

The data mentioned in test cases must be selected properly. The accuracy of the ‘Actual Results’ column of the Test Case Document depends primarily on the test data. So, the step of preparing the input test data is significantly important. Thus, here is my rundown on “DB Testing – Test Data Preparation Strategies”.

Test Data Properties

The test data should be selected precisely and it must possess the following four qualities:

1) Realistic:

Realistic means the data should be accurate in the context of real-life scenarios. For example, to test the ‘Age’ field, all the values should be positive and 18 or above, since candidates for admission to the university are usually 18 or older (this might be defined differently in the business requirements).

If testing is done using realistic test data, it makes the app more robust, as most of the possible bugs can be captured with realistic data. Another advantage of realistic data is its reusability, which saves the time and effort of creating new data again and again.

While we are talking about realistic data, I would like to introduce the concept of the golden data set (GDS). A golden data set is one that covers almost all the scenarios that can occur in the real project. By using the GDS, we can provide maximum test coverage. I use the GDS for regression testing in my organization, and it helps me test all the scenarios that can occur once the code goes to the production box.

There are many test data generator tools on the market that analyze the column characteristics and user definitions in the database and, based on these, generate realistic test data for you. A few good examples of tools that generate data for database testing are DTM Data Generator, SQL Data Generator, and Mockaroo.

2) Practically valid:

This is similar to realistic but not the same. This property is more related to the business logic of the AUT. For example, the value 60 is realistic in the age field but practically invalid for a candidate for graduation or even master’s programs. In this case, a valid range would be 18-25 years (this might be defined in the requirements).
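
The difference between the two properties can be expressed as two separate checks. The Python sketch below uses the ranges from the example above (18 or above for realistic, 18-25 for practically valid); the function name is made up for illustration, and real limits should come from the business requirements.

```python
def classify_age(age, valid_range=(18, 25)):
    """Label an 'Age' value using the two properties described above."""
    realistic = isinstance(age, int) and age >= 18  # positive and 18 or above
    practically_valid = realistic and valid_range[0] <= age <= valid_range[1]
    return {"realistic": realistic, "practically_valid": practically_valid}


for age in (-3, 20, 60):
    print(age, classify_age(age))
```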

3) Versatile to cover scenarios:

There may be several subsequent conditions in a single scenario, so choose the data shrewdly to cover the maximum aspects of a single scenario with the minimum set of data. For example, while creating test data for the result module, do not consider only the case of regular students who are smoothly completing their program. Pay attention to students who are repeating the same course and belong to different semesters or even different programs. The dataset may look like this:

There might be several other interesting and tricky sub-conditions, e.g. the limit on the number of years to complete a degree program, passing a prerequisite course before registering for a course, or the maximum number of courses a student may enroll in during a single semester. Make sure to cover all these scenarios wisely with a finite set of data.

good test data

4) Exceptional data (if applicable/required):

There may be certain exceptional scenarios that occur less frequently but demand close attention when they do occur, e.g. issues related to disabled students.

Another good explanation & example of the exceptional data set is seen in the image below:

Exceptional data


Test data is considered good if it is realistic, valid, and versatile. It is an added advantage if the data also covers exceptional scenarios.

Test data preparation techniques

We have briefly discussed the important properties of test data and explained why test data selection is important in database testing. Now let’s discuss the techniques to prepare test data.

There are only two ways to prepare test data:

Method #1) Insert New Data

Get a clean DB and insert all the data as specified in your test cases. Once all your required and desired data has been entered, start executing your test cases and fill in the ‘Pass/Fail’ columns by comparing the ‘Actual Output’ with the ‘Expected Output’. Sounds simple, right? But wait, it’s not that simple.

A few essential and critical concerns are as follows:

  • An empty instance of the database may not be available
  • Inserted test data may be insufficient for testing some cases like performance and load testing.
  • Inserting the required test data into a blank DB is not an easy job because of the database table dependencies. This unavoidable restriction can make data insertion a difficult task for the tester.
  • Inserting limited test data (just according to the test case’s needs) may hide some issues that could be found only with a large data set.
  • For data insertion, complex queries and/or procedures may be required, and for this sufficient assistance or help from the DB developer(s) would be necessary.

The five issues mentioned above are the most critical and obvious drawbacks of this technique for test data preparation. But there are some advantages as well:

  • Execution of TCs becomes more efficient as the DB has the required data only.
  • Bug isolation takes less time, as only the data specified in the test cases is present in the DB.
  • Less time is required for testing and comparing results.
  • Clutter-free test process
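
The table-dependency concern noted earlier for this method can be seen in a short Python sketch with an in-memory SQLite database. The `courses`, `students`, and `results` tables are hypothetical, loosely modeled on the university example; the point is only that parent rows must exist before child rows are inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE courses (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE results (
        student_id INTEGER NOT NULL REFERENCES students(id),
        course_id  INTEGER NOT NULL REFERENCES courses(id),
        grade      TEXT NOT NULL
    );
""")

# Parent tables first, then the child table that references them.
conn.execute("INSERT INTO courses VALUES (1, 'Databases')")
conn.execute("INSERT INTO students VALUES (1, 'Test Student')")
conn.execute("INSERT INTO results VALUES (1, 1, 'A')")

# Inserting a child row without its parent is rejected by the database.
try:
    conn.execute("INSERT INTO results VALUES (99, 1, 'B')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

With many interdependent tables, working out this insertion order is where help from the DB developers usually becomes necessary.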

Method #2) Choose sample data subset from actual DB data

This is a feasible and more practical technique for test data preparation. However, it requires sound technical skills and detailed knowledge of the DB schema and SQL. In this method, you copy and use production data, replacing some field values with dummy values. This is the best data subset for your testing, as it represents the production data. But it may not be feasible all the time due to data security and privacy issues.
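
Replacing field values with dummy values can be sketched as follows in Python. The record layout and the `mask_record` helper are hypothetical, and a plain hash like this is only an illustration: it is not sufficient anonymization for real PII, so follow your organization’s data protection policy and tooling in practice.

```python
import hashlib


def mask_record(record, sensitive_fields):
    """Return a copy with sensitive values replaced by stable dummy values."""
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            # Same input always yields the same dummy, so joins across tables still match.
            masked[field] = f"{field}_{digest[:8]}"
    return masked


# Hypothetical production row; the values are made up.
production_row = {"id": 42, "name": "Jane Roe", "ssn": "123-45-6789", "country": "US"}
masked = mask_record(production_row, ["name", "ssn"])
print(masked)
```

Non-sensitive fields such as the country are left untouched, so the subset still reflects the distribution of the production data.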


In the section above, we discussed the test data preparation techniques. In short, there are two techniques: either create fresh data or select a subset of already existing data. Both need to be done in such a way that the selected data provides coverage for various test scenarios, mainly valid and invalid tests, performance tests, and null tests.

In the last section, let us take a quick tour of data generation approaches as well. These approaches are helpful when we need to generate new data.

Test Data Generation Approaches:

  • Manual Test data generation: In this approach, the test data is manually entered by testers as per the test case requirements. It is a time-consuming process and is also prone to errors.
  • Automated Test Data generation: This is done with the help of data generation tools. The main advantage of this approach is its speed and accuracy. However, it comes at a higher cost than manual test data generation.
  • Back-end data injection: This is done through SQL queries. This approach can also update the existing data in the database. It is speedy & efficient but should be implemented very carefully so that the existing database does not get corrupted.
  • Using Third Party Tools: There are tools available in the market that first understand your test scenarios and then generate or inject data accordingly to provide wide test coverage. These tools are accurate as they are customized as per the business needs. But, they are quite costly.
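
As a very small taste of the automated approach, the Python sketch below generates synthetic patient rows with the standard library only. The field names, name lists, and value ranges are assumptions for illustration; dedicated tools offer far richer rules.

```python
import random
import string


def generate_patients(count, seed=42):
    """Generate synthetic patient rows; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    first_names = ["Alex", "Sam", "Maria", "Chen", "Priya"]
    last_names = ["Smith", "Garcia", "Khan", "Nguyen", "Okafor"]
    rows = []
    for patient_id in range(1, count + 1):
        rows.append({
            "id": patient_id,
            "name": f"{rng.choice(first_names)} {rng.choice(last_names)}",
            "age": rng.randint(0, 100),
            "record_no": "".join(rng.choices(string.digits, k=8)),
        })
    return rows


# A larger count gives a quick volume data set for load experiments.
patients = generate_patients(10_000)
```

Using a fixed seed means a failing test can be reproduced with exactly the same data.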


There are 4 approaches to test data generation:

  1. manual,
  2. automation,
  3. back-end data injection,
  4. and third-party tools.

Each approach has its own pros and cons. You should select the approach that satisfies your business and testing needs.


Creating complete software test data in compliance with the industry standards, legislation and the baseline documents of the undertaken project is amongst the core responsibilities of the testers. The more efficiently we manage the test data, the more reliably we can deploy reasonably bug-free products for real-world users.

Test data management (TDM) is the process of analyzing challenges and then introducing and applying the best tools and methods to address the identified issues, without compromising the reliability and full coverage of the end output (the product).

We always need to keep asking questions in search of innovative and more cost-effective methods of testing, including the use of tools for generating the data. It is widely accepted that well-designed data allows us to identify defects in the application under test in every phase of a multi-phase SDLC.

We need to be creative and collaborate with all the members within and outside our agile team. Please share your feedback, experience, questions, and comments so that we can keep our technical discussions going and maximize our positive impact on the AUT by managing data.

Preparing proper test data is a core part of the “project test environment setup”. We can’t simply skip a test case by saying that complete data was not available for testing. The tester should create his/her own test data in addition to the existing standard production data. Your data set should be ideal in terms of cost and time.

Be creative, and use your own skills and judgment to create different data sets instead of relying on standard production data.

Part II – The second part of this tutorial is on “Test Data Generation with GEDIS Studio Online Tool”.

Have you faced the problem of incomplete test data for testing? How did you manage it? Please share your tips, experience, comments, and questions to further enrich this topic of discussion.
