GIS databases evolve constantly. From paper maps through the digital conversion process to data maintained in a database, GIS data are constantly being transformed. Maintaining the integrity and accuracy of these data through a well-designed quality assurance (QA) plan that integrates the data conversion and maintenance phases is key to implementing a successful GIS project.

Poor data negate the usefulness of the technology. Sophisticated software and advanced hardware cannot accomplish anything without specific, reliable, accurate geographic data. GIS technology requires clean data. To maximize the quality of GIS databases, a quality assurance plan must be integrated with all aspects of the GIS project.

The fundamentals of quality assurance never change. Completeness, validity, logical consistency, physical consistency, referential integrity, and positional accuracy are the cornerstones of the QA plan. All well-designed QA strategies must coexist within the processes that create and maintain the data and must incorporate key elements from the classic QA categories. If QA is not integrated within the GIS project, QA procedures can themselves become sources of error.

Categories of Quality Assurance

> Completeness means the data adhere to the database design. All data must conform to a known standard for topology, table structure, precision, projection, and other data model specific requirements.

> Validity measures the attribute accuracy of the database. Each attribute must have a defined domain and range. The domain is the set of all legal values for the attribute. The range is the set of values within which the data must fall.

> Logical consistency measures the interaction between the values of two or more functionally related attributes. If the value of one attribute changes, the values of functionally related attributes must also change. For example, in a database in which the attribute SLOPE and the attribute LANDUSE are related, if the LANDUSE value is "water," then SLOPE must be 0, as any other value for SLOPE would be illogical (see the sketch following this list).

> Physical consistency measures the topological correctness and geographic extent of the database. For example, the requirement that all electrical transformers in an electrical distribution GIS database have annotation denoting phasing placed within 15 feet of the transformer object describes a physically consistent spatial requirement.

> Referential integrity measures the associativity of related tables based upon their primary and foreign key relationships. Primary and foreign keys must exist and must associate sets of data in the tables according to predefined rules for each table.

> Positional accuracy measures how well each spatial object's position in the database matches reality. Positional error can be introduced in many ways. Incorrect cartographic interpretation, insufficient densification of vertices in line segments, and inadequate digital storage precision are just a few sources of positional inaccuracy. These errors can be random, systematic, and/or cumulative in nature. Positional accuracy must always be qualified because the map is a representation of reality.
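
The validity and logical consistency categories translate directly into automated checks. The following Python sketch is a minimal illustration built around the SLOPE/LANDUSE example above; the attribute records, domain, and range values are hypothetical and are not tied to any particular GIS product.

```python
# Minimal sketch of validity and logical consistency checks.
# The records, domain, and range values below are hypothetical.

LANDUSE_DOMAIN = {"water", "forest", "urban", "agriculture"}  # legal values (domain)
SLOPE_RANGE = (0.0, 90.0)                                     # allowable range, in degrees

def check_record(record):
    """Return a list of QA errors found in a single attribute record."""
    errors = []

    # Validity: each attribute value must fall within its domain and range.
    if record["LANDUSE"] not in LANDUSE_DOMAIN:
        errors.append(f"invalid LANDUSE value: {record['LANDUSE']!r}")
    if not SLOPE_RANGE[0] <= record["SLOPE"] <= SLOPE_RANGE[1]:
        errors.append(f"SLOPE out of range: {record['SLOPE']}")

    # Logical consistency: functionally related attributes must agree.
    if record["LANDUSE"] == "water" and record["SLOPE"] != 0:
        errors.append("LANDUSE is 'water' but SLOPE is not 0")

    return errors

records = [
    {"ID": 1, "LANDUSE": "water", "SLOPE": 0.0},   # passes all checks
    {"ID": 2, "LANDUSE": "water", "SLOPE": 3.5},   # logical consistency error
    {"ID": 3, "LANDUSE": "swamp", "SLOPE": 1.0},   # validity error
]

for rec in records:
    for err in check_record(rec):
        print(f"record {rec['ID']}: {err}")
```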


The following section outlines general stages of GIS database creation from an existing paper map to a seamless, continually maintained database and how a QA plan is integrated at each stage.

GIS Database Creation
Map Preparation
Not all maps are ready to be digitized in their original state. Map scrub, or the preparation of paper maps, is the foundation for subsequent steps of the data conversion process. Detecting and correcting errors can be done most cost-effectively at this phase.

Control Review
Benchmarks, corner tic marks, or other surveyed locations, visible and identifiable on the map source, are used to establish coordinate control for the database. Verify the known real-world location for every control point. This is the most important step in the data conversion process: every dollar spent on coordinate control is worth at least two dollars spent later dealing with positional accuracy problems.

Edgematch Review
Edgematching, another critical component of the map preparation process, requires that all features that cross or lie near the map edge be reviewed against the logical and physical consistency requirements and be checked for positional accuracy and duplication. The temporal factor must also be considered: if adjacent maps differ greatly in age, there are bound to be edgematching errors between them. Cultural features are especially prone to this problem.

Primary Key Validation
Map information that will be converted to key information must be closely reviewed. For example, if a fire hydrant number will be used as a key to relate hydrant features to other tables, the database must be checked to ensure that all fire hydrant features have unique numbers; otherwise, referential integrity errors will result.
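
This kind of duplicate-key review is easy to automate once the map data are in digital form. The sketch below assumes a hypothetical list of hydrant records keyed by a HYDRANT_NO field and simply reports any number used by more than one feature.

```python
from collections import Counter

# Hypothetical hydrant records; HYDRANT_NO is intended to be the primary key.
hydrants = [
    {"HYDRANT_NO": "H-1001", "MAP_SHEET": "12A"},
    {"HYDRANT_NO": "H-1002", "MAP_SHEET": "12A"},
    {"HYDRANT_NO": "H-1001", "MAP_SHEET": "12B"},  # duplicate key
]

counts = Counter(h["HYDRANT_NO"] for h in hydrants)
duplicates = [key for key, n in counts.items() if n > 1]

if duplicates:
    print("Duplicate hydrant numbers found:", ", ".join(duplicates))
else:
    print("All hydrant numbers are unique.")
```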

Conflict Resolution
Conflicts that arise when the same data from two or more sources differ must be resolved. Map series must be reviewed together to identify duplicated features and to resolve conflicting positions and conflicting feature attributes. Think of this as "vertical edgematching."

Data Conversion
Creating digital data from paper map sources does not make the data more accurate. In fact, this process introduces additional or different types of errors into the data. The goal of high-quality data conversion is to limit the amount of random and systematic error introduced into the database.

Random error will always be a part of any data, regardless of form. Random error can be reduced by tight controls and automated procedures for data entry.

Systematic error, on the other hand, must be removed from the data conversion process. Systematic error usually stems from a procedural problem; once the procedure is corrected, the systematic error disappears.

Random and systematic error can be corrected by checking data automatically and visually at various stages in the conversion cycle. A short feedback loop between the quality assurance and conversion teams speeds the correction of these problems.

RMS Errors
Registration of paper maps or film separates to a digitizing board, or to images with known coordinate locations, introduces registration error. The amount of error is calculated by the Root Mean Square (RMS) method. Each feature digitized into the database will have an introduced error equivalent to the RMS error. The goal during registration is to keep the RMS error as low as possible. Standards must be set and adhered to during the data conversion process. High RMS errors, in some cases, point to a systematic error such as poor scanner or digitizer calibration.
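
For reference, the RMS error is the square root of the mean of the squared distances between the control points' known real-world locations and their locations after the registration transformation. The sketch below illustrates the calculation with hypothetical control point coordinates; digitizing and scanning software normally reports this value automatically during registration.

```python
import math

# Known real-world control point coordinates and the corresponding coordinates
# produced by the registration transformation. All values are hypothetical.
known = [(1000.0, 2000.0), (1500.0, 2000.0), (1500.0, 2500.0), (1000.0, 2500.0)]
registered = [(1000.4, 1999.7), (1499.6, 2000.3), (1500.5, 2499.8), (999.8, 2500.2)]

# RMS error: square root of the mean squared distance between point pairs.
squared_residuals = [
    (kx - rx) ** 2 + (ky - ry) ** 2
    for (kx, ky), (rx, ry) in zip(known, registered)
]
rms = math.sqrt(sum(squared_residuals) / len(squared_residuals))

print(f"RMS registration error: {rms:.3f} map units")
```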

Visual QA
At various stages in the data conversion process visual QA must be performed. Visually inspecting data can detect systematic errors, such as an overall shift in the data caused by an unusually high RMS value, as well as random errors such as misspelled text. Visual inspections can detect the presence of erroneous data, the absence of data, and problems with positional accuracy. Visual QA can be performed using hard-copy plots or on-screen views. The hard-copy plotting of data is the best method for checking for missing features, misplaced features, and registration errors. On-screen views are an excellent way to verify that edits to the database were made correctly, but are not a substitute for inspecting plots.

Visual inspection should occur during initial data capture, at feature attribution, and again at final data delivery. At initial data capture, review the data for missing or misplaced features and alignment problems that could point to a systematic error. Each error type needs to be evaluated along with the process that created the data in order to determine the appropriate root cause and solution.

Automated QA
Visual inspection of GIS data is reinforced by automated QA methods. GIS databases can be automatically checked for adherence to database design, attribute accuracy, logical consistency, and referential integrity.

Automated QA must occur in conjunction with visual inspection. Automated quality assurance allows quick inspection of large amounts of data. It will report inconsistencies in the database that may not appear during the visual inspection process. Both random and systematic errors are detected using automated QA procedures. Once again, the feedback loop has to be short in order to correct any flawed data conversion processes.
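
As a small illustration of automated checking, the sketch below tests referential integrity between a hypothetical transformer feature table and a related inspection table: every foreign key in the related table must match a primary key in the feature table.

```python
# Hypothetical tables: transformers (feature table) and inspections
# (related table keyed by TRANSFORMER_ID).
transformers = [{"TRANSFORMER_ID": "T-01"}, {"TRANSFORMER_ID": "T-02"}]
inspections = [
    {"INSPECTION_ID": 1, "TRANSFORMER_ID": "T-01"},
    {"INSPECTION_ID": 2, "TRANSFORMER_ID": "T-09"},  # orphan record
]

primary_keys = {t["TRANSFORMER_ID"] for t in transformers}

# Report any related record whose foreign key has no matching primary key.
for record in inspections:
    if record["TRANSFORMER_ID"] not in primary_keys:
        print(f"Referential integrity error: inspection {record['INSPECTION_ID']} "
              f"references missing transformer {record['TRANSFORMER_ID']}")
```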

Data Acceptance
Defining acceptance criteria is probably one of the most troubling segments of the GIS project. Which errors are acceptable? Are certain errors weighted differently than others? What percentage of error constitutes a rejection of data? The answers to these questions are not always obvious and require knowledge of the data model and database design as well as the user needs and application requirements. Project schedule, budget, and human resources all play a role in determining data acceptance.

Acceptance Criteria
Accepting data can be confusing without strict acceptance rules. If a GIS data set has 10 features and each feature contains 10 attributes, what is the percentage of error if one attribute is incorrect? Is it 1 percent or 10 percent (10 x 1 percent)? If you subscribe to the theory that all of the attributes make a feature correct, then the entire feature is in error, making the error percentage 10 percent. On the other hand, if only one attribute is incorrect for a feature, and it is treated as a minor error, then the error percentage is 1 percent because one out of a possible 100 attributes is incorrect.

For a minor attribute the 1 percent error may not be crucial, but what if the attribute in error is a primary key and the error is a nonunique value for that key? This seemingly minor error cascades through the database, jeopardizing relationships to one or more tables. Weighting attributes by importance solves this problem. Each attribute should be reviewed to determine whether it is a critical attribute and then weighted accordingly.
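
The arithmetic behind a weighted scheme is straightforward. The sketch below uses hypothetical weights and inspection results: an error in a critical attribute such as the primary key counts several times more heavily than an error in a minor descriptive attribute.

```python
# Hypothetical attribute weights: a critical attribute such as the primary key
# counts more heavily than minor descriptive attributes.
weights = {"HYDRANT_NO": 5, "DIAMETER": 1, "COLOR": 1, "INSTALL_DATE": 1}

num_features = 10
total_weight = num_features * sum(weights.values())

# Errors found during inspection: (feature id, attribute in error).
errors = [(3, "COLOR"), (7, "HYDRANT_NO")]

error_weight = sum(weights[attr] for _, attr in errors)
error_percentage = 100.0 * error_weight / total_weight

print(f"Weighted error: {error_percentage:.1f} percent")  # 6 of 80 = 7.5 percent
```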

Additionally, the cartographic aspect of data acceptance should be considered. A feature's position, rotation, and scaling must also be taken into account when calculating the percentage of error, not just its existence or absence.

Error Detection
Once the acceptable percentage of error and the weighting scheme are determined, methods of error detection for data acceptance should be established. These methods are the same as those employed during the data conversion phase. Checkplots are compared to original sources and automated database checking tools are applied to the delivered data. Very large databases may require random sampling for data acceptance.
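
When random sampling is used, the sample can be drawn programmatically so that it is unbiased and repeatable. The following sketch, with a hypothetical feature list and sample rate, selects a reproducible subset of features for checkplot comparison.

```python
import random

# Hypothetical feature identifiers and a 5 percent sample rate.
feature_ids = [f"F-{i:05d}" for i in range(1, 20001)]
sample_rate = 0.05

rng = random.Random(42)  # fixed seed so the sample can be reproduced and audited
sample = rng.sample(feature_ids, k=int(len(feature_ids) * sample_rate))

print(f"Inspect {len(sample)} of {len(feature_ids)} features")
```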

Data Maintenance
The data maintenance stage of the project life cycle begins once the database has been accepted. GIS data maintenance involves additions, deletions, and updates to the database. These changes must be done in a tightly controlled environment in order to retain the database's integrity.

Controlling the Environment
Data-specific application programs generally are written to control the maintenance of a GIS database. These applications play two roles in the database update. First, they automate the database update process. Second, they control the methods employed to update the database.

Rigid control provides the user with only one point of entry into the database, which improves the consistency and security of the database. Maintenance applications rely upon a set of business rules that define the features, relationships between features, and update methods.

Maintenance applications are very dependent upon the static database design and the degree to which the database conforms to the design. These applications are usually supported by a database management system composed of permanent and local or temporary storage systems.

Data are checked out from permanent storage, copied to local storage for update, and then posted back to permanent storage to complete the update. In this environment, pre-posting QA checks are required to ensure database integrity. The data storage manager must maintain the database schema so that table structure and spatial data topologies are not destroyed. Automated validation of attribute values should also be part of the pre-posting QA. Visual checkplots can be useful when large amounts of data are either added or removed.
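
A pre-posting check can be as simple as running the automated validations against the edited data and refusing to post while errors remain. The sketch below is a hypothetical outline of such a gate; the record fields and checks are examples and are not taken from any particular database management system.

```python
def validate_edits(edited_records):
    """Pre-posting QA: return a list of errors found in the edited data."""
    errors = []
    for rec in edited_records:
        # Example checks: a required key is present and an attribute value is legal.
        if not rec.get("FEATURE_ID"):
            errors.append("missing FEATURE_ID")
        if rec.get("STATUS") not in {"active", "retired"}:
            errors.append(f"invalid STATUS for feature {rec.get('FEATURE_ID')}")
    return errors

def post_edits(edited_records, permanent_store):
    """Post checked-out edits back to permanent storage only if QA passes."""
    errors = validate_edits(edited_records)
    if errors:
        raise ValueError(f"Posting blocked by pre-posting QA: {errors}")
    permanent_store.extend(edited_records)  # simplified stand-in for the post step

# Usage: local edits are validated before being posted back.
permanent = []
local_edits = [{"FEATURE_ID": "F-10", "STATUS": "active"}]
post_edits(local_edits, permanent)
print(f"Posted {len(permanent)} record(s)")
```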

Scheduled Validation
Scheduled database validation, a must for large multiuser databases, is similar to a 60,000-mile checkup for your car. Early identification of important and potentially costly errors represents real savings. It is always cheaper to fix a bad process than to correct hundreds or even thousands of errors that may have been introduced into a database. These errors may point to a lack of control during the database update process, may be errors caused by last-minute changes in business rules, could be bugs in the maintenance application, or may be caused by inconsistent editing methods. All these errors can be detected during scheduled validation.

Is Quality Assurance Worth the Cost?
High-quality data sets support high-quality analysis. Software applications that interact with and manipulate GIS data make certain assumptions about the quality of the data. Spatial overlay, a classic method of spatial analysis, requires good registration between layers as well as high attribute accuracy and consistency. GIS analysis based on poor data yields poor results.

High-quality GIS databases facilitate sharing. Without some assurance of cleanliness, marketing or sharing data is difficult. A high-quality database with solid quality statistics helps break down barriers to data sharing.

Decisions supported by GIS frequently carry legal liability. The results of locational analysis, such as floodplain evaluation or hazardous waste siting, can be disastrous with poor data. Critical customer service information, such as medical priority in facilities management or addressing in E-911 databases, can mean the difference between life and death.

In the past, issues of quality were often minimized because of the additional short-term costs associated with quality assurance. The average GIS database is very expensive to create and maintain. The reluctance to incur additional cost is understandable. However, the potential cost of poor analysis, application revision, and data reconstruction caused when QA is shortchanged in the project implementation far outweighs the initial cost of a well-designed and well-executed QA plan. Protecting the organization's investment in its data is as prudent as insuring a home or business against catastrophe.

About the Authors
Matthew McCain, who has a bachelor's degree in cartography from the University of Wisconsin at Madison, is Vice President of Dog Creek Design and Consulting, Inc., and lives outside Boulder, Colorado. McCain worked for Esri for seven years in Redlands, California, and Denver, Colorado; while at Esri McCain helped manage the automation of the Digital Chart of the World and contributed to the success of several large AM/FM/GIS data conversion and acceptance projects.

Bill Masters is the president of Dog Creek Design and Consulting, Inc., and lives outside Milwaukee, Wisconsin. Masters has a bachelor's degree in Engineering Physics from the University of Arizona. Masters was the senior programmer on Esri's Digital Chart of the World project. Prior to that, he was a photogrammetric analyst in the Air Force, worked at Oak Ridge National Laboratory, and worked as a senior analyst at a major data conversion vendor.
