Data Usage Dynamics (read-only, volatile, query relevance)
Conventional business applications tend to deal with small sets of data at a time, such as a Customer record and a specific Order. Engineering and High Performance Computing applications tend to deal with much larger datasets. In data-intensive computing, much of the data is generated by complex algorithms rather than at human or business system speeds.
It is important to recognize the usage characteristics of groups of data in order to provide adequate I/O and network bandwidth for delivering the data to its most frequent users in a timely manner.
Typical data usage patterns include:
- Read-only.
- Frequently read (such as lookup tables).
- Extremely high read (streaming) rate or concurrency.
- Always read serially, or always read randomly.
- Always read via an index, such as date/time or latitude and longitude square.
- Directly accessed by name or key value.
- Generally accessed by navigation from another object.
- Infrequently updated.
- Frequently updated (serially or randomly).
- Versioned.
- Always written serially.
- Extremely high ingest rate.
- Transient data that is overwritten when more data arrives.
- Archived after a specified time or when disk space is required.
- Discarded after a specified time period or event.
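As an illustration of the "always read via an index" pattern, the following is a minimal sketch of an index keyed by latitude and longitude square. It is plain Python, not Objectivity code; the names `SquareIndex`, `cell_key` and `CELL_DEGREES`, and the one-degree square size, are assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical square size in degrees; a real system would tune this
# to the density of the data and the shape of typical queries.
CELL_DEGREES = 1.0


def cell_key(lat, lon, cell=CELL_DEGREES):
    """Return the (row, column) of the square that contains the point."""
    return (math.floor(lat / cell), math.floor(lon / cell))


class SquareIndex:
    """Groups items by the latitude/longitude square they fall in."""

    def __init__(self, cell=CELL_DEGREES):
        self.cell = cell
        self._cells = defaultdict(list)

    def add(self, lat, lon, item):
        """File the item under the square containing (lat, lon)."""
        self._cells[cell_key(lat, lon, self.cell)].append(item)

    def lookup(self, lat, lon):
        """Return a copy of the items stored in the square containing (lat, lon)."""
        return list(self._cells.get(cell_key(lat, lon, self.cell), []))
```

A query for any point then touches only the items stored in the same square, which is why data that is always read this way benefits from being clustered by square.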
Each of these usage patterns can be addressed by one or more of the following:
- Selecting appropriate amounts of memory and I/O bandwidth.
- Marking databases as “read-only”.
- Caching and logical/physical (object cluster to file) mapping strategies.
- Object clustering, linking, naming, indexing and versioning policies.
- Segmentation of data streams by time, geography or other significant query parameters.
- Using transient objects or taking advantage of dynamic cache space reuse.
- Using the Objectivity Open File System hooks to the High Performance Storage System [HPSS] or other mass storage devices to transparently stage data from disk to archival media.
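The archive and discard patterns can be expressed as a simple per-group policy. The sketch below is plain Python and does not use the Objectivity or HPSS interfaces; `RetentionPolicy`, `decide`, the field names and the free-space threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetentionPolicy:
    """Hypothetical retention settings for one group of data."""
    archive_after_s: Optional[float] = None   # archive once data is this old
    discard_after_s: Optional[float] = None   # discard once data is this old
    min_free_fraction: float = 0.0            # archive early below this free-disk fraction


def decide(policy, age_s, free_fraction):
    """Return "discard", "archive" or "keep" for data of the given age."""
    # Discard takes priority: expired data is not worth staging to archive.
    if policy.discard_after_s is not None and age_s >= policy.discard_after_s:
        return "discard"
    # Archive after the specified time has elapsed.
    if policy.archive_after_s is not None and age_s >= policy.archive_after_s:
        return "archive"
    # Archive early when disk space is required.
    if free_fraction < policy.min_free_fraction:
        return "archive"
    return "keep"
```

For example, with `RetentionPolicy(archive_after_s=3600, discard_after_s=86400, min_free_fraction=0.1)`, data that is one minute old is kept while 50% of the disk is free, but is archived if free space drops to 5%.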
