24 Oct 2017

Wide Column NoSQL VS. RDMS

By Larry Sitka

In the Enterprise Imaging market today, Windows and SQL Server Relational Database Management System have become the mainstream players underpinning implementation-specific technologies popular in PACS and VNA deployments.   As the United States progresses toward the goal of a “Learning Health System” as defined by the Office of the National Coordinator (ONC) for 2022, we see a much wider breadth and more flexible set of requirements that are changing how the industry will implement both PACS and VNA or VNA-like technology inside the healthcare delivery organizations.  Scrutiny must be placed on implementations that can meet the new demands of lowering implementation and support costs while providing a scalable solution that addresses a Learning Healthcare System.

Below is a summary of key financial and technical benefits of a NoSQL wide-column database implementation:

Financial benefits

  • Hidden DBA FTE costs
  • Verified vendor black-box support
  • Easy licensing model
  • Allows for core expansion and contraction of the database license without downtime

Technical benefits

  • No table locks mean simpler code base
  • Wide-column queries
  • Extensible document datatypes
  • Finally, an answer to the next 1000 users’ questions

 

Cost Removal Benefits of NoSQL Like Cassandra

Access to All Data but No DBA FTE Necessary
Traditional VNA deployments require a Data Base Administrator (DBA) to handle the VNA.  This carries a FTE cost of $150-$200K per year. During the initial stages of VNA deployment, which include primary migration efforts, the DBA is extremely busy given that typical ingest rates can be 10 times that of normal production.  This is true even in cases where a shared RDMS SQL farm exists. Often the DBA is consumed during the first year or two.  Because of the impact of VNA implementation, there is an immense amount of tuning, maintenance, and backup management that will tax existing resources.  Additionally, the organization will have centralized an incredible amount of data in a flexible format and a flat, non-normalized data model that can be leveraged by the emerging analytics tools.  Having an additional DBA resource helps to optimize the intelligence that can be garnered by this information, and requires report writing and analysis when deploying an enterprise imaging platform rooted in the SQL environment.   Or your VNA vendor must provide a black-box service, monitoring and maintaining the database, appropriate security updates, and conflict resolution on your behalf.  Your vendor will still provide complete access to all data content stored inside the NoSQL database, such as Casandra and Elastic.  It’s recommended that organizations use existing third-party tools to access the NoSQL instance such as Hanna/SAP, Hadoop, Kibana, or other overlaid applications.

Maintenance Tasks
Neglecting to check and verify daily mundane maintenance tasks, such as backup plans, index rebuilds, and more can spell disaster. Backups are needed for recovery, and it’s at that point when your DBA realizes the backups were not finishing properly for weeks or months.  Your VNA vendor must work together with local IT staff to assure proper backup of databases and transaction logs.  Your VNA vendor can verify those are working properly and practice a recovery point with the site, thus allowing the site to recover without vendor assistance, should this be desirable.

Licensing Model complexity

Acquiring NoSQL database technology like Cassandra and Elastic is an easy decision to make. You will receive a no-cost licensing model versus the common RDMS licensed models like Microsoft SQL Server, which require the confusion of core count, CAL count, failover and recovery, application access pooling, Cloud, and VM. More importantly, an annual ‘true-up’ is performed by Microsoft. Inside this true-up, it looks for non-conformance licensing. This annual process is intrusive and, in essence, lends itself to a ‘guilty-until-proven-innocent’ model.

VNA migrations within traditional VNA deployments will generate 10 times the traffic which in turn requires 10 times the volume licenses for SQL Server access.  VNA deployments often involve a rush to finish the migrations and is directly linked to decommissioning a PACS application or storage end of life. Once the migration is completed those purchased licenses on a SQL Server are wasted and no longer necessary.

Alternatively, the Cassandra database requires no licensing and hence there is no up-charge for extra licenses.   An administrator can implement the needed Cassandra Elastic VMs to handle the combined production/migration study volumes and remove them when the migration is complete from the Cassandra clusters, simplifying the deployment. Data compression and/or encryption are an operational requirement in today’s VNA deployments on RDMS systems and requires more cores.  If using an RDMS SQL, more cores require larger license counts or even a different product level and thus more cost.   When deploying on Cassandra, compression is inherent and no additional licensing is required when more cores are needed. Encryption of data on the wire and between databases, applications, and nodes typically requires additional CPU cores.  Again, the Cassandra model also requires more cores for encryption but no additional cost.  More cores inside an RDMS database means more cost.

 

NoSQL Cassandra and Elastic Deployment and Architectural Benefits:

Cassandra’s architecture provides inherent High Availability (HA). It is a horizontally scalable, distributed node topology compared to the RDMS model which is typically a primary/secondary datacenter model.

Within the Casandra High Availability (HA) configuration is a concept called a quorum.  The quorum is a cluster consisting of three or more separate physical or VM builds.  This configuration is commonly deployed in an odd-number of separate VMs and physical devices.  When an inbound transaction is received, it is written to a key node and replicated to the remaining nodes according to configuration.  When subsequent changes are made to the same data, they are also written in a similar manner.  When a query is made, all nodes are checked to ensure the most recent version of data is returned correctly, the odd number in the quorum is used to prevent a ‘tie’ in the determination of the correct version of the data being returned.  In simpler terms, a majority of nodes returning a specific version of the data will be the current version of the data.   As Cassandra nodes can be configured to have a three-node cluster in a primary location with other nodes belonging to the same quorum residing in yet another location or even the cloud, the architecture of Cassandra provides an environment that is always ‘on,’ compared to a more primitive primary-to-secondary failover recovery point.

A NoSQL implementation like Cassandra and Elastic are horizontally scalable on demand.  Clustering is made easy.  Cassandra requires no downtime to increase or decrease the quorum count to scale horizontally.  If greater capacity is required, additional nodes are spun up and attached to the existing cluster.  In situations where an initial, larger deployment is required to handle a migration, for example, scaling back and reducing the number of Cassandra nodes when the migration is complete, Cassandra is equally efficient.   In addition, no down time is required. Most RDMS systems require down time and outage for expansion, clustering or retraction of servers.

A more efficient method of VNA data replication is called ‘Meta-data’ replication as compared to the standard DICOM replication model.  In a traditional RDMS deployment, a DICOM replication model is used.  Meta-data replication is inherent in Cassandra as opposed to the DICOM replication services used by most RDMS VNA deployments.    Sharing a quorum across datacenters allows for replication and clustering within those datacenters without slowing traffic.   In addition, your VNA vendor should leverage the storage platform to replicate and synchronize file content while the database metadata is replicated via NoSQL Casandra and Elastic.  This dramatically simplifies the migration deployments and reduces the cost in infrastructure to do those migrations.

Database Insertion
On insertions, RDMS are fixed tuples, meaning they are structured and offer only limited content.  In comparison, a Key Value pair “Wide Column” database in a Cassandra deployment allows for easy insertions of all content.   RDMS are constrained and focus on known traffic and query patterns.  Primary and foreign key relational schemas, index tables and hints are created so the frequent queries can be resolved quickly and table inserts are defined inside a normalized restrictive database schema.   VNAs are traditionally relational or hierarchical given that DICOM is hierarchical.  Most VNAs are deployed inside the RDMS environments. The RDMS fails when queries start going outside the existing column declarations or outside existing prebuilt index tables.  RDMS typically require massive database and software changes, whereas wide-column databases are extensible and fit newer datatypes like JSON and XML natively.  In short, RDMS may have a few dozen column pairs, but wide-column databases can have hundreds of columns.  The wide-column database insertion and quorum architecture of Cassandra allow for the insertion of a row, not from just one location, but from any arrival point inside the quorum, and in parallel. The amount of parallelism is in effect unbounded, using a horizontally scalable approach.  Inserts in RDMS require table locks and stall and recovery/rollback techniques to resolve deadlocks.  In Cassandra, table locks do not occur and therefore do not stall the insertion or require a rollback transaction-specific code to handle retries.

Database Query Capabilities
In short, there are no table locks on insert for Cassandra.   There can be, however, the odd case where going outside the indexed column pairs causes the RDMS model to fail by creating a massive flat line in the CPU and memory utilization. From an analytics perspective, traditional inline query operations on the RDMS happens quickly. However, the data that interests humans and machines lies outside normal parameters, and results in the occasional unbounded, ad-hoc queries.   These outlying queries cause a massive spike in computer resources including database locks on the RDMS systems.  Cassandra provides the capability of no table locks and allows for wide queries.  There are plenty of documented cases where out-of-bound queries on a RDMS system have caused a complete lock on the database until the query is completed or killed.  A frequent call to support services resulting from such a system deadlock occurs and it is discovered that a query was unleashed causing this scenario.  RDMS queries that are unbounded and open-ended singletons with no timeouts are worrisome and a constant risk to system uptimes and poor performance.

Extensible Document Datatypes
RDMS databases now support an XML-like datatype that are restrictive in use and space reservation is required even if no content is provided.  In contrast, Cassandra supports a wide variety of datatypes including document types.   Newer technologies such as XML and JSON fit inside NoSQL databases like Cassandra extremely well.  Finally, expansive datatypes such as direct HL7 or DICOM metadata also fit well inside Cassandra.  It should be noted input and output streams from FHIR are either JSON or XML.

Demand for Analytics and the Next 1000 Users Questions Resolved
Analytics and the demand for data in its frequency will outpace any RDMS systems capability to expand.   Analytics applications require not just daily controlled queries but dynamic as opposed to inline precompiled queries that focus on outliers of information.   The queries run over and over repetitively on sub-second intervals. As those results are analyzed, subsequent unbound and unrestrictive queries happen in the background or even in parallel and these analytic learning machines tunnel down deeper for results focusing on the outlier data points of interest.  The result is a single query that can generate an exhaustive set of additional queries all in parallel and out of normal context.

Who uses Cassandra today? Today’s social networks, cellular companies, or large varying content sites such as Netflix, use Cassandra because of its scalability and its free-form column and unrestrictive approach to managing and accessing data.

In conclusion, there is a clear and compelling case to reconsider the traditional and outmoded approach of Microsoft SQL serving as the underlying database model of today’s expansive enterprise imaging informatics platforms.  Achieving scalability, extreme performance and high availability are crucial to ensure that all data in a healthcare environment can be used for patient contextual diagnosis and care.  Even Microsoft is introducing Azure Elastic services to meet the needs of data flexibility and liquidity. Future HIT requirements are being defined in the ONC today that require a flexible wide-column data model.  The NoSQL database such as Cassandra will allow your organization to meet the demands and enable success.

 

About Larry Sitka:

Vice President & Chief Strategic Information Officer, Enterprise Applications, Canon Medical Informatics, Inc.

  • Masters of Computer Science, NTU; BS Computer Science, University of Wisconsin Madison
  • Founded Acuo Technologies, 1999
  • 27 years of experience in medical imaging
  • Healthcare industry author and speaker
  • Holder of serveral US and EU patents focused on digital asset management across healthcare, printing and publishing, defense mapping and photographic businesses
  • Healthcare strategist and evangelist