
Cassandra: The Definitive Guide
Distributed Data at Web Scale
Second Edition

Jeff Carpenter and Eben Hewitt

Beijing, Boston, Farnham, Sebastopol, Tokyo

Cassandra: The Definitive Guide
by Jeff Carpenter and Eben Hewitt

Copyright © 2016 Jeff Carpenter, Eben Hewitt. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( ). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]

Editors: Mike Loukides and Marie Beaugureau
Production Editor: Colleen Cole
Copyeditor: Jasmine Kwityn
Proofreader: James Fraleigh
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

June 2016: Second Edition

Revision History for the Second Edition
2010-11-12: First Release
2016-06-27: Second Release
2017-04-07: Third Release

See for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Cassandra: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-93366-4
[LSI]

This book is dedicated to my sweetheart, Alison Brown. I can hear the sound of violins, long before it begins.
- E.H.

For Stephanie, my inspiration, unfailing support, and the love of my life.
- J.C.

Table of Contents

Foreword
Foreword
Preface

1. Beyond Relational Databases
    What’s Wrong with Relational Databases?
    A Quick Review of Relational Databases
    RDBMSs: The Awesome and the Not-So-Much
    Web Scale
    The Rise of NoSQL
    Summary

2. Introducing Cassandra
    The Cassandra Elevator Pitch
    Cassandra in 50 Words or Less
    Distributed and Decentralized
    Elastic Scalability
    High Availability and Fault Tolerance
    Tuneable Consistency
    Brewer’s CAP Theorem
    Row-Oriented
    High Performance
    Where Did Cassandra Come From?
    Release History
    Is Cassandra a Good Fit for My Project?
    Large Deployments
    Lots of Writes, Statistics, and Analysis
    Geographical Distribution
    Evolving Applications
    Getting Involved
    Summary

3. Installing Cassandra
    Installing the Apache Distribution
    Extracting the Download
    What’s In There?
    Building from Source
    Additional Build Targets
    Running Cassandra
    On Windows
    On Linux
    Starting the Server
    Stopping Cassandra
    Other Cassandra Distributions
    Running the CQL Shell
    Basic cqlsh Commands
    cqlsh Help
    Describing the Environment in cqlsh
    Creating a Keyspace and Table in cqlsh
    Writing and Reading Data in cqlsh
    Summary

4. The Cassandra Query Language
    The Relational Data Model
    Cassandra’s Data Model
    Clusters
    Keyspaces
    Tables
    Columns
    CQL Types
    Numeric Data Types
    Textual Data Types
    Time and Identity Data Types
    Other Simple Data Types
    Collections
    User-Defined Types
    Secondary Indexes
    Summary

5. Data Modeling
    Conceptual Data Modeling
    RDBMS Design
    Design Differences Between RDBMS and Cassandra
    Defining Application Queries
    Logical Data Modeling
    Hotel Logical Data Model
    Reservation Logical Data Model
    Physical Data Modeling
    Hotel Physical Data Model
    Reservation Physical Data Model
    Materialized Views
    Evaluating and Refining
    Calculating Partition Size
    Calculating Size on Disk
    Breaking Up Large Partitions
    Defining Database Schema
    DataStax DevCenter
    Summary

6. The Cassandra Architecture
    Data Centers and Racks
    Gossip and Failure Detection
    Snitches
    Rings and Tokens
    Virtual Nodes
    Partitioners
    Replication Strategies
    Consistency Levels
    Queries and Coordinator Nodes
    Memtables, SSTables, and Commit Logs
    Caching
    Hinted Handoff
    Lightweight Transactions and Paxos
    Tombstones
    Bloom Filters
    Compaction
    Anti-Entropy, Repair, and Merkle Trees
    Staged Event-Driven Architecture (SEDA)
    Managers and Services
    Cassandra Daemon
    Storage Engine
    Storage Service
    Storage Proxy
    Messaging Service
    Stream Manager
    CQL Native Transport Server
    System Keyspaces
    Summary

7. Configuring Cassandra
    Cassandra Cluster Manager
    Creating a Cluster
    Seed Nodes
    Partitioners
    Murmur3 Partitioner
    Random Partitioner
    Order-Preserving Partitioner
    ByteOrderedPartitioner
    Snitches
    Simple Snitch
    Property File Snitch
    Gossiping Property File Snitch
    Rack Inferring Snitch
    Cloud Snitches
    Dynamic Snitch
    Node Configuration
    Tokens and Virtual Nodes
    Network Interfaces
    Data Storage
    Startup and JVM Settings
    Adding Nodes to a Cluster
    Dynamic Ring Participation
    Replication Strategies
    SimpleStrategy
    NetworkTopologyStrategy
    Changing the Replication Factor
    Summary

8. Clients
    Hector, Astyanax, and Other Legacy Clients
    DataStax Java Driver
    Development Environment Configuration
    Clusters and Contact Points
    Sessions and Connection Pooling
    Statements
    Policies
    Metadata
    Debugging and Monitoring
    DataStax Python Driver
    DataStax Node.js Driver
    DataStax Ruby Driver
    DataStax C# Driver
    DataStax C/C++ Driver
    DataStax PHP Driver
    Summary

9. Reading and Writing Data
    Writing
    Write Consistency Levels
    The Cassandra Write Path
    Writing Files to Disk
    Lightweight Transactions
    Batches
    Reading
    Read Consistency Levels
    The Cassandra Read Path
    Read Repair
    Range Queries, Ordering and Filtering
    Functions and Aggregates
    Paging
    Speculative Retry
    Deleting
    Summary

10. Monitoring
    Logging
    Tailing
    Examining Log Files
    Monitoring Cassandra with JMX
    Connecting to Cassandra via JConsole
    Overview of MBeans
    Cassandra’s MBeans
    Database MBeans
    Networking MBeans
    Metrics MBeans
    Threading MBeans
    Service MBeans
    Security MBeans
    Monitoring with nodetool
    Getting Cluster Information
    Getting Statistics
    Summary

11. Maintenance
    Health Check
    Basic Maintenance
    Flush
    Cleanup
    Repair
    Rebuilding Indexes
    Moving Tokens
    Adding Nodes
    Adding Nodes to an Existing Data Center
    Adding a Data Center to a Cluster
    Handling Node Failure
    Repairing Nodes
    Replacing Nodes
    Removing Nodes
    Upgrading Cassandra
    Backup and Recovery
    Taking a Snapshot
    Clearing a Snapshot
    Enabling Incremental Backup
    Restoring from Snapshot
    SSTable Utilities
    Maintenance Tools
    DataStax OpsCenter
    Netflix Priam
    Summary

12. Performance Tuning
    Managing Performance
    Setting Performance Goals
    Monitoring Performance
    Analyzing Performance Issues
    Tracing
    Tuning Methodology
    Caching
    Key Cache
    Row Cache
    Counter Cache
    Saved Cache Settings
    Memtables
    Commit Logs
    SSTables
    Hinted Handoff
    Compaction
    Concurrency and Threading
    Networking and Timeouts
    JVM Settings
    Memory
    Garbage Collection
    Using cassandra-stress
    Summary

13. Security
    Authentication and Authorization
    Password Authenticator
    Using CassandraAuthorizer
    Role-Based Access Control
    Encryption
    SSL, TLS, and Certificates
    Node-to-Node Encryption
    Client-to-Node Encryption
    JMX Security
    Securing JMX Access
    Security MBeans
    Summary

14. Deploying and Integrating
    Planning a Cluster Deployment
    Sizing Your Cluster
    Selecting Instances
    Storage
    Network
    Cloud Deployment
    Amazon Web Services
    Microsoft Azure
    Google Cloud Platform
    Integrations
    Apache Lucene, SOLR, and Elasticsearch
    Apache Hadoop
    Apache Spark
    Summary

Index

Foreword

Cassandra was open-sourced by Facebook in July 2008. This original version of Cassandra was written primarily by an ex-employee from Amazon and one from Microsoft. It was strongly influenced by Dynamo, Amazon’s pioneering distributed key/value database. Cassandra implements a Dynamo-style replication model with no single point of failure, but adds a more powerful “column family” data model.

I became involved in December of that year, when Rackspace asked me to build them a scalable database. This was good timing, because all of today’s important open source scalable databases were available for evaluation. Despite initially having only a single major use case, Cassandra’s underlying architecture was the strongest, and I directed my efforts toward improving the code and building a community.

Cassandra was accepted into the Apache Incubator, and by the time it graduated in March 2010, it had become a true open source success story, with committers from Rackspace, Digg, Twitter, and other companies that wouldn’t have written their own database from scratch, but together built something important.

Today’s Cassandra is much more than the early system that powered (and still powers) Facebook’s inbox search; it has become “the hands-down winner for transaction processing performance,” to quote Tony Bain, with a deserved reputation for reliability and performance at scale.
As Cassandra matured and began attracting more mainstream users, it became clear that there was a need for commercial support; thus, Matt Pfeil and I cofounded Riptano in April 2010. Helping drive Cassandra adoption has been very rewarding, especially seeing the uses that don’t get discussed in public.

Another need has been a book like this one. Like many open source projects, Cassandra’s documentation has historically been weak. And even when the documentation ultimately improves, a book-length treatment like this will remain useful.

Thanks to Eben for tackling the difficult task of distilling the art and science of developing against and deploying Cassandra. You, the reader, have the opportunity to learn these new concepts in an organized fashion.

- Jonathan Ellis
Project Chair, Apache Cassandra, and Cofounder and CTO, DataStax

Foreword

I am so excited to be writing the foreword for the new edition of Cassandra: The Definitive Guide. Why? Because there is a new edition! When the original version of this book was written, Apache Cassandra was a brand new project. Over the years, so much has changed that users from that time would barely recognize the database today. It’s notoriously hard to keep track of fast moving projects like Apache Cassandra, and I’m very thankful to Jeff for taking on this task and communicating the latest to the world.

One of the most important updates to the new edition is the content on modeling your data. I have said this many times in public: a data model can be the difference between a successful Apache Cassandra project and a failed one. A good portion of this book is now devoted to understanding how to do it right.

Operations folks, you haven’t been left out either. Modern Apache Cassandra includes things such as virtual nodes and many new options to maintain data consistency, which are all explained in the second edition. There’s so much ground to cover; it’s a good thing you got the definitive guide!
Whatever your focus, you have made a great choice in learning more about Apache Cassandra. There is no better time to add this skill to your toolbox. Or, for experienced users, maintaining your knowledge by keeping current with changes will give you an edge. As recent surveys have shown, Apache Cassandra skills are some of the highest paying and most sought after in the world of application development and infrastructure. This also shows a very clear trend in our industry. When organizations need a highly scaling, always-on, multi-datacenter database, you can’t find a better choice than Apache Cassandra. A quick search will yield hundreds of companies that have staked their success on our favorite database. This trust is well founded, as you will see as you read on.

As applications are moving to the cloud by default, Cassandra keeps up with dynamic and global data needs. This book will teach you why and how to apply it in your application. Build something amazing and be yet another success story.

And finally, I invite you to join our thriving Apache Cassandra community. Worldwide, the community has been one of the strongest non-technical assets for new users. We are lucky to have a thriving Cassandra community, and collaboration among our members has made Apache Cassandra a stronger database. There are many ways you can participate. You can start with simple things like attending meetups or conferences, where you can network with your peers. Eventually you may want to make more involved contributions like writing blog posts or giving presentations, which can add to the group intelligence and help new users following behind you. And, the most critical part of an open source project, make technical contributions. Write some code to fix a bug or add a feature. Submit a bug report or feature request in a JIRA. These contributions are a great measurement of the health and vibrancy of a project.
You don’t need any special status, just create an account and go! And when you need help, refer back to this book, or reach out to our community. We are here to help you be successful.

Excited yet? Good! Enough of me talking, it’s time for you to turn the page and start learning.

- Patrick McFadin
Chief Evangelist for Apache Cassandra, DataStax

Preface

Why Apache Cassandra?

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems (RDBMSs).

Cassandra first started as an Incubator project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made releases ever since. Cassandra is being used in production by some of the biggest companies on the Web, including Facebook, Twitter, and Netflix.

Its popularity is due in large part to the outstanding technical features it provides. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a data model based on the Cassandra Query Language (CQL).

Is This Book for You?

This book is intended for a variety of audiences.
It should be useful to you if you are:

• A developer working with large-scale, high-volume applications, such as Web 2.0 social applications or ecommerce sites
• An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores
• A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store
• A manager who wants to understand the advantages (and disadvantages) of Cassandra and related columnar databases to help make decisions about technology strategy
• A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options

This book is a technical guide. In many ways, Cassandra represents a new way of thinking about data. Many developers who gained their professional chops in the last 15–20 years have become well versed in thinking about data in purely relational or object-oriented terms. Cassandra’s data model is very different and can be difficult to wrap your mind around at first, especially for those of us with entrenched ideas about what a database is (and should be).

Using Cassandra does not mean that you have to be a Java developer. However, Cassandra is written in Java, so if you’re going to dive into the source code, a solid understanding of Java is crucial. Although it’s not strictly necessary to know Java, it can help you to better understand exceptions, how to build the source code, and how to use some of the popular clients. Many of the examples in this book are in Java. But because of the interface used to access Cassandra, you can use Cassandra from a wide variety of languages, including C#, Python, node.js, PHP, and Ruby.
Finally, it is assumed that you have a good understanding of how the Web works, can use an integrated development environment (IDE), and are somewhat familiar with the typical concerns of data-driven applications. You might be a well-seasoned developer or administrator but still, on occasion, encounter tools used in the Cassandra world that you’re not familiar with. For example, Apache Ant is used to build Cassandra, and the Cassandra source code is available via Git. In ca...