Hey, I’ve contributed something (minor) to @apachekafka!

github.com/apache/kafka/c…

I’ve been thinking about migrating my (insignificant) blog to a platform that doesn’t support comments, but just now got a useful comment. 🤔

“US air strike in Syria kills nearly 60 civilians ‘mistaken for Isil fighters’”

telegraph.co.uk/news/2016/07/1…

Fucking horrible. Fuck. Fuck!

Notes from the First Kafka Summit

I learned a lot at the first Kafka Summit, organized by Confluent, and I’d love to try to share some of what I learned. This is a repost from the internal tech blog at Park Assist, and it’s a bit out of date as the conference was in April, but I figure it might have some value to someone stumbling across it.

High-level summary

  • There’s a shit-ton of work going on right now in streaming data, at pretty much every level
  • There are a ton of stream processing (SP) frameworks and libraries being very actively developed, including Kafka Streams, Beam, Spark 2, and Flink (all covered below)
  • Samza didn’t seem to be present. Curious and worrying. I’ve sent an inquiry to one of its creators.
  • Kafka is working really well for lots of companies
  • But:
    • it still has some rough edges: deployment, operations, management, durability (with default settings)
    • the companies it’s working well for tend to be larger ones that can invest substantially in infrastructure, tooling, and operations
  • There’s a gap in the market for information, guidance, and services for small and medium companies
  • Heroku is stepping into that gap with a hosted Kafka service.
    • This is very exciting. I’m psyched to try it out.

Session Videos

Confluent has posted videos of every single session — very cool!

Some of my favorites:

And some sessions I missed but plan to watch:

Deployment, Operations, Management, and Tooling; or: Kafka Itself

I didn’t spend much time focused on these topics, but I did learn a few things:

  • larger companies that use Kafka heavily have built lots of tooling around it:
    • schema registries
    • topic registries
    • data dictionaries
    • proxies
    • client libraries
  • lots of companies have made lots of mistakes when using Kafka:
    • large mess of undifferentiated topics
      • no namespacing (e.g. prefixing)
      • no differentiating between public and private topics
    • problematic settings
      • it’s very easy to accidentally lose data
      • the replication defaults are no good (wtf)
  • my conclusions:
    • the cognitive load involved in operating Kafka well is currently high
      • Specifically in the case of a large cluster with robust replication and many topics and many clients
      • I think our use case for on-site installs will be simpler: almost always two machines right next to each other, no AZs, etc
    • Using Kafka and stream processing for a specific app, component, or use case is fairly straightforward, and as of this moment we have some pretty great options (modern producer, modern consumer, schema registry)
      • But using it as the “nervous system” for an entire company is a whole different thing; you need more documentation, more conventions (e.g. around data evolution), more tooling, etc.
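
To make two of the pitfalls above a bit more concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything shown at the summit: the topic naming convention (`<visibility>.<team>.<name>`), the helper names, and the idea of linting configs are all assumptions. The setting names (`acks`, `min.insync.replicas`, `unclean.leader.election.enable`) are real Kafka settings, but the fallback defaults used here reflect my understanding of Kafka around the time of the summit and should be checked against whatever version you run. The `replication.factor` key stands for the replication factor chosen when the topic is created.

```python
import re

# Hypothetical convention: "<visibility>.<team>.<name>", e.g. "public.billing.invoices".
TOPIC_PATTERN = re.compile(r"(public|private)\.[a-z0-9-]+\.[a-z0-9-]+")


def is_well_named(topic: str) -> bool:
    """Return True if the topic follows the (assumed) prefixing convention."""
    return TOPIC_PATTERN.fullmatch(topic) is not None


def durability_warnings(producer_cfg: dict, topic_cfg: dict) -> list:
    """Flag settings that are commonly associated with silent data loss.

    Missing keys fall back to the defaults I assume for older Kafka versions.
    """
    warnings = []
    if str(producer_cfg.get("acks", "1")) not in ("all", "-1"):
        warnings.append("acks is not 'all': a write can be acknowledged before it is replicated")
    if int(topic_cfg.get("replication.factor", 1)) < 2:
        warnings.append("replication.factor < 2: a single broker failure can lose the partition")
    if int(topic_cfg.get("min.insync.replicas", 1)) < 2:
        warnings.append("min.insync.replicas < 2: acks=all may be satisfied by the leader alone")
    if str(topic_cfg.get("unclean.leader.election.enable", "true")).lower() == "true":
        warnings.append("unclean leader election enabled: an out-of-sync replica may become leader")
    return warnings


if __name__ == "__main__":
    print(is_well_named("public.billing.invoices"))
    for warning in durability_warnings({}, {}):
        print("WARNING:", warning)
```

Whether each warning actually matters depends on the deployment; for a small two-machine install the right trade-off between durability and availability may well differ from that of a large multi-AZ cluster.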

Stream Processing

  • A few different talks covered the principles and models involved in SP
    • there seems to be a converging consensus that we need models, APIs, and DSLs that support processing events based on event time rather than processing time, and that can handle out-of-order events.
      • Flink (out now), Beam, Spark 2, and Kafka Streams (all coming soon) all support this
    • In addition to aligned windows and sliding windows, with which we’re probably familiar, another window type that was discussed was “session windows”, which could be super useful for our visits.
      • If you squint, then a bay’s visits can be thought of as sessions.
    • The Beam folks are positing that we need models (APIs and DSLs) that have first-class support for early (tentative/speculative) results, on-time results, late results, and corrections.
      • They use something called a watermark to determine when to produce potentially on-time results
      • Each streaming transformation that uses aggregation (which is by necessity windowed) really should include a specification for how to handle “refinements”
      • Kafka Streams seems to support this, albeit maybe in a somewhat rigid way
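
The session-window idea above can be sketched in a few lines of Python. This is a simplified batch illustration, not how any of the frameworks implement it: a real streaming engine merges sessions incrementally as events arrive. The `Event` type, the `bay-7` key, and the 60-second gap are all made up for the example. The point is that grouping is driven by event time, so the order in which events arrive does not change the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str         # hypothetical identifier, e.g. a parking bay
    event_time: int  # seconds, when the event happened at the source


def session_windows(events, gap):
    """Group events per key into sessions based on event time.

    A new session starts whenever the time since the previous event
    for the same key exceeds `gap`. Returns {key: [(start, end), ...]}.
    """
    sessions = {}
    # Sorting by event time is what makes arrival order irrelevant here.
    for ev in sorted(events, key=lambda e: (e.key, e.event_time)):
        key_sessions = sessions.setdefault(ev.key, [])
        if key_sessions and ev.event_time - key_sessions[-1][1] <= gap:
            start, _ = key_sessions[-1]
            key_sessions[-1] = (start, ev.event_time)  # extend the open session
        else:
            key_sessions.append((ev.event_time, ev.event_time))  # start a new one
    return sessions


if __name__ == "__main__":
    # Listed in arrival order; the event at t=130 arrives after the one at t=160.
    arrivals = [Event("bay-7", 100), Event("bay-7", 160),
                Event("bay-7", 130), Event("bay-7", 1000)]
    print(session_windows(arrivals, gap=60))
```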

Kafka Streams

  • SP framework that’s more of a library
  • Combines the Kafka Consumer and Producer with a stream processing DSL to make sophisticated windowing, joining, and aggregation operations accessible
  • Supports event time and unordered events
  • Initial release (as part of Kafka 0.10) is targeted to be some time this summer
  • Vision is to make it radically easier to integrate sophisticated stream processing into apps
  • Because it’s “just a library” you can create SPs and deploy them however you want — deployment is decoupled from use of the library

Beam

  • An open-source fork of the core model behind Google Cloud Dataflow
  • Appear to be aiming for a late summer initial release
  • A core API that’s sort-of a DSL, for expressing stream processing operations with a high-level (declarative) syntax that can then be executed by an execution engine (a “runner”)
    • Sort-of like SQL but without the textual language — just a programmatic API for now
  • Includes runners for Cloud Dataflow, Flink, and Spark 1.
  • The Beam Model
    • What » Where » When » How
    • What are you computing?
    • Where in event time?
    • When in processing time?
    • How do refinements relate?
  • You specify (declare) those 4 things then the runner interprets them
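
Here is a toy Python simulation of those four questions, as I understand them. It does not use Beam’s actual API, and the function name, the watermark heuristic (largest event time seen so far minus a fixed lag), and the labels are all assumptions of mine. It covers on-time and late results only; early (speculative) results are left out to keep it short.

```python
from collections import defaultdict


def run_pipeline(arrivals, window_size, watermark_lag):
    """Simulate a windowed sum with on-time results and late refinements.

    arrivals: (event_time, value) pairs in processing (arrival) order.
    Returns a list of (window_start, running_sum, label) emissions.
    """
    sums = defaultdict(int)  # What: a sum per window
    fired = set()            # windows that have already produced a result
    emitted = []
    max_seen = float("-inf")
    watermark = float("-inf")
    for event_time, value in arrivals:
        # Where: fixed windows in event time
        window = event_time - event_time % window_size
        sums[window] += value
        if window + window_size <= watermark:
            # How: a late event re-emits the accumulated (corrected) total
            emitted.append((window, sums[window], "late"))
            fired.add(window)
        # When: assumed heuristic watermark, trailing the largest event time seen
        max_seen = max(max_seen, event_time)
        watermark = max_seen - watermark_lag
        for w in sorted(sums):
            if w not in fired and w + window_size <= watermark:
                fired.add(w)
                emitted.append((w, sums[w], "on-time"))
    return emitted


if __name__ == "__main__":
    # The event at t=4 arrives after the watermark has passed its window.
    arrivals = [(1, 1), (3, 1), (12, 1), (16, 1), (4, 1), (27, 1)]
    for emission in run_pipeline(arrivals, window_size=10, watermark_lag=5):
        print(emission)
```

In this sketch refinements are accumulating: a late result replaces the earlier total for its window. Emitting only the delta would be another reasonable choice.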

Spark 2

  • Under active development right now, shooting for release next month
  • Introduces a new unified API that unifies processing of bounded and unbounded data, supports event-time processing, and unordered events
  • Support for Kafka is coming soon, but might not be released at exactly the same time as Spark 2 itself
    • They’re considering unbundling it from the core Spark codebase and releasing it as a plugin

Flink

I didn’t actually attend the Flink session, but it was mentioned many times in many different sessions.

Flink is on my radar for a few reasons:

  • It aims to be a comprehensive framework covering both bounded and unbounded data processing (just like Spark 2)
  • It supports event time and unordered events
  • Apache Beam includes a Flink runner that supports almost all of Beam’s semantics
    • Seems to me that this could make it “easy” to migrate an SP operation from Flink’s API to Beam’s
  • It recently hit 1.0 so it should be more mature than both Kafka Streams and Spark 2

Conclusion

I’m more convinced than ever that Kafka, and the paradigm it embodies, can make our systems radically simpler, faster, more maintainable, and more agile. It’s still early days and it’s going to take time and hard work to realize that potential. Thankfully there’s a large, energized, robust community putting in that time and hard work to make it happen.

Just discovered a new favorite Dewey Decimal classification: 303.483. And 303.484! What a shelf.

Seriously, “architecting” is not a word. The verb is “design”.

I’m deeply disturbed by the killings of Alton Sterling and Philando Castile by the police officers who were charged with protecting them.

Goofballs instagram.com/p/BHR-PJHjQmA/

This was totally unintentional! instagram.com/p/BHR926-jfGv/

I’m no expert, but when the media reports on financial anxiety, doesn’t that amplify the anxiety?

Related: npr.org/sections/money…

WIRED: What’s your SpaceX then?

Sanghvi: “I don’t know yet, if I did I would be doing it.”

❤️💪🤘

wired.com/2014/12/ruchi-…

Only Human: I Thought the Truth Would Be Enough

Illuminating and inspiring!

overcast.fm/+FN4a8doRE

Appalling email from the vultures at @QuickenLoans:

“Turmoil in Europe means opportunity in America” https://t.co/RaqpN2xas4

Wow, I was so corny 3 years ago. https://t.co/8oXXQub8wA

My kingdom for some emotional intelligence!

My favorite screen in Instagram. ❤️😍❤️😍❤️ instagram.com/p/BHAMr18MCdn/

Greenwich! @ Greenwich, Connecticut instagram.com/p/BHAMkFRsCdJ/

Greenwich! @ Greenwich, Connecticut instagram.com/p/BHAMguHsCc8/

Greenwich! @ Greenwich, Connecticut instagram.com/p/BHAJqyjsCWY/

All the ACs were turned off at the office today. My brain shut down due to thermal overload. FML.

Just heard “Come Down” by Anderson Paak for the 1st time, then some frenzied Googling turned this up:

republicordeath.wordpress.com/tag/jewish-mus…

WTF, Monday?

TIL: emoji-driven development is pretty 💩

🤓 CSP in the small, CSP in the large? 🤔

Sweet! Just bought a ticket to @strangeloop_stl!

Folks at @ServerlessConf keep presenting the benefits of #serverless — great, but let’s discuss the trade offs as well!