On one gig, I worked at a fairly new startup, where my years on real-time market systems mattered. They were taking in near-real-time IoT data, and at a million events per minute on the day I started, it was going to be entertaining.
I was working under one of the CTO's leads, but I was there to troubleshoot, and I loved the working relationship I had with him.
Most of the system was built as a multi-stage pipeline, and I was mostly involved with the front-end ingestion, before anything was passed to the analytics engine.
The front end was based on Apache NiFi. I have no idea where that product wound up, but it was designed to take in everything from a raw UDP feed to scanning FTP sites every minute to grab more data whenever the feed ran dry.
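Raw UDP ingestion, the kind NiFi's listeners handle at the front of a pipeline like this, can be sketched in plain Python. This is a toy stand-in, not NiFi itself; the helper names and the sensor payload format are made up for illustration:

```python
import socket

def open_udp_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket for push-style ingestion.
    Port 0 lets the OS pick a free port (handy for local testing)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock

def read_datagrams(sock, count):
    """Pull `count` raw datagrams off the socket, one event per datagram.
    65535 is the maximum possible UDP payload size."""
    return [sock.recvfrom(65535)[0] for _ in range(count)]

# Local smoke test: send one fake sensor reading to ourselves.
listener = open_udp_listener()
port = listener.getsockname()[1]
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"sensor-1,22.5", ("127.0.0.1", port))
events = read_datagrams(listener, 1)
```

In a real flow, each datagram would then be handed to the filtering and transform stages rather than collected in a list.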
NiFi then let you do decent filtering and transformation. It was designed to scale for both raw ingestion and ETL, so you could shift workloads and expand capacity on demand. It had a module that would reconfigure AWS EC2 scaling directly, provisioning what it wanted rather than praying that AWS scaling rules could figure it out.
It could shift resources dynamically between ingress and ETL because it decided where the code ran, and if it needed more it would bring more cluster members online; but new capacity was not going to be available within the next minute, more like five.
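The core of that kind of self-provisioning is a sizing rule: given the current backlog, how many workers do we need? A minimal sketch, with hypothetical names and numbers (the real module talked to EC2; a real controller would feed this result to the cloud API, e.g. an Auto Scaling group's desired capacity, instead of relying on static alarm rules):

```python
import math

def desired_workers(backlog_events, events_per_worker_min, drain_minutes,
                    min_workers=2, max_workers=50):
    """Hypothetical sizing rule: enough workers to drain the current
    backlog within `drain_minutes`, clamped to a sane range. Because
    new nodes take ~5 minutes to come online, drain_minutes should be
    chosen with that provisioning lag in mind."""
    needed = math.ceil(backlog_events / (events_per_worker_min * drain_minutes))
    return max(min_workers, min(max_workers, needed))

# One million backlogged events, 50k events/worker/minute, drain in 5 min:
size = desired_workers(1_000_000, 50_000, 5)
```

The clamp matters in both directions: a floor keeps the pipeline warm when traffic dips, and a ceiling stops a bad feed from provisioning the whole region.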
So I worked with two teams on NiFi: one tuning the ingestion front end and one handling the ETL stuff. I was always bouncing between them, because once we found the latest bottleneck it was on to the next thing needing attention.
Then there was the big data team. I had worked with the lead at another gig, so we already had a decent friendship, and he had built a great team. They were storing data in Cassandra, running a 2,400-node, 800-shard environment just for production.
It wasn't just the data, because the same nodes carried the infrastructure for Apache Spark. I forget the name of the more generic worker model underneath it, but Spark put a functional programming model over languages like Python, C++, and a few others.
So I reached into the back shelf of my mind for my C++ skills, along with everything I had learned about functional programming at the previous gig from that same lead, and put it to use in a highly distributed environment where your code is brought to the data rather than the data being brought to the code.
It was a fun learning experience to structure the data in the way that best fit the analytics patterns.
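The "code goes to the data" idea can be illustrated in miniature (in Python rather than the C++ we actually wrote, with the cluster machinery stubbed out as plain lists): ship a function to each partition, run it where the data lives, and only move the small per-partition results back.

```python
from functools import reduce

# Toy stand-in for a partitioned dataset: in a real cluster each inner
# list lives on a different node, and only the closure going out and the
# small partial results coming back ever cross the network.
partitions = [
    [("sensor-1", 21.0), ("sensor-2", 35.5)],
    [("sensor-1", 22.4), ("sensor-3", 18.9)],
]

def map_partition(part):
    """Runs where the data lives: filter and transform locally."""
    return [temp for _sensor, temp in part if temp > 20.0]

def combine(a, b):
    """Merge the small partial results shipped back from each node."""
    return a + b

hot_readings = reduce(combine, (map_partition(p) for p in partitions), [])
```

Structuring the stored data so each partition can answer its share of a query locally is exactly what made the analytics patterns fast.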
The resulting synthetics were stored in another dataset within Cassandra, but on a smaller subset of nodes, so we could improve performance for the next layer downstream.
The downstream workload ran on Mesosphere, which became DC/OS: a container scheduler along the lines of AWS ECS, handling all of the EC2 build details. Like NiFi, it could adjust cluster size based on workload.
I was mostly involved at different levels with each of the teams in that layer. Some were well skilled,