how to handle data at scale?
That’s the most pertinent question these days. Unfortunately, no single answer suffices. Here’s what we do at my workplace:
- identify available sources of data
- categorize by type, pricing and other metadata
- determine scope of data
- configure a virtual link to sources
- expose via a web API
- build custom APIs as needed
identify data
Many teams shortcut or even skip this part. I feel that's a mistake, unless there are only a few categories and/or sources. For those able, this is the best non-techie task for many techies, so why not start there?
The company or management usually has a wish-list. First, filter that down to what's available and what isn't, and for what reason.
Another good task is cataloguing the method of access for each data source.
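To make that concrete, here is a minimal sketch of what one inventory entry might capture: availability, the reason behind it, the access method, and a contact. Every field name and value is an illustrative assumption, not a prescribed schema.

```python
# A minimal sketch of one inventory entry for a candidate data source.
# All field names and values are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceEntry:
    name: str               # human-readable name of the source
    available: bool         # did it survive the wish-list filter?
    reason: Optional[str]   # why it is, or isn't, available
    access_method: str      # e.g. "rest-api", "sftp-dump", "db-link"
    contact: Optional[str]  # who can grant or explain access

# Example: a wish-list item that turned out not to be available.
firehose = SourceEntry(
    name="twitter-firehose",
    available=False,
    reason="pricing tier out of budget",
    access_method="rest-api",
    contact="vendor-sales@example.com",
)
```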
categorize by metadata
How can the data be found or searched for? Brainstorm the criteria for locating it, if a list doesn't already exist. Maybe it's the flavor of social media the data comes from or relates to. Contacts or other means of getting access should definitely be part of this.
Whether the data is publicly available should be recorded as well.
Easiest, best-case scenario: include the public data API URI.
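As a sketch of what those metadata records might look like, here is a small Python catalog with one search criterion wired up. The field names (flavor, tags, public_uri, contacts) and the example source are assumptions for illustration, not a fixed schema.

```python
# A sketch of a searchable metadata catalog; fields are assumptions
# drawn from the criteria discussed above.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourceMetadata:
    name: str
    flavor: str                       # e.g. which social-media platform
    tags: List[str] = field(default_factory=list)
    public: bool = False
    public_uri: Optional[str] = None  # best case: a public data API URI
    contacts: List[str] = field(default_factory=list)

def search(catalog: List[SourceMetadata], tag: str) -> List[SourceMetadata]:
    """Locate sources by one of the brainstormed criteria: a tag."""
    return [m for m in catalog if tag in m.tags]

catalog = [
    SourceMetadata(
        name="mastodon-public-timeline",
        flavor="social-media",
        tags=["posts", "public"],
        public=True,
        public_uri="https://mastodon.example/api/v1/timelines/public",
        contacts=["data-team@example.com"],
    ),
]
print([m.name for m in search(catalog, "posts")])
```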
determine what to include
Everything is not an option. Use the metadata to set priorities. The smaller or more finely divided the scope, the better.
Select no more than a handful. How many depends on both the business and the technology. Is there a set of sources that makes the aggregate meaningful? In what context?
Tech might want to run a threshold test on something, given the supposedly high volume of data.
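A hypothetical way to turn that metadata into priorities: score each source and keep only a handful. The weights below are invented for illustration; real ones would come from the business and technology constraints just mentioned.

```python
# A hypothetical prioritization pass over source metadata records
# (plain dicts here). The scoring weights are made up for illustration.
from typing import List

def score(meta: dict) -> float:
    s = 0.0
    if meta.get("public"):
        s += 2.0  # public data is cheap to start with
    s += len(meta.get("tags", [])) * 0.5  # richer metadata is easier to find
    if meta.get("expected_volume") == "high":
        s += 1.0  # a candidate for the high-volume threshold test
    return s

def shortlist(sources: List[dict], limit: int = 5) -> List[dict]:
    """Select no more than a handful, highest score first."""
    return sorted(sources, key=score, reverse=True)[:limit]
```

Sorting on a single score keeps the shortlist explainable: anyone can see why a source made the cut.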
set up the virtual link
Finally, we get to the actual implementation. For the first run, do the simplest source, or maybe the smallest. Continue adding source definitions and/or mappings after at least the first one is exposed via a web service.
The sheer quantity of sources will take some time to configure, depending on how organized the endpoints need to be.
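To show the shape of this, here is a minimal sketch of one virtualized source exposed through a web service, using Flask and requests. The upstream URL, the field mapping, and the assumption that the upstream returns a JSON list are all placeholders, not our actual configuration.

```python
# A minimal sketch: expose one virtualized source via a web service.
# SOURCES maps our endpoint names to upstream definitions; add more
# entries after the first one works.
import requests
from flask import Flask, jsonify

app = Flask(__name__)

SOURCES = {
    "posts": {
        # Placeholder upstream, assumed to return a JSON list of objects.
        "upstream": "https://mastodon.example/api/v1/timelines/public",
        "fields": {"id": "id", "text": "content"},  # our name -> theirs
    },
}

@app.route("/data/<source>")
def get_data(source: str):
    defn = SOURCES.get(source)
    if defn is None:
        return jsonify({"error": "unknown source"}), 404
    upstream = requests.get(defn["upstream"], timeout=10)
    rows = upstream.json()
    # Apply the field mapping so clients see one consistent shape.
    mapped = [
        {ours: row.get(theirs) for ours, theirs in defn["fields"].items()}
        for row in rows
    ]
    return jsonify(mapped)

if __name__ == "__main__":
    app.run()  # development server only
```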
customize as needed
No first release candidate is going to be perfect. Different clients need different things. They may need the data as a SQL service rather than a web service, or in different representations to fit varying modes of consumption.
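As one example of serving different representations, the same endpoint could honor the client's Accept header, returning CSV for clients that ask for it and JSON otherwise. This is a sketch under the assumption that all rows share the same keys; a SQL-service front end would be a separate effort.

```python
# Sketch: two representations from one endpoint via the Accept header.
import csv
import io
from flask import Flask, Response, jsonify, request

app = Flask(__name__)
ROWS = [{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}]  # stand-in data

@app.route("/data/posts")
def posts():
    if "text/csv" in request.headers.get("Accept", ""):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=ROWS[0].keys())
        writer.writeheader()
        writer.writerows(ROWS)
        return Response(buf.getvalue(), mimetype="text/csv")
    return jsonify(ROWS)  # default representation
```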
reap the benefits
This is so open-ended that I can only say it depends on how the implementation went, as well as the throughput the sources allow.
There is another side, however: we need usage and performance metrics. How will we tell useless skeletons from data with some meat? How does a dataset grow more attractive over time, or lose its utility?
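A minimal sketch of the kind of instrumentation that would answer those questions: per-source hit counts and latency, recorded around each fetch. The names and the ranking heuristic are assumptions, not an existing metrics stack.

```python
# Sketch: per-source usage and performance metrics. Persistently low
# hit counts suggest a useless skeleton; trends over time show a source
# gaining attraction or losing utility.
import time
from collections import defaultdict

hits = defaultdict(int)         # how often each source is requested
latency_ms = defaultdict(list)  # how well each source keeps up

def record(source: str, fetch):
    """Wrap a fetch call, recording usage and performance for `source`."""
    start = time.monotonic()
    result = fetch()
    latency_ms[source].append((time.monotonic() - start) * 1000)
    hits[source] += 1
    return result

def utility_report():
    """Rank sources by usage, most-requested first."""
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)
```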