JSON as a Temporary Database: How We Scaled Data Prototyping

Every startup faces the same chicken-and-egg problem: you need users to gather meaningful data, but you need data to make informed decisions about what will attract users. Navigating that tension is a strategic challenge for any growing company.

At Til, our guitar lesson marketplace, we hit this exact wall after our initial product launch. We had built a solid platform with video lessons, scheduling tools, and payment processing—but our user acquisition was still heavily reliant on word of mouth and direct outreach.

To break through this ceiling, our CEO decided we should invest in SEO as a sustainable growth channel. The strategy? Generate thousands of location-based landing pages targeting searches like "guitar lessons in [City]" to capture organic traffic from potential students across the country.

I was tasked with leading this project. The goal was conceptually simple but technically nuanced: collect location data, build a flexible system for managing it, and create a pipeline that would power our SEO strategy. The challenge was finding a balance between moving quickly and building something that could adapt to our rapidly evolving needs.

The Problem: Data Fragmentation and Evolving Requirements

The project presented two distinct technical challenges:

  1. Fragmented Data Sources

    We needed comprehensive information about US cities including names, states, populations, coordinates, and time zones. No single dataset contained all the information we required—some had accurate population figures but lacked coordinates, while others had detailed geographic information but missing administrative data.

    To make matters more complex, we needed to match cities across datasets that sometimes used different naming conventions or abbreviations.

  2. Evolving Data Requirements

    Since this was our first major SEO initiative, our understanding of what data we needed was constantly evolving. The marketing team was still refining their strategy based on keyword research, and we were uncertain about which city attributes would be most valuable for targeting or page generation.

    Creating a rigid database schema upfront would inevitably lead to migrations, backfills, and potential errors as requirements changed. We needed a more flexible approach that could adapt to our learning process.

Proposed Solution: JSON as a Flexible Database

After discussing options with our CTO, we decided that using JSON files as a temporary database would provide the perfect balance of structure and flexibility. This approach would give us:

  • Adaptability: We could easily modify our data structure as requirements evolved without complex migrations
  • Type safety: With TypeScript, we could still maintain strong typing despite the flexible format
  • Simplicity: The entire team could easily understand and modify the data if needed
  • Version control: Changes to our dataset could be tracked in git alongside our code
  • Performance: For our dataset size (thousands, not millions of records), JSON files would be perfectly efficient

This wasn't a permanent solution—eventually we'd migrate to a proper database once our data model stabilized—but it was ideal for our prototyping phase.
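
To make the type-safety point concrete, here is a rough sketch of the kind of helper this approach enables. It is illustrative only, not the exact code we used: a JSON file is read, parsed, and treated as a typed array, with a minimal sanity check on its shape.

import fs from 'fs/promises'

// Illustrative helper: read a JSON file and treat its contents as a typed array.
// The cast is only as trustworthy as the file itself, hence the minimal check.
async function readJsonArray<T>(filePath: string): Promise<T[]> {
  const raw = await fs.readFile(filePath, { encoding: 'utf8' })
  const parsed = JSON.parse(raw)
  if (!Array.isArray(parsed)) {
    throw new Error(`Expected a JSON array in ${filePath}`)
  }
  return parsed as T[]
}

A call site can then do something like const cities = await readJsonArray<City>(path) and get autocompletion and compile-time checks on every field, even though the underlying storage is just a file.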

Our implementation plan consisted of five steps:

  1. Create a /data-aggregation directory to store all source files and output (a rough layout sketch follows this list)
  2. Read and parse data from multiple source files
  3. Define a TypeScript interface for our desired output
  4. Merge and transform the datasets into our unified format
  5. Write the consolidated data to a new JSON output file
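
For reference, the directory ended up looking roughly like this. Apart from opendata-all-us-cities.json and cities.json, the file names are illustrative placeholders for the path constants (ALL_CITIES_PATH, RELEASED_CITIES_PATH) used in the scripts below:

data-aggregation/
  opendata-all-us-cities.json   // raw OpenData city export (source)
  cities.json                   // internal city dataset (source)
  all-cities.json               // consolidated output (ALL_CITIES_PATH, name illustrative)
  released-cities.json          // cities released for page generation (RELEASED_CITIES_PATH, name illustrative)
  generate-cities.ts            // aggregation and release scripts (name illustrative)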

Implementation

Reading Files in Node.js

Node.js offers several approaches for reading files, each with different performance characteristics:

  • fs.readFile: Reads the entire file into memory
  • fs.createReadStream: Processes the file in chunks, ideal for large files
  • Synchronous versions like fs.readFileSync: Blocks execution until completed

For our prototype, I opted for the Promise-based fs.readFile from the fs/promises module, which provided a clean async/await syntax:

import path from 'path'
import fs from 'fs/promises'

const OPENDATA_CITIES_PATH = path.resolve(
  __dirname,
  `opendata-all-us-cities.json`,
)
const DB_CITIES_PATH = path.resolve(__dirname, `cities.json`)

async function generateUSCitiesData() {
  try {
    // Read data from source files
    const [opendataCitiesData, dbCitiesData] = await Promise.all([
      fs.readFile(OPENDATA_CITIES_PATH, { encoding: 'utf8' }),
      fs.readFile(DB_CITIES_PATH, { encoding: 'utf8' }),
    ])

    // Parse JSON data
    const opendataCities = JSON.parse(opendataCitiesData)
    const dbCities = JSON.parse(dbCitiesData)

    console.log('Read JSON files successfully:', {
      opendataCitiesCount: opendataCities.length,
      dbCitiesCount: dbCities.length,
    })

    // Further processing will happen here...
  } catch (error) {
    throw new Error(`Error reading city data: ${error}`)
  }
}

By using Promise.all, we could read both files concurrently, optimizing our script's performance.

Defining a Type-Safe Data Structure

The next step was defining a TypeScript interface that would represent our unified city data structure:

export type City = {
  type: 'City'
  geoname_id: string // Unique identifier from OpenData
  name: string // City name
  state: string // Full state name
  country: string // Country name
  slug: string // URL-friendly version of city name
  timezone: string // Timezone identifier
  latitude: number // Geographic coordinates
  longitude: number
  population: number // Population count
  country_code: string // ISO country code
  state_id: number // State identifier from internal dataset
  state_code: string // State abbreviation
  nearbyLessons: Place[] | null // Related lessons (null if not yet queried)
}

This structure combined elements from both datasets and added our own fields like slug for URL generation and nearbyLessons for storing related content.
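
One helper worth calling out is getCitySlug, which appears in the merging code below but isn't shown in full in this post. A minimal sketch might look like the following; the exact fields used and the slug format are assumptions:

// Illustrative sketch of a slug helper; the real implementation may differ
function getCitySlug(city: { name: string; admin1_code: string }): string {
  return `${city.name}-${city.admin1_code}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // Collapse non-alphanumeric runs into hyphens
    .replace(/^-+|-+$/g, '') // Trim leading and trailing hyphens
}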

Merging and Transforming Datasets

The core of our implementation was merging data from different sources into a unified structure:

async function generateUSCitiesData() {
  try {
    // ... file reading code from previous example ...

    const citiesData: City[] = [];

    // Use a Set for efficient lookups
    const existingCityIds = new Set(dbCities.map((x) => x.geoname_id));

    for (const city of opendataCities) {
      // If this city already exists in our internal dataset, reuse the existing record
      if (existingCityIds.has(city.geoname_id)) {
        const existingCity = dbCities.find((x) => x.geoname_id === city.geoname_id);
        if (existingCity) {
          citiesData.push(existingCity);
        }
        continue;
      }

      // Find matching city in our other dataset by name and location
      const matchingCity = dbCities.find(
        (x) =>
          x.name.toLowerCase() === city.name.toLowerCase() &&
          x.country_code === city.country_code &&
          x.state_code.toLowerCase() === city.admin1_code.toLowerCase(),
      );

      if (!matchingCity) continue; // Skip if no match found

      // Create merged city record with our desired structure
      const mergedCity: City = {
        type: 'City',
        geoname_id: city.geoname_id,
        name: city.name,
        state: matchingCity.state_name,
        country: city.country,
        slug: getCitySlug(city),
        timezone: city.timezone,
        latitude: city.coordinates.lat,
        longitude: city.coordinates.lon,
        population: city.population,
        country_code: city.country_code,
        state_code: city.admin1_code,
        state_id: matchingCity.state_id,
        nearbyLessons: null, // Will be populated later
      };

      citiesData.push(mergedCity);
    }

    // Sort cities by population (largest first)
    const sortedCities = citiesData.sort((a, b) => b.population - a.population);

    console.log(`Generated data for ${sortedCities.length} cities`);

    // Further processing will happen here...
  } catch (error) {
    throw new Error(`Error merging city data: ${error}`);
  }
}

Some key optimization techniques I used:

  • Creating a Set of existing IDs for O(1) lookups instead of repeated array searches
  • Using toLowerCase() to ensure case-insensitive matching between datasets
  • Implementing early continue statements to skip unnecessary processing
  • Sorting by population to prioritize major cities

Writing the Consolidated Data

With our merged dataset prepared, the final step was writing it to a new JSON file:

// ... continuing from previous example ...

// Write the merged data to our output file
// (ALL_CITIES_PATH is another path constant, not shown above, pointing at the consolidated output)
await fs.writeFile(
  ALL_CITIES_PATH,
  JSON.stringify(sortedCities, null, 2), // Pretty-print with 2-space indentation
)

console.log(
  `Successfully wrote ${sortedCities.length} cities to ${ALL_CITIES_PATH}`,
)

We used the same fs/promises module for writing, with pretty-printing enabled to make the file human-readable during development.
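
For completeness, running the aggregation was just a matter of executing the script as a one-off Node.js process, along these lines (the entry point file name and ts-node invocation are assumptions about our setup):

// Hypothetical entry point, e.g. `npx ts-node data-aggregation/generate-cities.ts`
generateUSCitiesData()
  .then(() => console.log('City data generation complete'))
  .catch((error) => {
    console.error(error)
    process.exit(1)
  })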

Releasing Cities to Production

The second part of our project was selecting which cities to "release" for page generation. We needed to:

  1. Filter cities based on population and available lesson content
  2. Query an external API to find nearby lessons for each city
  3. Create a subset of the most promising cities for our initial release

Here's how we implemented this logic:

async function generateReleasedCitiesData() {
  try {
    // Read the full cities dataset
    const citiesData = JSON.parse(
      await fs.readFile(ALL_CITIES_PATH, { encoding: 'utf8' }),
    )

    // Load the cities that have already been released
    const alreadyReleasedCities = JSON.parse(
      await fs.readFile(RELEASED_CITIES_PATH, { encoding: 'utf8' }),
    )

    // Track which cities are already released (a Set gives O(1) lookups)
    const alreadyReleasedIds = new Set(
      alreadyReleasedCities.map((city) => city.geoname_id),
    )

    // Start with currently released cities
    const releasedCities = [...alreadyReleasedCities]

    // Filter for new candidate cities (not too large, not already released)
    const candidateCities = citiesData
      .filter((city) => {
        return (
          city.population < 500000 && !alreadyReleasedIds.has(city.geoname_id)
        )
      })
      .sort((a, b) => b.population - a.population)

    // Process cities sequentially
    for (const city of candidateCities) {
      // Limit the total number of cities we process
      if (releasedCities.length >= 1000) break

      // Find nearby lessons using external API
      const nearbyLessons = await getNearbyLessons({
        locationName: city.name,
        coordinates: {
          latitude: city.latitude,
          longitude: city.longitude,
        },
      })

      // Only include cities with sufficient nearby content
      if (nearbyLessons.length >= 2) {
        const cityWithLessons = {
          ...city,
          nearbyLessons,
        }
        releasedCities.push(cityWithLessons)
      }
    }

    // Write the final list of released cities
    await fs.writeFile(
      RELEASED_CITIES_PATH,
      JSON.stringify(releasedCities, null, 2),
    )

    console.log(`Released ${releasedCities.length} cities for page generation`)
  } catch (error) {
    throw new Error(`Error generating released cities: ${error}`)
  }
}
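
The getNearbyLessons call wraps an external search API whose details aren't covered in this post. Conceptually it looks something like the sketch below; the endpoint, parameters, and response shape are placeholders, not our actual integration:

// Placeholder types and endpoint; illustrative only
type Place = { id: string; name: string; latitude: number; longitude: number }

async function getNearbyLessons(params: {
  locationName: string
  coordinates: { latitude: number; longitude: number }
}): Promise<Place[]> {
  const { latitude, longitude } = params.coordinates
  // Uses the global fetch available in Node 18+
  const url = `https://api.example.com/lessons/search?lat=${latitude}&lon=${longitude}&radius=50`
  const response = await fetch(url)
  if (!response.ok) {
    throw new Error(`Lesson search failed for ${params.locationName}: ${response.status}`)
  }
  return (await response.json()) as Place[]
}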

Refining the City Release Process

The initial implementation worked, but as with any rapid prototype, there was room for improvement. After getting a working version, I focused on optimizing our approach:

import pMap from 'p-map'

async function generateReleasedCitiesData() {
  try {
    // Load existing data
    const citiesData = JSON.parse(
      await fs.readFile(ALL_CITIES_PATH, { encoding: 'utf8' }),
    )

    const existingReleasedCities = JSON.parse(
      await fs.readFile(RELEASED_CITIES_PATH, { encoding: 'utf8' }),
    )

    // Define clear thresholds as constants instead of magic numbers
    const MAX_POPULATION_THRESHOLD = 500000
    const MIN_NEARBY_LESSONS = 2
    const MAX_CITIES_TO_PROCESS = 1500
    const MAX_CITIES_TO_RELEASE = 1000
    const CONCURRENCY_LIMIT = 5

    // Track existing cities with a Set for O(1) lookups
    const releasedCityIds = new Set(
      existingReleasedCities.map((c) => c.geoname_id),
    )

    // Filter candidates more efficiently
    const candidateCities = citiesData
      .filter(
        (city) =>
          city.population < MAX_POPULATION_THRESHOLD &&
          !releasedCityIds.has(city.geoname_id),
      )
      .sort((a, b) => b.population - a.population)
      .slice(0, MAX_CITIES_TO_PROCESS)

    console.log(`Processing ${candidateCities.length} candidate cities`)

    // Process cities in parallel with controlled concurrency
    const processedCities = await pMap(
      candidateCities,
      async (city) => {
        const nearbyLessons = await getNearbyLessons({
          locationName: city.name,
          coordinates: { latitude: city.latitude, longitude: city.longitude },
        })

        if (nearbyLessons.length >= MIN_NEARBY_LESSONS) {
          return { ...city, nearbyLessons }
        }
        return null
      },
      { concurrency: CONCURRENCY_LIMIT },
    )

    // Keep the cities that passed the lesson check, capped so the combined
    // total stays within MAX_CITIES_TO_RELEASE
    const newReleasedCities = processedCities
      .filter(Boolean)
      .slice(0, Math.max(0, MAX_CITIES_TO_RELEASE - existingReleasedCities.length))

    // Combine existing and new cities
    const releasedCities = [...existingReleasedCities, ...newReleasedCities]

    console.log(
      `Adding ${newReleasedCities.length} new cities for a total of ${releasedCities.length}`,
    )

    // Write the final dataset
    await fs.writeFile(
      RELEASED_CITIES_PATH,
      JSON.stringify(releasedCities, null, 2),
    )

    return releasedCities
  } catch (error) {
    throw new Error(`Error generating released cities: ${error}`)
  }
}

The key improvements in this iteration:

  • Replaced magic numbers with named constants for better readability and maintainability
  • Implemented parallel processing with pMap to handle multiple API requests concurrently
  • Added more detailed logging to track progress during long-running operations
  • Used Sets for efficient lookups when checking if a city was already processed

What I'd Do Differently Next Time

While the solution worked well for our immediate needs, there are several improvements I'd make if implementing a similar system today:

Use a Single Source of Truth with Status Flags

Instead of creating separate files for all cities and released cities, I'd maintain a single dataset with status flags:

type City = {
  // ...existing fields
  status: 'pending' | 'released' | 'excluded'
  releaseDate?: string
}

This approach would eliminate the need to maintain two separate files that could potentially get out of sync, while still allowing us to filter cities based on their status.
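
With that shape in place, "releasing" a city becomes a status update on the single dataset rather than a copy into a second file. A minimal sketch, where citiesData is the loaded array of City records:

// Derive the released subset from the single dataset
const releasedCities = citiesData.filter((city) => city.status === 'released')

// Releasing a city is just a status change plus a timestamp
const releaseCity = (city: City): City => ({
  ...city,
  status: 'released',
  releaseDate: new Date().toISOString(),
})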

Leverage Node.js Streams for Large Files

Although our files weren't large enough to cause performance issues, a more scalable approach would be to process the data with Node.js streams, for example reading a newline-delimited JSON file (one record per line) line by line:

import { createReadStream } from 'fs'
import { createInterface } from 'readline'

// Read and process a newline-delimited JSON file line by line
const processLargeFile = async (filePath: string) => {
  const fileStream = createReadStream(filePath)
  const rl = createInterface({
    input: fileStream,
    crlfDelay: Infinity,
  })

  const processedData = []
  for await (const line of rl) {
    if (!line.trim()) continue // Skip blank lines
    const item = JSON.parse(line)
    // Transform the parsed record as needed, then collect it
    processedData.push(item)
  }

  return processedData
}

This streaming approach can handle files of any size without loading everything into memory at once, provided the source stores one JSON record per line rather than a single large array.

Implement an Automated Release Pipeline

Rather than manually triggering city releases, I would build an automated pipeline:

// In a scheduled task function (loadCitiesData, updateCitiesData, generatePages,
// hasSufficientNearbyContent, and the constants are placeholder helpers)
export async function scheduledCityRelease() {
  // Load current cities data
  const citiesData = await loadCitiesData()

  // Find eligible cities
  const eligibleCities = citiesData
    .filter((city) => city.status === 'pending')
    .slice(0, BATCH_SIZE)

  // Process in parallel
  const processedCities = await pMap(
    eligibleCities,
    async (city) => {
      const nearbyLessons = await getNearbyLessons(city)
      if (hasSufficientNearbyContent(nearbyLessons)) {
        return {
          ...city,
          status: 'released',
          releaseDate: new Date().toISOString(),
          nearbyLessons,
        }
      }
      return {
        ...city,
        status: 'excluded',
        nearbyLessons,
      }
    },
    { concurrency: CONCURRENCY_LIMIT },
  )

  // Update cities data and save
  await updateCitiesData(processedCities)

  // Trigger page generation for newly released cities
  await generatePages(processedCities.filter((c) => c.status === 'released'))

  return processedCities
}

This automated approach would eliminate manual intervention and ensure consistent, regular releases.
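
How the task gets triggered is a separate choice (cron, a CI workflow, or a serverless scheduler). As one option, assuming a long-running Node.js process and the node-cron package, the schedule could be as simple as:

import cron from 'node-cron'

// Run the release pipeline every Monday at 03:00 (schedule is illustrative)
cron.schedule('0 3 * * 1', async () => {
  try {
    const processed = await scheduledCityRelease()
    console.log(`Scheduled release processed ${processed.length} cities`)
  } catch (error) {
    console.error('Scheduled city release failed:', error)
  }
})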

Conclusion

The JSON-as-a-database approach proved to be the right solution for our needs at that stage. It gave us the flexibility to evolve our data model while maintaining type safety, versioning, and developer productivity.

What began as a simple data aggregation project taught me valuable lessons about balancing flexibility with structure, especially in early-stage products where requirements evolve rapidly. Sometimes, the "textbook" solution isn't the right one for your specific context and constraints.

This project reinforced my understanding that technical decisions are rarely just about the code itself—they're about finding the right balance between immediate needs and future flexibility. By prioritizing adaptability, we built a system that could evolve alongside our understanding of what data mattered most for our SEO strategy.

In the months that followed, our approach paid off as we saw organic traffic increase by 40%, validating both the technical implementation and the broader strategy.