Validated Jobs Scraper: Dedup, No Ghost Jobs, Confidence Scored

Under maintenance

Pricing

$2.50 / 1,000 job results

Job data with a correctness guarantee: per-field confidence, ghost-job filtering and cross-source dedup — never silently wrong, duplicated or expired. Reaches LinkedIn and ATS boards cookieless, gets through Cloudflare/DataDome, self-healing on layout shifts. Built on the data.hilgard.cz engine.

Rating: 0.0 (0)

Developer: Jan Hilgard

Maintained by Community

Actor stats: 0 bookmarked, 3 total users, 2 monthly active users, last modified 2 days ago

Validated Jobs Scraper

Job data that is deduplicated, ghost-filtered and never silently wrong.

You give it a source and a query (an ATS company, or keywords for LinkedIn / Indeed). You get back clean job records — title, company, location, salary, employment and workplace type — and on top of every record a correctness layer: a confidence on each field, a single reliable flag, a ghost-job score, and cross-source dedup. When a field does not clear the bar you get a null with a stated reason (status: "absent"), not a guessed value. When the same opening sits on several boards you get it once, with the others listed under also_at. That guarantee is the product; the engine underneath is only how it is kept.

Most job scrapers optimise for how many fields they return. That is the easy part. The hard part — and the expensive one when it goes wrong — is a job that is silently duplicated, already filled, reposted for the tenth time, or quietly mis-parsed. This one is built around that: it scores its own confidence per field, flags ghost jobs, and collapses duplicates across sources, so what you load is what is real.

Keywords: no silent errors, per-field confidence, validated jobs, ghost job detection, deduplicated jobs, hiring intent, cookieless jobs scraper, linkedin jobs scraper, indeed jobs scraper, ats greenhouse lever ashby, job posting api, labor market data, hiring intent signals, sales prospecting jobs.


What a real run looks like

One query in, one row per job out — each in the engine's snake_case schema, with the per-field confidence and the enrichment block attached. Below is a real record from a Greenhouse run (the two description_* fields are large and omitted here for brevity; every enrichment field is wrapped as { value, status, score, source, model }).

(a) A reliable record — and where the page has no salary, it says so instead of guessing.

{
  "source": "greenhouse",
  "canonical_url": "https://job-boards.greenhouse.io/gitlab/jobs/8565469002",
  "apply_url": "https://job-boards.greenhouse.io/gitlab/jobs/8565469002",
  "title": "AI Engineer",
  "company": { "name": "GitLab", "url": null },
  "location": { "raw": "Remote, US", "city": null, "region": null, "country": null, "workplace_type": "remote" },
  "employment_type": null,
  "date_posted": "2026-05-29",
  "salary": { "min": null, "max": null, "currency": null, "period": null, "source": "absent" },
  "seniority": null,
  "skills": [],
  "reliable": true,
  "overall_confidence": 0.95,
  "fields": {
    "title": { "value": "AI Engineer", "status": "confirmed", "score": 0.95, "source": "ats" },
    "company": { "value": "GitLab", "status": "confirmed", "score": 0.95, "source": "ats" },
    "location": { "value": "Remote, US", "status": "confirmed", "score": 0.95, "source": "ats" },
    "salary": { "value": null, "status": "absent", "score": null, "source": null },
    "employment_type": { "value": null, "status": "absent", "score": null, "source": null }
  },
  "also_at": [],
  "enrichment": {
    "quality": {
      "ghost_job_score": { "value": 0.1, "status": "low", "score": 0.5, "source": "inferred", "model": "qwen3.6-35b@1" },
      "flags": { "value": [], "status": "absent", "score": null, "source": "inferred", "model": null },
      "is_real": { "value": true, "status": "low", "score": 0.5, "source": "inferred", "model": "qwen3.6-35b@1" }
    },
    "dedup": {
      "is_duplicate_of": { "value": null, "status": "absent", "score": null, "source": "inferred", "model": null },
      "sources_seen": { "value": ["greenhouse"], "status": "confirmed", "score": 0.95, "source": "inferred", "model": null }
    },
    "normalized": {
      "role_normalized": { "value": "AI Engineer", "status": "low", "score": null, "source": "inferred", "model": "qwen3.6-35b@1" },
      "seniority_normalized": { "value": "Mid", "status": "low", "score": null, "source": "inferred", "model": "qwen3.6-35b@1" },
      "skills": {
        "value": [
          { "name": "Python", "type": "nice_to_have", "score": 0.9 },
          { "name": "TypeScript", "type": "nice_to_have", "score": 0.9 },
          { "name": "LLMs", "type": "nice_to_have", "score": 0.9 }
        ],
        "status": "high", "score": 0.9, "source": "inferred", "model": "qwen3.6-35b@1"
      }
    },
    "hiring_intent": {
      "buying_signal_score": { "value": 0.11, "status": "low", "score": 0.45, "source": "inferred", "model": null },
      "company_signals": { "value": { "open_roles_count": 1, "role_velocity": 0.03, "expanding_departments": [] }, "status": "low", "score": 0.45, "source": "inferred", "model": null }
    }
  },
  "company_name": "GitLab",
  "location_raw": "Remote, US",
  "ghost_job_score": 0.1,
  "success": true,
  "error": null
}

Note the salary and employment_type: the page did not state them, so they come back status: "absent" with a null value — not a guessed band. The ghost_job_score is low (0.1), so the record is trusted. company_name, location_raw and ghost_job_score at the bottom are flat copies the actor lifts out of the nested objects for the dataset table.
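On the consumer side, the per-field status can gate what you load. The following is a minimal sketch, not part of the actor: `trustedValue` is a hypothetical helper, and treating only `confirmed` and `high` as trusted is an assumption you may want to tune.

```typescript
// Field statuses as listed in this README's Output section.
type FieldStatus = "confirmed" | "high" | "low" | "absent";

interface TrackedField<T> {
  value: T | null;
  status: FieldStatus;
  score: number | null;
  source: string | null;
}

// Hypothetical helper: return the value only when its status is in the accepted set.
function trustedValue<T>(
  field: TrackedField<T> | undefined,
  accepted: readonly FieldStatus[] = ["confirmed", "high"],
): T | null {
  if (!field || field.value === null) return null;
  return accepted.includes(field.status) ? field.value : null;
}

// Two entries copied from the "fields" map of record (a).
const fields: Record<string, TrackedField<string>> = {
  title: { value: "AI Engineer", status: "confirmed", score: 0.95, source: "ats" },
  salary: { value: null, status: "absent", score: null, source: null },
};

const titleValue = trustedValue(fields["title"]);   // "AI Engineer"
const salaryValue = trustedValue(fields["salary"]); // null, because the field is absent
```

Passing `["confirmed"]` as the second argument would make the gate stricter and accept only values the source itself confirmed.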

(b) A ghost-suspect job with thin data. It does NOT guess — it flags and fails loud. (Illustrative record in the same real shape — a live ghost can't be produced on demand.)

{
  "source": "linkedin",
  "canonical_url": "https://www.linkedin.com/jobs/view/...",
  "title": "Marketing Manager",
  "company": { "name": "Stealth Startup", "url": null },
  "location": { "raw": null, "city": null, "region": null, "country": null, "workplace_type": null },
  "employment_type": null,
  "salary": { "min": null, "max": null, "currency": null, "period": null, "source": "absent" },
  "reliable": false,
  "overall_confidence": 0.39,
  "fields": {
    "title": { "value": "Marketing Manager", "status": "high", "score": 0.9, "source": "html" },
    "company": { "value": "Stealth Startup", "status": "low", "score": 0.41, "source": "html" },
    "location": { "value": null, "status": "absent", "score": null, "source": null },
    "salary": { "value": null, "status": "absent", "score": null, "source": null }
  },
  "also_at": [],
  "enrichment": {
    "quality": {
      "ghost_job_score": { "value": 0.78, "status": "low", "score": 0.6, "source": "inferred", "model": "qwen3.6-35b@1" },
      "flags": { "value": ["reposted", "vague_jd"], "status": "high", "score": 0.7, "source": "inferred", "model": "qwen3.6-35b@1" },
      "is_real": { "value": false, "status": "low", "score": 0.6, "source": "inferred", "model": "qwen3.6-35b@1" }
    }
  },
  "ghost_job_score": 0.78,
  "success": true,
  "error": null
}

The second row is where the value shows. A cheaper tool would have returned it as just another clean-looking hit. Here the ghost_job_score is high (0.78), the flags say why (reposted, vague_jd), is_real is false, the weak company field is marked low, and the absent fields are marked absent rather than filled with a guess. (Note: on LinkedIn an empty apply link is normal, so no apply-related flag fires; the no_real_apply flag only fires on ATS boards, where a real apply path is expected.) Turn drop_ghost on and a row like this is removed before it reaches you, and is not charged.
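If you leave drop_ghost off and would rather route suspects to a review queue, the flat ghost_job_score makes that a short client-side step. This is only a sketch: `splitByGhostScore` is a hypothetical helper, the 0.7 default mirrors the ghost_threshold shown in the Input example, and whether the engine itself compares with `>=` or `>` is an assumption.

```typescript
interface GhostRow {
  title: string;
  ghost_job_score: number | null; // flat helper; assumed null when enrichment did not run
}

// Hypothetical client-side split: rows at or above the threshold go to "suspect".
function splitByGhostScore<T extends GhostRow>(
  rows: T[],
  threshold = 0.7,
): { kept: T[]; suspect: T[] } {
  const kept: T[] = [];
  const suspect: T[] = [];
  for (const row of rows) {
    const score = row.ghost_job_score;
    if (score !== null && score >= threshold) suspect.push(row);
    else kept.push(row);
  }
  return { kept, suspect };
}

// Scores taken from records (a) and (b) above.
const rows: GhostRow[] = [
  { title: "AI Engineer", ghost_job_score: 0.1 },
  { title: "Marketing Manager", ghost_job_score: 0.78 },
];

const { kept, suspect } = splitByGhostScore(rows);
console.log(kept.map((r) => r.title), suspect.map((r) => r.title));
```

Rows without a score are kept here rather than dropped, so a run with enrichment off does not silently lose data.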


Why this beats LinkedIn-only and AI scrapers

Adapting to a layout and scraping a lot of fields is table-stakes now — this does both. But scraping is not the same as being right. A scraper can return a job and still hand you one that is duplicated three times, already filled, reposted for months, or quietly mis-parsed — and say nothing. That silent bad row is the one that costs you, because you act on it. The difference here is the guarantee, not the scraping: every field carries its own confidence, every record carries a ghost-job score, duplicates are collapsed across sources, and a cheap enrichment runs on every record — not just a sampled few.

Two things others quietly skip:

  • They return rich fields but zero confidence, and their high success rate needs your cookies. This runs cookieless and still attaches a confidence to every field, so you can tell a solid row from a shaky one without logging in.
  • Their AI enrichment is shallow and expensive because it is API-bound. Ours runs on cheap local inference, so the quality / dedup / normalize / hiring-intent layers run on every record by default, not as a costly add-on on a handful.

Why it is different

  • No silent errors. Every field carries a confidence, the whole record carries a reliable flag. Below the bar a field is returned null with status: "absent" and reliable: false, never a confident-looking wrong value.
  • Ghost-job filtering and cross-source dedup. Each record gets a ghost_job_score with the flags behind it (reposted, evergreen, vague_jd, staffing_agency, perpetual_req, and no_real_apply on ATS), so stale and fake openings are visible — or dropped with drop_ghost. The same opening seen on several sources is collapsed into one record, the rest listed under also_at. You load real, distinct openings, not reposts and duplicates.
  • Cookieless reach / anti-bot. It reaches LinkedIn, Indeed and the major ATS boards without cookies, and gets through heavy protections — Cloudflare, DataDome and similar — that return a challenge page to a plain fetch. (Anti-bot is an arms race, so this is a capability, not a guarantee against any one named vendor.)
  • Self-healing — a mechanism that serves the correctness above, not the headline. When a board changes its markup, the engine re-finds fields by meaning instead of silently breaking on a selector.
  • Hiring-intent signals. Enrichment normalises each role and adds hiring-intent signals (buying_signal_score, company_signals), so the data is usable for prospecting, labor analytics and recruiting research, not just a flat list of postings.

Supported sources

Live-verified sources only — this list is what is actually tested, not a wishlist:

  • Greenhouse
  • Ashby
  • Lever
  • SmartRecruiters
  • RemoteOK
  • LinkedIn
  • Indeed

ATS sources (Greenhouse / Ashby / Lever / SmartRecruiters) take a company slug; LinkedIn, Indeed and RemoteOK take keywords (+ location). A concrete source is required. Indeed sits behind Cloudflare and is reached through the same anti-bot stack, cookieless. Indeed pay is usually an estimate, so it comes back as salary.source: "inferred" (never passed off as an employer-stated figure).
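Downstream, it helps to carry that distinction into whatever you display. A minimal sketch, assuming the salary shape from the example records: `describeSalary` is a hypothetical helper, and the numbers and the "year" period value in the usage line are illustrative, not real output.

```typescript
type SalarySource = "explicit" | "inferred" | "absent";

interface Salary {
  min: number | null;
  max: number | null;
  currency: string | null;
  period: string | null;
  source: SalarySource;
}

// Hypothetical formatter: never shows an inferred figure without labelling it.
function describeSalary(s: Salary): string {
  if (s.source === "absent" || (s.min === null && s.max === null)) return "not stated";
  const range =
    s.min !== null && s.max !== null && s.min !== s.max
      ? `${s.min}-${s.max}`
      : `${s.min ?? s.max}`;
  const unit = [s.currency, s.period].filter((p): p is string => p !== null).join("/");
  const label = s.source === "inferred" ? " (estimate)" : "";
  return `${range}${unit ? " " + unit : ""}${label}`;
}

// Illustrative values only.
const indeedPay = describeSalary({ min: 60000, max: 80000, currency: "USD", period: "year", source: "inferred" });
const noPay = describeSalary({ min: null, max: null, currency: null, period: null, source: "absent" });
console.log(indeedPay); // 60000-80000 USD/year (estimate)
console.log(noPay);     // not stated
```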


Input

{
  "source": "greenhouse", // required: linkedin | indeed | greenhouse | lever | ashby | smartrecruiters | remoteok
  "company": "gitlab", // ATS slug, for greenhouse/lever/ashby/smartrecruiters
  "keywords": "backend engineer", // for LinkedIn / Indeed / RemoteOK search
  "location": "Berlin",
  "title_include": [], "title_exclude": [],
  "employment_type": [], "workplace_type": [],
  "country": [], "language": [], // arrays, e.g. ["DE"], ["en"]
  "posted_within_days": 30,
  "drop_expired": true, "drop_ghost": false,
  "enrich": true,
  "enrich_layers": ["quality", "dedup", "normalize", "hiring_intent"],
  "dedup_across_sources": true,
  "ghost_threshold": 0.7,
  "start": 0, "limit": 25, "max_results": 100, "fetch_all": false,
  "include_description": true // engine fetches full JD HTML/text (on by default; LinkedIn fetch is per-posting)
}

What a source needs depends on the source: ATS boards (Greenhouse / Lever / Ashby / SmartRecruiters) require company; LinkedIn and Indeed require keywords (or location); RemoteOK needs neither (it lists the feed). Choose an ATS source without a company and the actor fails loud early, instead of forwarding a request the engine would reject. Filters and enrichment all run on the engine; the actor just forwards them. max_results caps how many jobs come back, and since you pay per returned job, it caps spend.
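The per-source requirements above can be checked before a run is started. The sketch below is an assumption about how such a pre-flight check might look, not the actor's own validation code; `validateInput` and its messages are hypothetical.

```typescript
type Source =
  | "linkedin" | "indeed" | "greenhouse" | "lever"
  | "ashby" | "smartrecruiters" | "remoteok";

interface ScraperInput {
  source: Source;
  company?: string;
  keywords?: string;
  location?: string;
}

const ATS_SOURCES: readonly Source[] = ["greenhouse", "lever", "ashby", "smartrecruiters"];
const SEARCH_SOURCES: readonly Source[] = ["linkedin", "indeed"];

// Returns a list of problems; an empty list means the stated rules are satisfied.
function validateInput(input: ScraperInput): string[] {
  const problems: string[] = [];
  const has = (s?: string): boolean => typeof s === "string" && s.trim().length > 0;

  if (ATS_SOURCES.includes(input.source) && !has(input.company)) {
    problems.push(`${input.source} requires "company"`);
  }
  if (SEARCH_SOURCES.includes(input.source) && !has(input.keywords) && !has(input.location)) {
    problems.push(`${input.source} requires "keywords" or "location"`);
  }
  // RemoteOK needs neither, so there is nothing to check for it.
  return problems;
}

console.log(validateInput({ source: "greenhouse" }));
console.log(validateInput({ source: "greenhouse", company: "gitlab" }));
```

Returning a list instead of throwing lets a caller report every problem at once; throwing on the first problem would be closer to the fail-loud behaviour described above.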

Output

One dataset row per job, in the engine's snake_case JobPosting schema. Every tracked field is always present — a missing one comes back with status: "absent" and a null value, never silently dropped. Highlights:

  • Core fields: title, company ({name, url}), location ({raw, city, region, country, workplace_type}), employment_type (FULL_TIME / PART_TIME / CONTRACT / INTERN / TEMP), date_posted, salary ({min, max, currency, period, source}; source is explicit / inferred / absent, never guessed), seniority, skills, canonical_url, apply_url.
  • Trust layer: reliable (bool), overall_confidence (0–1), and fields — a map where each tracked field carries { value, status, score, source }, with status ∈ confirmed | high | low | absent.
  • Dedup: top-level also_at[] lists the same opening on other portals ([] if unique).
  • Enrichment (present when enrich is on) under enrichment, each field wrapped as { value, status, score, source: "inferred", model }: quality (ghost_job_score, flags, is_real), dedup, normalized, hiring_intent.
  • Flat helpers the actor adds for the dataset table: company_name, location_raw, and ghost_job_score (lifted from enrichment.quality.ghost_job_score.value).
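The flat helpers in the last bullet are plain copies of nested values, which can be sketched as below. This is an illustration of the mapping only, not the actor's source; `flatHelpers` is a hypothetical name, and returning null for ghost_job_score when the enrichment block is missing is an assumption.

```typescript
interface Wrapped<T> {
  value: T | null;
}

// Only the parts of the record that the flat helpers read.
interface NestedRecord {
  company: { name: string | null };
  location: { raw: string | null };
  enrichment?: { quality?: { ghost_job_score?: Wrapped<number> } };
}

function flatHelpers(r: NestedRecord): {
  company_name: string | null;
  location_raw: string | null;
  ghost_job_score: number | null;
} {
  return {
    company_name: r.company.name,
    location_raw: r.location.raw,
    // Assumed fallback: null when enrichment is off or the quality layer is missing.
    ghost_job_score: r.enrichment?.quality?.ghost_job_score?.value ?? null,
  };
}

// Values taken from record (a).
const flatA = flatHelpers({
  company: { name: "GitLab" },
  location: { raw: "Remote, US" },
  enrichment: { quality: { ghost_job_score: { value: 0.1 } } },
});
console.log(flatA);
```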

success says the engine produced a job record; reliable says that record cleared the trust threshold. They diverge when a job is extracted but the engine is not confident — then success is true, reliable is false, and the per-field scores stay low, so a hedge never reads as a clean hit. Each row is the engine's record passed through verbatim — the actor never drops or rewrites a field. The full job description (description_html / description_text) is included; include_description (on by default) controls whether the engine fetches it for sources fetched per-posting (e.g. LinkedIn). Descriptions can be large.
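The success / reliable split gives three states a loader can branch on. A minimal sketch follows; the state names are hypothetical labels, and the assumption that a failed row still carries a reliable flag is not confirmed by this README.

```typescript
type RowState = "failed" | "hedged" | "reliable";

interface ResultRow {
  success: boolean;
  reliable: boolean; // assumed to be present even when success is false
  error: string | null;
}

// failed: no job record; hedged: record returned but below the trust threshold.
function rowState(row: ResultRow): RowState {
  if (!row.success) return "failed";
  return row.reliable ? "reliable" : "hedged";
}

function countStates(rows: ResultRow[]): Record<RowState, number> {
  const counts: Record<RowState, number> = { failed: 0, hedged: 0, reliable: 0 };
  for (const row of rows) counts[rowState(row)] += 1;
  return counts;
}

const stateCounts = countStates([
  { success: true, reliable: true, error: null },  // same flags as record (a)
  { success: true, reliable: false, error: null }, // same flags as record (b)
]);
console.log(stateCounts);
```

A pipeline might load "reliable" rows directly and send "hedged" rows to review, using the per-field scores to see which fields caused the hedge.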


Pricing

This actor uses Pay Per Event, with a single event:

  • job-result — one flat fee per job returned.

You are charged once per job the engine returns — whether it comes back reliable: true or, honestly, reliable: false, because an honest fail is still a result you can act on (you learn the field is shaky instead of trusting a guess). Jobs the engine drops before returning — expired, ghost (with drop_ghost), or duplicates collapsed across sources — are not returned and not charged. A run that fails before returning any results is not charged. The current price of the event is in the Apify Console pricing tab.
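Because only returned rows are charged, the worst-case cost of a run follows from max_results. A rough sketch, assuming the $2.50 per 1,000 results shown in the listing header still applies (the Console pricing tab is authoritative); `estimateCostUsd` is a hypothetical helper.

```typescript
// Assumption: listing-header price; check the Apify Console for the current value.
const PRICE_PER_1000_USD = 2.5;

// Cost of a run given how many jobs the engine actually returned.
function estimateCostUsd(
  returnedJobs: number,
  pricePer1000Usd: number = PRICE_PER_1000_USD,
): number {
  if (!Number.isInteger(returnedJobs) || returnedJobs < 0) {
    throw new RangeError("returnedJobs must be a non-negative integer");
  }
  return (returnedJobs * pricePer1000Usd) / 1000;
}

// Upper bound for max_results: 100, as in the Input example.
const worstCaseUsd = estimateCostUsd(100);
console.log(worstCaseUsd); // 0.25
```

Dropped rows (expired, ghost with drop_ghost, collapsed duplicates) never reach `returnedJobs`, so the real cost is at or below this bound.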

No flat monthly fee, no per-seat pricing. One predictable price per validated job.


A note on data

This actor returns job and company data only — title, company, location, salary, employment terms, and signals derived from the posting. It does not collect or return personal data of applicants or named recruiters.


About

Built on the data.hilgard.cz engine — the same self-healing stack that does cookieless anti-bot fetching, extraction by meaning, and independent verification, here applied to jobs with ghost-job scoring and cross-source dedup on top.

By Jan Hilgard — founder of Hosting90 (built 2002, exited 2020), contributor to vllm-mlx. The precision-first stance is deliberate: I would rather return an honest "not sure" than a confident wrong row you build a pipeline on.

Development

npm install
npm run build # tsc → dist/
npm run start:dev # tsx src/main.ts, reads .actor/INPUT.json