Test Data Generation

Pointblank provides a built-in test data generation system that creates realistic, locale-aware synthetic data based on schema definitions. This is useful for testing validation rules, creating sample datasets, and generating fixture data for development.

Note: Throughout this guide, we use pb.preview() to display generated datasets with nice HTML formatting. This is optional: pb.generate_dataset() returns a standard DataFrame that you can display or manipulate however you prefer.

Quick Start

Generate test data using a schema with field constraints:

import pointblank as pb

# Define a schema with typed field specifications
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=80),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate 100 rows of test data (seed ensures reproducibility)
pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 5 columns
Columns: user_id (Int64), name (String), email (String), age (Int64), status (String)

1 7188536481533917197 Doris Martin d_martin@aol.com 77 pending
2 2674009078779859984 Nancy Gonzalez nancygonzalez@icloud.com 67 active
3 7652102777077138151 Jessica Turner jturner@aol.com 78 active
4 157503859921753049 George Evans georgeevans@zoho.com 36 inactive
5 2829213282471975080 Patricia Williams pwilliams@outlook.com 75 pending
...
96 7027508096731143831 Isaiah Murphy isaiah.murphy@zoho.com 55 active
97 6055996548456656575 Brittany Rodriguez brodriguez@yandex.com 39 inactive
98 3822709996092631588 Megan Stevens mstevens26@aol.com 24 inactive
99 1522653102058131295 Pamela Jenkins pjenkins29@yandex.com 41 active
100 5690877051669225499 Stephanie Santos stephanie.santos40@gmail.com 75 pending

Field Types

Pointblank provides helper functions for defining typed columns with constraints:

Function           Description       Key Parameters
int_field()        Integer columns   min_val, max_val, allowed, unique
float_field()      Float columns     min_val, max_val, allowed
string_field()     String columns    preset, pattern, allowed, unique
bool_field()       Boolean columns   p_true (probability of True)
date_field()       Date columns      min_val, max_val
datetime_field()   Datetime columns  min_val, max_val
time_field()       Time columns      min_val, max_val
duration_field()   Duration columns  min_val, max_val

Integer Fields

Integer fields support range constraints with min_val and max_val, discrete allowed values with allowed, and uniqueness enforcement with unique=True:

schema = pb.Schema(
    id=pb.int_field(min_val=1000, max_val=9999, unique=True),
    quantity=pb.int_field(min_val=1, max_val=100),
    rating=pb.int_field(allowed=[1, 2, 3, 4, 5]),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns
Columns: id (Int64), quantity (Int64), rating (Int64)

1 5749 100 3
2 2368 38 1
3 1279 11 1
4 6025 3 5
5 7942 76 3
...
96 5330 64 2
97 8634 31 1
98 9982 43 2
99 4221 70 1
100 8520 19 5

The unique=True constraint ensures no duplicate values appear in that column, which is useful for generating primary keys or identifiers.
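Conceptually, unique=True behaves like sampling without replacement from the field's range. Here is a pure-Python sketch of that idea (an illustration, not Pointblank's actual implementation):

```python
import random

# Sampling without replacement guarantees distinct values, which is
# the property unique=True promises for an integer column.
rng = random.Random(23)
ids = rng.sample(range(1000, 10000), k=100)  # 100 distinct values in [1000, 9999]

assert len(set(ids)) == len(ids)             # no duplicates
assert all(1000 <= v <= 9999 for v in ids)   # all within the declared range
```

One consequence worth remembering: the declared range must contain at least n distinct values, or the uniqueness constraint cannot be satisfied.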

Float Fields

Float fields work similarly to integers, with min_val and max_val defining the range of generated values:

schema = pb.Schema(
    price=pb.float_field(min_val=0.0, max_val=1000.0),
    discount=pb.float_field(min_val=0.0, max_val=0.5),
    temperature=pb.float_field(min_val=-40.0, max_val=50.0),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns
Columns: price (Float64), discount (Float64), temperature (Float64)

1 924.8652516259452 0.4624326258129726 43.23787264633508
2 948.6057779931772 0.47430288899658857 45.37452001938594
3 892.4333440485793 0.44621667202428966 40.31900096437214
4 83.55067683068363 0.04177533841534181 -32.48043908523847
5 592.0272268857353 0.29601361344286764 13.282450419716177
...
96 444.6925279641446 0.2223462639820723 0.022327516773010814
97 342.7762214585577 0.17138811072927884 -9.150140068729808
98 892.3288689140903 0.4461644344570452 40.309598202268134
99 813.7559456012128 0.4068779728006064 33.238035104109144
100 895.1816604808429 0.44759083024042146 40.56634944327587

Values are uniformly distributed across the specified range, making this useful for simulating measurements, prices, or any continuous numeric data.

String Fields with Presets

Presets generate realistic data like names, emails, and addresses. When you include related fields like name and email in the same schema, Pointblank ensures coherence (e.g., the email address will be derived from the person’s name), making the generated data more realistic:

schema = pb.Schema(
    full_name=pb.string_field(preset="name"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 4 columns
Columns: full_name (String), email (String), company (String), city (String)

1 Weston Parker weston.parker23@gmail.com Innovative Systems Solutions Lubbock
2 Hazel Torres hazel723@hotmail.com Sterling Engineering Anaheim
3 Lawrence Mitchell lawrence_mitchell@zoho.com Goldman Sachs Phoenix
4 Maria Garcia m_garcia@hotmail.com Evans Group Denver
5 Michael Hoffman michael.hoffman@gmail.com Goodwin and Garrett San Antonio
...
96 Daniel Torres daniel_torres@icloud.com Henry Construction El Paso
97 Helen Simpson hsimpson20@yandex.com Thompson Technologies El Paso
98 Mark Graham mark.graham65@mail.com Universal Consulting Charlotte
99 Brian Moore bmoore95@zoho.com Long Industries Los Angeles
100 Michael Ward michael_ward@yahoo.com Pioneer Solutions San Diego

This coherence extends to other related fields like user_name, which will also reflect the person’s name when included alongside name and email fields.
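To make this linkage concrete, here is a hypothetical sketch of how an email could be derived from a generated name. The local-part styles and domains below are invented for illustration and are not Pointblank's actual logic:

```python
import random

# Hypothetical person coherence: the email local part is derived from
# the already-generated name, so the two columns agree within each row.
SEPARATORS = ["", ".", "_"]                      # "janedoe", "jane.doe", "jane_doe"
DOMAINS = ["gmail.com", "outlook.com", "aol.com"]  # invented domain pool

def coherent_email(name: str, rng: random.Random) -> str:
    first, last = name.lower().split()
    local = rng.choice([
        first + rng.choice(SEPARATORS) + last,   # full first name + last name
        first[0] + "_" + last,                   # initial style, e.g. "d_martin"
    ])
    return f"{local}@{rng.choice(DOMAINS)}"

rng = random.Random(23)
email = coherent_email("Doris Martin", rng)
assert "martin" in email.split("@")[0]  # local part reflects the name
```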

String Fields with Patterns

Use regex patterns to generate strings matching specific formats:

schema = pb.Schema(
    product_code=pb.string_field(pattern=r"[A-Z]{3}-[0-9]{4}"),
    phone=pb.string_field(pattern=r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}"),
    hex_color=pb.string_field(pattern=r"#[0-9A-F]{6}"),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns
Columns: product_code (String), phone (String), hex_color (String)

1 CAS-6685 (109) 668-2347 #209DCB
2 XGI-0397 (397) 117-0865 #68E07E
3 DCW-6086 (309) 293-9594 #32FD0D
4 YBG-9529 (917) 797-2285 #161B56
5 XLS-9459 (911) 609-9495 #B9A2F5
...
96 THG-2900 (993) 511-5415 #A7A37B
97 CHC-3681 (065) 802-0822 #47E498
98 HKT-3552 (927) 701-4276 #AF75D8
99 OEW-4157 (365) 419-1062 #5CCD95
100 FSX-8948 (897) 459-3038 #0F3220

Patterns support standard regex character classes and quantifiers, giving you flexibility to generate data matching virtually any format specification.
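To illustrate how pattern-based generation can work, the toy expander below handles only the restricted grammar used in the example above (character classes, {n} quantifiers, and escaped literals). It is a sketch, not Pointblank's engine, which supports far more of the regex language:

```python
import random
import re

# Tokenize a restricted pattern grammar: "[...]" classes with an
# optional "{n}" quantifier, "\x" escaped literals, and plain literals.
TOKEN = re.compile(r"\[([^\]]+)\](?:\{(\d+)\})?|\\(.)|(.)")

def expand_class(body: str) -> str:
    # Turn "A-Z0-9" into the full set of characters it denotes.
    chars, i = [], 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return "".join(chars)

def generate(pattern: str, rng: random.Random) -> str:
    out = []
    for cls, count, escaped, literal in TOKEN.findall(pattern):
        if cls:
            pool = expand_class(cls)
            out.append("".join(rng.choice(pool) for _ in range(int(count or 1))))
        else:
            out.append(escaped or literal)
    return "".join(out)

rng = random.Random(23)
code = generate(r"[A-Z]{3}-[0-9]{4}", rng)
assert re.fullmatch(r"[A-Z]{3}-[0-9]{4}", code)  # output matches its own pattern
```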

Boolean Fields

Control the probability of True values:

schema = pb.Schema(
    is_active=pb.bool_field(p_true=0.8),      # 80% True
    is_premium=pb.bool_field(p_true=0.2),     # 20% True
    is_verified=pb.bool_field(),              # 50% True (default)
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 3 columns
Columns: is_active (Boolean), is_premium (Boolean), is_verified (Boolean)

1 False False False
2 False False False
3 False False False
4 True True True
5 True False False
...
96 True False True
97 True False True
98 False False False
99 False False False
100 False False False

This probabilistic control is helpful when you need to simulate real-world distributions where certain states are more common than others.

Date and Datetime Fields

Temporal fields accept Python date and datetime objects for their range boundaries, generating values uniformly distributed within the specified period:

from datetime import date, datetime

schema = pb.Schema(
    birth_date=pb.date_field(
        min_date=date(1960, 1, 1),
        max_date=date(2005, 12, 31)
    ),
    created_at=pb.datetime_field(
        min_date=datetime(2024, 1, 1),
        max_date=datetime(2024, 12, 31)
    ),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23))
Polars · 100 rows × 2 columns
Columns: birth_date (Date), created_at (Datetime)

1 1986-01-03 2024-12-25 04:22:08
2 1967-06-30 2024-10-29 16:22:23
3 1961-07-13 2024-04-22 14:13:08
4 1987-07-09 2024-12-12 14:04:53
5 1998-01-06 2024-11-18 04:49:47
...
96 1969-04-14 2024-07-29 13:15:44
97 1975-03-23 2024-04-28 08:49:29
98 1981-05-29 2024-12-13 09:42:37
99 1982-09-14 2024-10-28 23:35:39
100 1968-12-21 2024-06-25 14:22:27

The same pattern applies to time_field() and duration_field(), allowing you to generate realistic temporal data for any use case.
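As an illustration of uniform temporal sampling (a sketch, not Pointblank's internals), a date can be drawn by picking a random day offset within the span:

```python
import random
from datetime import date, timedelta

def random_date(rng: random.Random, min_date: date, max_date: date) -> date:
    # Draw a day offset uniformly, so every date in the closed
    # interval [min_date, max_date] is equally likely.
    span = (max_date - min_date).days
    return min_date + timedelta(days=rng.randrange(span + 1))

rng = random.Random(23)
d = random_date(rng, date(1960, 1, 1), date(2005, 12, 31))
assert date(1960, 1, 1) <= d <= date(2005, 12, 31)
```

Datetimes work the same way with a random second (or microsecond) offset instead of a day offset.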

Available Presets

The preset= parameter in string_field() supports many data types:

Personal Data:

  • name: full name (first + last)
  • name_full: full name with optional prefix/suffix (e.g., “Dr. Ana Sousa”, “Prof. Tanaka Yuki”)
  • first_name: first name only
  • last_name: last name only
  • email: email address
  • phone_number: phone number in country-specific format

Location Data:

  • address: full street address
  • city: city name
  • state: state/province name
  • country: country name
  • postcode: postal/ZIP code
  • latitude: latitude coordinate
  • longitude: longitude coordinate

Business Data:

  • company: company name
  • job: job title
  • catch_phrase: business catch phrase

Internet Data:

  • url: website URL
  • domain_name: domain name
  • ipv4: IPv4 address
  • ipv6: IPv6 address
  • user_name: username
  • password: password

Financial Data:

  • credit_card_number: credit card number
  • iban: International Bank Account Number
  • currency_code: currency code (USD, EUR, etc.)

Identifiers:

  • uuid4: UUID version 4
  • md5: MD5 hash (32 hex characters)
  • sha1: SHA-1 hash (40 hex characters)
  • sha256: SHA-256 hash (64 hex characters)
  • ssn: Social Security Number (country-specific format)
  • license_plate: vehicle license plate (location-aware for CA, US, DE, AU, GB)

Barcodes:

  • ean8: EAN-8 barcode with valid check digit
  • ean13: EAN-13 barcode with valid check digit

Date/Time:

  • date_this_year: a date within the current year
  • date_this_decade: a date within the current decade
  • date_between: a random date between 2000 and 2025
  • date_range: two dates joined with an en-dash (e.g., "2012-05-12 – 2015-11-22")
  • future_date: a date up to 1 year in the future
  • past_date: a date up to 10 years in the past
  • time: a time value

Text:

  • word: single word
  • sentence: full sentence
  • paragraph: paragraph of text
  • text: multiple paragraphs

Miscellaneous:

  • color_name: color name
  • file_name: file name
  • file_extension: file extension
  • mime_type: MIME type
  • user_agent: browser user agent string (country-weighted)

Country-Specific Data

One of the most powerful features is generating locale-aware data. Use the country= parameter to generate data specific to a country. This affects names, cities, addresses, and other locale-sensitive presets.

Let’s create a schema that includes several location-related fields. When generating data for a specific country, Pointblank ensures consistency across related fields. The city, address, postcode, and coordinates will all correspond to the same location:

# Schema with linked location fields
schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    address=pb.string_field(preset="address"),
    postcode=pb.string_field(preset="postcode"),
    latitude=pb.string_field(preset="latitude"),
    longitude=pb.string_field(preset="longitude"),
)

Here’s German data with authentic names and addresses from cities like Berlin, Munich, and Hamburg. Notice how the latitude/longitude coordinates match real locations in Germany:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="DE"))
Polars · 200 rows × 6 columns
Columns: name (String), city (String), address (String), postcode (String), latitude (String), longitude (String)

1 Niklas Schulte Potsdam Jägertor 8211, Whg. 737, 14097 Potsdam 14448 52.428914 13.064566
2 Erik Becker Halle (Saale) Hansering 9509, Whg. 636, 06678 Halle (Saale) 06101 51.484572 11.937119
3 Marco Albrecht Frankfurt am Main Gartenstraße 9713, Whg. 474, 60597 Frankfurt am Main 60674 50.212245 8.711472
4 Juliane Münz Leipzig Lindenauer Markt 6249, Whg. 489, 04541 Leipzig 04992 51.276862 12.458890
5 Anton Baumann Köln Aachener Straße 7203, 50125 Köln 50589 50.967264 6.795838
...
196 Franziska Wendt Ulm Marktplatz 6251, Whg. 535, 89984 Ulm 89226 48.395296 10.001962
197 Lennart Berger München Brienner Straße 1390, Whg. 389, 80255 München 80835 48.206882 11.674262
198 Julia Knecht Ludwigshafen am Rhein Friedrichstraße 3204, 67944 Ludwigshafen am Rhein 67305 49.473668 8.437782
199 Sebastian Thiel Gelsenkirchen Husemannstraße 453, Whg. 273, 45732 Gelsenkirchen 45992 51.568689 7.082531
200 Trude Kaiser Kassel Königstraße 1394, 34406 Kassel 34736 51.326544 9.494319

Japanese data includes names in romanized form and addresses from cities like Tokyo, Osaka, and Kyoto. The coordinates fall within Japan’s geographic boundaries:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="JP"))
Polars · 200 rows × 6 columns
Columns: name (String), city (String), address (String), postcode (String), latitude (String), longitude (String)

1 Kenji Ozawa Kasuga 816-3132 Fukuoka Kasuga Shirane 2869-411 816-5387 33.532717 130.479907
2 Yuki Sugimoto Higashihiroshima 739-7478 Hiroshima Higashihiroshima Kouchi-dori 5279-446 739-6297 34.402366 132.767214
3 Osamu Takahashi Fukuoka 810-8705 Fukuoka Fukuoka Jonan-dori 4932-302 810-3425 33.676810 130.414218
4 Yuki Sakai Saitama 330-3255 Saitama Saitama Minuma-dori 3157 330-4163 35.927124 139.649673
5 Haruka Yamazaki Kitakyushu 802-7566 Fukuoka Kitakyushu Shin-Kokura-dori 281 802-7878 33.908803 130.911548
...
196 Mayumi Takahashi Fukuoka 810-7017 Fukuoka Fukuoka Imajuku-dori 9700 810-1106 33.558128 130.415678
197 Katsuya Kawamura Hiroshima 730-1406 Hiroshima Hiroshima Hakushima-dori 281 730-6070 34.398047 132.440192
198 Kazue Nakata Sendai 980-1193 Miyagi Sendai Itsutsubashi-dori 1182 980-1544 38.255431 140.876065
199 Nozomi Shiraishi Matsudo 271-5711 Chiba Matsudo Asahi-dori 584 271-5161 35.818062 139.882260
200 Ayaka Asano Kure 737-9841 Hiroshima Kure Tsukiji-dori 3763 737-7152 34.261593 132.556836

Brazilian data features Portuguese names and addresses from cities like São Paulo, Rio de Janeiro, and Brasília. The postal codes follow Brazil’s CEP format:

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="BR"))
Polars · 200 rows × 6 columns
Columns: name (String), city (String), address (String), postcode (String), latitude (String), longitude (String)

1 Fábio Santos Campinas Avenida Anchieta, 1624, Apto 677, 13239-778 Campinas - SP 13305-992 -22.926542 -47.051795
2 João Vieira Porto Alegre Rua Voluntários da Pátria, 2869, 90313-262 Porto Alegre - RS 90736-876 -30.053404 -51.210651
3 Letícia Costa Rio de Janeiro Avenida Maracanã, 4893, Apto 709, 20863-487 Rio de Janeiro - RJ 20608-707 -22.935222 -43.211152
4 Regina Ferreira Belo Horizonte Rua Guajajaras, 5690, 30633-255 Belo Horizonte - MG 30735-963 -19.893544 -43.933750
5 Francisco Rodrigues Belo Horizonte Rua Turquesa, 281, Apto 515, 30775-668 Belo Horizonte - MG 30336-542 -19.899167 -43.917805
...
196 Júlia Martins São Paulo Rua do Triunfo, 1651, Apto 805, 01237-943 São Paulo - SP 01623-173 -23.519871 -46.655368
197 Júlia Moreira Recife Avenida Professor José dos Anjos, 2317, 50139-444 Recife - PE 50306-672 -8.080424 -34.891849
198 Rodrigo Moura Salvador Avenida Paulo VI, 8262, 40439-595 Salvador - BA 40131-158 -12.985375 -38.520443
199 Iraci Lima São Paulo Alameda Jaú, 5338, 01144-803 São Paulo - SP 01991-418 -23.527410 -46.680338
200 Bianca Medeiros Curitiba Avenida Winston Churchill, 2427, 80226-835 Curitiba - PR 80461-489 -25.447445 -49.281284

This location coherence is valuable when testing geospatial applications, address validation systems, or any scenario where realistic, internally-consistent location data matters.

Data Coherence

Pointblank automatically links related columns to produce realistic rows. There are three coherence systems that activate based on which presets appear together in a schema:

Address coherence activates when any address-related preset is present (address, city, state, postcode, latitude, longitude, phone_number, license_plate). All of these fields will refer to the same location within each row.

Person coherence activates when any person-related preset is present (name, name_full, first_name, last_name, email, user_name). The email and username are derived from the person’s name.

Business coherence activates when both job and company are present. When active:

  • the company and job title are drawn from the same industry (e.g., a nurse will work at a hospital, not a law firm).
  • name_full gains profession-matched titles: a doctor may appear as “Dr. Ana Sousa” and a professor as “Prof. Tanaka Yuki”. For German-speaking countries (DE, AT, CH), the honorific stacks before the professional title (e.g., “Herr Dr. med. Klaus Weber”).
  • integer columns whose name contains age (e.g., age, person_age) are automatically constrained to a working-age range (22–65).

Here’s an example showing all three coherence systems working together:

schema = pb.Schema(
    name=pb.string_field(preset="name_full"),
    email=pb.string_field(preset="email"),
    company=pb.string_field(preset="company"),
    job=pb.string_field(preset="job"),
    city=pb.string_field(preset="city"),
    state=pb.string_field(preset="state"),
    license_plate=pb.string_field(preset="license_plate"),
    age=pb.int_field(),
)

pb.preview(pb.generate_dataset(schema, n=100, seed=23, country="DE"))
Polars · 100 rows × 8 columns
Columns: name (String), email (String), company (String), job (String), city (String), state (String), license_plate (String), age (Int64)

1 Frau Agathe Kramer agathekramer@mail.de Berliner Systeme Software Produktmanager Potsdam Brandenburg P-ZL 931 40
2 Herr Rüdiger Altmann ruediger.altmann@mail.de KPMG Business Analyst Halle (Saale) Sachsen-Anhalt HAL-D 6183 27
3 Herr Bodo Meyer b_meyer@outlook.de Internationale Strom Netzwerke Systemadministrator Frankfurt am Main Hessen F-DZ 0091 23
4 Frau Stefanie Meyer stefanie.meyer@posteo.de Frankfurter Sicherheit Cloud-Architekt Leipzig Sachsen L-X 1095 59
5 Herr Fabian Thomas f_thomas@freenet.de Scherer Immobilien Immobilienmakler Köln Nordrhein-Westfalen K-P 299 41
...
96 Frau Prof. Maike Fuchs maike.fuchs37@arcor.de Technische Universität Frankfurt am Main Professor Frankfurt am Main Hessen F-A 2325 24
97 Herr Sebastian König sebastiankoenig@web.de Premium Kreativ Grafikdesigner Augsburg Bayern A-MY 8326 25
98 Frau Diana Henkel diana990@web.de Sparkasse Frankfurt am Main Finanzanalyst Frankfurt am Main Hessen F-F 1377 62
99 Herr Matthias Keßler matthiaskessler@yahoo.de Schrader Gruppe Grafikdesigner Zwickau Sachsen Z-W 569 48
100 Herr Dirk Neumann dirk_neumann@arcor.de Deutsche Software Digital Produktmanager Nürnberg Bayern N-K 1620 47

License plate coherence is part of address coherence. For CA, US, DE, AU, and GB, license plates follow real subregion-specific formats when location fields are present. For example, an Ontario row produces plates like "CABC 123" while a British Columbia row produces "AB1 23C". Letters I, O, Q, and U are excluded from plate generation, matching real-world restrictions.
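As a rough illustration of the letter-exclusion rule, here is a toy generator for Ontario-style plates. The "CABC 123" format comes from the example above; the generator itself is hypothetical and not Pointblank's implementation:

```python
import random
import string

# Real-world plate alphabets typically omit I, O, Q, and U because they
# are easily confused with 1, 0, and V.
EXCLUDED = set("IOQU")
LETTERS = [c for c in string.ascii_uppercase if c not in EXCLUDED]

def ontario_plate(rng: random.Random) -> str:
    # Ontario passenger format: four letters, a space, three digits.
    letters = "".join(rng.choice(LETTERS) for _ in range(4))
    digits = "".join(rng.choice(string.digits) for _ in range(3))
    return f"{letters} {digits}"

rng = random.Random(23)
plate = ontario_plate(rng)
assert not (set(plate) & EXCLUDED)  # excluded letters never appear
```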

Supported Countries

Pointblank currently supports 55 countries with full locale data for realistic test data generation. You can use either ISO 3166-1 alpha-2 codes (e.g., "US") or alpha-3 codes (e.g., "USA").

Europe (32 countries):

  • Austria (AT), Belgium (BE), Bulgaria (BG), Croatia (HR), Cyprus (CY), Czech Republic (CZ), Denmark (DK), Estonia (EE), Finland (FI), France (FR), Germany (DE), Greece (GR), Hungary (HU), Iceland (IS), Ireland (IE), Italy (IT), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Netherlands (NL), Norway (NO), Poland (PL), Portugal (PT), Romania (RO), Russia (RU), Slovakia (SK), Slovenia (SI), Spain (ES), Sweden (SE), Switzerland (CH), United Kingdom (GB)

Americas (7 countries):

  • Argentina (AR), Brazil (BR), Canada (CA), Chile (CL), Colombia (CO), Mexico (MX), United States (US)

Asia-Pacific (12 countries):

  • Australia (AU), China (CN), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), New Zealand (NZ), Philippines (PH), Singapore (SG), South Korea (KR), Taiwan (TW), Thailand (TH)

Middle East & Africa (4 countries):

  • Nigeria (NG), South Africa (ZA), Turkey (TR), United Arab Emirates (AE)

Additional countries and expanded coverage are planned for future releases.

Mixing Multiple Countries

When you need test data that spans multiple locales (e.g., simulating an international customer base), you can pass a list or dict to the country= parameter instead of a single string.

Passing a list of country codes splits rows equally across those countries. Here, 200 rows are divided evenly among the US, Germany, and Japan (~67 each):

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
    postcode=pb.string_field(preset="postcode"),
)

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))
Polars · 200 rows × 3 columns
Columns: name (String), city (String), postcode (String)

1 Ingo Eder Mannheim 68638
2 Natsuko Maeda Umeda 530-4019
3 Carol Ramirez Denver 80269
4 Robert Kennedy Orlando 32880
5 Roland Schubert Düsseldorf 40135
...
196 Satoru Yamakawa Iwata 438-0548
197 Ezekiel Gibson Dallas 75282
198 Paul Cole Chicago 60694
199 Diana Huber Dortmund 44140
200 Mayumi Maruyama Sapporo 060-9195

To control the proportion of rows per country, pass a dict mapping country codes to weights. The following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:

pb.preview(
    pb.generate_dataset(
        schema, n=200, seed=23,
        country={"US": 0.7, "DE": 0.2, "FR": 0.1},
    )
)
Polars · 200 rows × 3 columns
Columns: name (String), city (String), postcode (String)

1 Michael Flores Memphis 38122
2 Dominik Köhler Düsseldorf 40103
3 Emilia Torres Denver 80247
4 Sara Baker Orlando 32889
5 Paul Wood South Bend 46684
...
196 Norbert Wagner Darmstadt 64156
197 Carolyn Howe Dallas 75273
198 Aaliyah Thompson Chicago 60696
199 Andrew Morgan Chicago 60647
200 Laura Carré Toulouse 31066

Weights are auto-normalized, so {"US": 7, "DE": 2, "FR": 1} is equivalent to the example above. Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly n.
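For reference, largest-remainder apportionment can be sketched in a few lines. This is a reimplementation of the idea described above, not Pointblank's code:

```python
from math import floor

def apportion(weights: dict, n: int) -> dict:
    # Normalize weights, give each country the floor of its exact share,
    # then hand the leftover rows to the largest fractional remainders.
    total = sum(weights.values())
    exact = {c: n * w / total for c, w in weights.items()}
    counts = {c: floor(x) for c, x in exact.items()}
    leftover = n - sum(counts.values())
    for c in sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)[:leftover]:
        counts[c] += 1
    return counts

counts = apportion({"US": 0.7, "DE": 0.2, "FR": 0.1}, 200)
assert counts == {"US": 140, "DE": 40, "FR": 20}
assert sum(counts.values()) == 200  # always sums to exactly n
```

The normalization step is why {"US": 7, "DE": 2, "FR": 1} produces the same allocation as the fractional weights.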

By default, rows from different countries are interleaved randomly (shuffle=True). Set shuffle=False to keep rows grouped by country in the order the countries are listed:

pb.preview(
    pb.generate_dataset(
        schema, n=120, seed=23,
        country=["US", "DE", "JP"], shuffle=False,
    )
)
Polars · 120 rows × 3 columns
Columns: name (String), city (String), postcode (String)

1 Amanda Manning Bowie 20747
2 Brian Martinez San Diego 92113
3 Zachary Wilson Syracuse 13284
4 Gary Carpenter Scottsdale 85271
5 Ryan Walker Dallas 75232
...
116 Takuya Fujimura Koshigaya 343-8146
117 Daisuke Fujii Yokohama 220-9823
118 Naoya Kishimoto Kasuga 816-7579
119 Akiko Izumi Nishinomiya 662-7720
120 Megumi Uchida Kobe 650-2226

All coherence systems (address, person, business) work correctly within each country’s batch of rows. A French row will have a French name with a matching French email; a Japanese row will have a Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates) are generated independently for each batch but still respect their field constraints.

Frequency-Weighted Sampling

By default, names and cities are sampled uniformly at random from the locale data, giving every entry the same probability of being selected. Real-world distributions, however, are far from uniform: “James” and “Maria” appear orders of magnitude more often than “Thaddeus” or “Xiomara”, and more people live in New York City than in Flagstaff. The weighted=True parameter makes generated data reflect this natural skew.

schema = pb.Schema(
    name=pb.string_field(preset="name"),
    city=pb.string_field(preset="city"),
)

pb.preview(pb.generate_dataset(schema, n=200, seed=23, country="US", weighted=True))
Polars · 200 rows × 2 columns
Columns: name (String), city (String)

1 Adam Knight Lubbock
2 Noah Gibson Anaheim
3 Jackson Morales Phoenix
4 Mark Wilson Denver
5 Maisie Perkins San Antonio
...
196 Daniel Woods Philadelphia
197 Christina Anderson Los Angeles
198 Thea Woods Joliet
199 Anthony Campbell San Diego
200 Daniel Wagner Chicago

With weighting enabled you will see popular names like James, John, Mary, and Patricia appear more frequently, while unusual names surface only occasionally. Similarly, cities like New York, Los Angeles, and Chicago dominate the output while smaller cities appear less often.

The feature works by organizing locale data into four frequency tiers. Each tier has a sampling probability that determines how likely its members are to be selected:

Tier         Probability  Contents
very_common  45%          The top ~10% of entries by real-world frequency
common       30%          The next ~20% of entries
uncommon     20%          The next ~30% of entries
rare         5%           The remaining ~40% of entries

When a value is needed, a tier is first chosen according to these probabilities and then a single entry is picked uniformly at random within that tier. This two-step approach keeps sampling fast while producing a realistic long-tail distribution. Setting weighted=False pools all entries across every tier and samples them uniformly, which can be useful when you want an even spread rather than a realistic distribution.
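The two-step sampling described above can be sketched as follows. The tier probabilities match the table; the tier contents are made up for illustration:

```python
import random

# Each tier pairs a sampling probability with its members.
TIERS = {
    "very_common": (0.45, ["James", "Mary", "John"]),
    "common":      (0.30, ["Brian", "Megan"]),
    "uncommon":    (0.20, ["Isaiah", "Brittany"]),
    "rare":        (0.05, ["Thaddeus", "Xiomara"]),
}

def weighted_pick(rng: random.Random) -> str:
    names = list(TIERS)
    probs = [TIERS[t][0] for t in names]
    tier = rng.choices(names, weights=probs, k=1)[0]  # step 1: pick a tier
    return rng.choice(TIERS[tier][1])                 # step 2: uniform within tier

rng = random.Random(23)
draws = [weighted_pick(rng) for _ in range(10_000)]
top = sum(d in TIERS["very_common"][1] for d in draws) / len(draws)
assert 0.40 < top < 0.50  # roughly 45% of draws come from the very_common tier
```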

Weighted sampling combines seamlessly with multi-country mixing. Each country’s batch uses its own tiered data independently, so a mixed dataset will have weighted US names alongside weighted German names:

pb.preview(
    pb.generate_dataset(
        schema,
        n=200,
        seed=23,
        country={"US": 0.6, "DE": 0.4},
        weighted=True,
    )
)
Polars · 200 rows × 2 columns
Columns: name (String), city (String)

1 Paul Wood Memphis
2 Heribert Wolf Osnabrück
3 William Jackson Denver
4 Paul Scott Orlando
5 Betty Herrera South Bend
...
196 Angelika Voß Aachen
197 Margaret Garcia Dallas
198 Kevin Cooper Chicago
199 George Pierce Chicago
200 Friedrich Schilling Oldenburg

All 55 supported country locales have tiered name and location data, so weighted=True produces realistic frequency distributions for every country.

Output Formats

The generate_dataset() function supports multiple output formats via the output= parameter, making it easy to integrate with your preferred data processing library.

schema = pb.Schema(
    id=pb.int_field(min_val=1),
    name=pb.string_field(preset="name"),
)

The default output is a Polars DataFrame, which offers excellent performance and a modern API for data manipulation:

polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")

pb.preview(polars_df)
Polars · 100 rows × 2 columns
Columns: id (Int64), name (String)

1 7188536481533917197 Doris Martin
2 2674009078779859984 Nancy Gonzalez
3 7652102777077138151 Jessica Turner
4 157503859921753049 George Evans
5 2829213282471975080 Patricia Williams
...
96 7027508096731143831 Isaiah Murphy
97 6055996548456656575 Brittany Rodriguez
98 3822709996092631588 Megan Stevens
99 1522653102058131295 Pamela Jenkins
100 5690877051669225499 Stephanie Santos

If your workflow uses Pandas, simply specify output="pandas" to get a Pandas DataFrame:

pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")

pb.preview(pandas_df)
Pandas · 100 rows × 2 columns
Columns: id (int64), name (str)

1 7188536481533917197 Doris Martin
2 2674009078779859984 Nancy Gonzalez
3 7652102777077138151 Jessica Turner
4 157503859921753049 George Evans
5 2829213282471975080 Patricia Williams
...
96 7027508096731143831 Isaiah Murphy
97 6055996548456656575 Brittany Rodriguez
98 3822709996092631588 Megan Stevens
99 1522653102058131295 Pamela Jenkins
100 5690877051669225499 Stephanie Santos

Both formats work seamlessly with Pointblank’s validation functions, so you can choose whichever fits best with your existing data pipeline.

Using Generated Data for Validation Testing

A common use case is generating test data to validate your validation rules:

# Define a schema with constraints
schema = pb.Schema(
    user_id=pb.int_field(min_val=1, unique=True),
    email=pb.string_field(preset="email"),
    age=pb.int_field(min_val=18, max_val=100),
    status=pb.string_field(allowed=["active", "pending", "inactive"]),
)

# Generate test data
test_data = pb.generate_dataset(schema, n=100, seed=23)

# Validate the generated data (it should pass all checks)
validation = (
    pb.Validate(test_data)
    .col_vals_gt("user_id", 0)
    .col_vals_regex("email", r".+@.+\..+")
    .col_vals_between("age", 18, 100)
    .col_vals_in_set("status", ["active", "pending", "inactive"])
    .interrogate()
)

validation
Pointblank Validation — 2026-02-18 22:21:49 (Polars)

STEP  ASSERTION           COLUMNS  VALUES                     UNITS  PASS        FAIL
1     col_vals_gt()       user_id  0                          100    100 (1.00)  0 (0.00)
2     col_vals_regex()    email    .+@.+\..+                  100    100 (1.00)  0 (0.00)
3     col_vals_between()  age      [18, 100]                  100    100 (1.00)  0 (0.00)
4     col_vals_in_set()   status   active, pending, inactive  100    100 (1.00)  0 (0.00)

Since the generated data respects the constraints defined in the schema, it should pass all validation checks. This workflow is particularly useful for testing validation logic before applying it to production data, or for creating reproducible test fixtures in your CI/CD pipeline.

Pytest Fixture

When Pointblank is installed, a generate_dataset pytest fixture is automatically available in all your test files. There is no need to import anything or add configuration to conftest.py: the fixture is registered via pytest’s plugin system.

The fixture works identically to pb.generate_dataset(), but with one key difference: when you don’t supply a seed= parameter, a deterministic seed is automatically derived from the test’s fully-qualified name. This means:

  • the same test always produces the same data: no manual seed management required.
  • different tests get different seeds, so they exercise different datasets.
  • you can still pass an explicit seed= to override the automatic seed when needed.
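One plausible way such a deterministic seed could be derived is by hashing the test's fully-qualified name. This sketch is an assumption for illustration; Pointblank's exact scheme may differ:

```python
import hashlib

def seed_for(test_name: str, call_index: int = 0) -> int:
    # Hash the qualified test name (plus a per-call counter) to a
    # stable 64-bit integer: same name -> same seed, every run.
    digest = hashlib.sha256(f"{test_name}:{call_index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

s1 = seed_for("tests/test_pipeline.py::test_etl_handles_nulls")
s2 = seed_for("tests/test_pipeline.py::test_etl_handles_nulls")
s3 = seed_for("tests/test_pipeline.py::test_german_data")
assert s1 == s2   # the same test always gets the same seed
assert s1 != s3   # different tests get different seeds
```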

Basic Usage

Use it by adding generate_dataset to your test function’s parameter list:

test_pipeline.py
import polars as pl
import pointblank as pb

def test_etl_handles_nulls(generate_dataset):
    schema = pb.Schema(
        user_id=pb.int_field(unique=True),
        email=pb.string_field(preset="email", nullable=True, null_probability=0.3),
        age=pb.int_field(min_val=0, max_val=120),
    )

    df = generate_dataset(schema, n=500)
    result = my_etl_pipeline(df)
    assert result.filter(pl.col("email").is_null()).shape[0] == 0

All parameters from generate_dataset() are supported: n=, seed=, output=, and country=:

def test_german_data(generate_dataset):
    schema = pb.Schema(
        name=pb.string_field(preset="name"),
        city=pb.string_field(preset="city"),
    )

    df = generate_dataset(schema, n=200, country="DE", output="pandas")
    assert len(df) == 200

Multiple Datasets in One Test

Calling the fixture multiple times within the same test produces different (but still deterministic) data on each call:

def test_merge_pipeline(generate_dataset):
    customers = generate_dataset(customer_schema, n=1000, country="US")
    orders = generate_dataset(order_schema, n=5000)

    # Each call gets a unique seed derived from the test name + call index,
    # so both DataFrames are deterministic and different from each other.
    result = merge_pipeline(customers, orders)
    assert result.shape[0] > 0

Testing Across Locales

The fixture makes locale testing particularly concise when combined with pytest.mark.parametrize:

import pytest
import pointblank as pb

@pytest.mark.parametrize("country", ["US", "DE", "JP", "BR"])
def test_name_normalizer(generate_dataset, country):
    schema = pb.Schema(name=pb.string_field(preset="name_full"))
    df = generate_dataset(schema, n=100, country=country)
    result = normalize_names(df)
    assert result["name"].str.len_chars().min() > 0

Sharing Schemas Across Tests

Define schemas as fixtures in conftest.py and compose them with generate_dataset:

conftest.py
import pytest
import pointblank as pb

@pytest.fixture
def customer_schema():
    return pb.Schema(
        id=pb.int_field(unique=True),
        name=pb.string_field(preset="name"),
        email=pb.string_field(preset="email"),
        city=pb.string_field(preset="city"),
    )
test_validation.py
import pointblank as pb

def test_customer_validation(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=200, country="DE")
    validation = pb.Validate(df).col_vals_not_null(columns="email").interrogate()
    assert validation.all_passed()
test_export.py
def test_customer_export(generate_dataset, customer_schema):
    df = generate_dataset(customer_schema, n=50, country="JP")
    exported = export_to_parquet(df)
    assert exported.exists()

Debugging with Seed Introspection

The fixture callable exposes two attributes that make debugging failed tests straightforward:

  • generate_dataset.default_seed: the base seed derived from the test name (available before any call)
  • generate_dataset.last_seed: the seed actually used for the most recent call (accounts for the call counter and explicit overrides)

Include .last_seed in assertion messages so failures are immediately reproducible:

def test_age_range(generate_dataset):
    schema = pb.Schema(age=pb.int_field(min_val=18, max_val=100))
    df = generate_dataset(schema, n=500)
    min_age = df["age"].min()
    assert min_age >= 18, (
        f"Expected min age >= 18, got {min_age} (seed={generate_dataset.last_seed})"
    )

You can also use .default_seed to reproduce the exact dataset outside of pytest:

# In a REPL or notebook, reproduce the data from a failed test:
import pointblank as pb
df = pb.generate_dataset(schema, n=500, seed=<default_seed_from_output>)

Seed Stability

A given seed (whether explicit or auto-derived) is guaranteed to produce identical output within the same Pointblank version. Across versions, changes to country data files or generator logic may alter the output for a given seed.

For CI pipelines that require bit-exact data across library upgrades, we recommend saving generated DataFrames as Parquet or CSV snapshot files rather than relying on cross-version seed stability. This is the same approach used by snapshot-testing tools like pytest-snapshot and syrupy.
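The snapshot idea can be sketched with the standard library (JSON is used here for brevity; Parquet or CSV works the same way). The first run records the data, and later runs fail if it drifts:

```python
import json
import pathlib
import tempfile

def assert_matches_snapshot(records: list, snap_path: pathlib.Path) -> None:
    # First run: write the snapshot. Later runs: compare against it, so a
    # library upgrade that changes seeded output is caught explicitly.
    if not snap_path.exists():
        snap_path.write_text(json.dumps(records, sort_keys=True))
        return
    assert json.loads(snap_path.read_text()) == records, "data drifted from snapshot"

with tempfile.TemporaryDirectory() as tmp:
    snap = pathlib.Path(tmp) / "customers.json"
    rows = [{"id": 1, "name": "Doris Martin"}]
    assert_matches_snapshot(rows, snap)  # first call writes the snapshot
    assert_matches_snapshot(rows, snap)  # second call compares and passes
```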

Conclusion

Test data generation provides a convenient way to create realistic synthetic datasets directly from schema definitions. While the concept is straightforward (defining field types and constraints, then generating matching data), the feature can be invaluable in many development and testing workflows. By incorporating test data generation into your process, you can:

  • quickly prototype validation rules before working with production data
  • create reproducible test fixtures for automated testing and CI/CD pipelines
  • generate locale-specific data for internationalization testing across 55 countries
  • ensure coherent relationships between related fields like names, emails, addresses, jobs, and license plates
  • produce datasets of any size with consistent, realistic values

Whether you’re building validation logic, testing data pipelines, or simply need sample data for development, the schema-based generation approach gives you precise control over data characteristics while maintaining the realism needed to uncover edge cases and validate your assumptions about data quality.