Post

🐘 PostgreSQL κΈ°λ³Έ β…₯ - SQL을 μ‚¬μš©ν•œ κΈ°λ³Έ μˆ˜ν•™ 및 톡계

🐘 PostgreSQL κΈ°λ³Έ β…₯ - SQL을 μ‚¬μš©ν•œ κΈ°λ³Έ μˆ˜ν•™ 및 톡계

πŸ“™ γ€Žμ‹€μš© SQL』을 읽고 μ •λ¦¬ν•œ κΈ€μž…λ‹ˆλ‹€.

μˆ˜ν•™ μ—°μ‚°μžμ™€ ν•¨μˆ˜ μ΄ν•΄ν•˜κΈ°

μˆ˜ν•™κ³Ό 데이터 νƒ€μž… μ΄ν•΄ν•˜κΈ°

  • 두 숫자 사이에 λ§μ…ˆκ³Ό λΊ„μ…ˆ, κ³±μ…ˆ, λ‚˜λˆ—μ…ˆ μ—°μ‚°μžλ₯Ό μ‚¬μš©ν•˜λ©΄ λ°˜ν™˜λ˜λŠ” κ°’μ˜ 데이터 νƒ€μž…μ€ λ‹€μŒκ³Ό κ°™λ‹€.
  1. 두 μ •μˆ˜λŠ” integerλ₯Ό λ°˜ν™˜ν•œλ‹€.
  2. μ—°μ‚°μž μ˜†μ— numeric νƒ€μž…μΈ μˆ«μžκ°€ ν•˜λ‚˜λΌλ„ 있으면 numeric을 λ°˜ν™˜ν•œλ‹€.
  3. 뢀동 μ†Œμˆ˜μ  μˆ«μžκ°€ 있으면 뢀동 μ†Œμˆ˜μ  νƒ€μž…μΈ double precision을 λ°˜ν™˜ν•œλ‹€.
  • κ·ΈλŸ¬λ‚˜ μ§€μˆ˜, 제곱근, νŒ©ν† λ¦¬μ–Ό ν•¨μˆ˜λŠ” λ‹€λ₯΄λ‹€.
  • 각각은 μ—°μ‚°μž μ•žμ΄λ‚˜ 뒀에 ν•˜λ‚˜μ˜ 숫자λ₯Ό μ·¨ν•˜κ³ , μž…λ ₯이 μ •μˆ˜μΈ κ²½μš°μ—λ„ numericκ³Ό floating-point νƒ€μž…μ„ λ°˜ν™˜ν•œλ‹€.
  • 데이터 νƒ€μž…μ΄ μ˜λ„ν•œ 바와 λ‹€λ₯΄λ‹€λ©΄ CASTλ₯Ό 톡해 데이터 νƒ€μž…μ„ λ³€ν™˜ν•˜μ—¬ μ‚¬μš©ν•˜λ©΄ λœλ‹€.

λ”ν•˜κΈ°, λΉΌκΈ°, 그리고 κ³±ν•˜κΈ°

1
2
3
SELECT 2 + 2;
SELECT 9 - 1;
SELECT 3 * 4;

μ—¬κΈ°μ„œ μš°λ¦¬κ°€ 열을 λ”°λ‘œ μ§€μ •ν•˜μ§€ μ•Šμ•˜κΈ° λ•Œλ¬Έμ— λ‹€μŒκ³Ό 같이 λ―Έμƒμ˜ 열을 λœ»ν•˜λŠ” ?column? μ•„λž˜μ— λ‚˜νƒ€λ‚œλ‹€.

?column?
12
  • μ—΄ 이름을 ν‘œμ‹œν•˜λ €λ©΄ SELECT 3 * 4 AS result와 같이 별칭을 μ‚¬μš©ν•΄μ•Ό ν•œλ‹€.

λ‚˜λˆ„κΈ°μ™€ λͺ¨λ“ˆλŸ¬ μ—°μ‚° μˆ˜ν–‰ν•˜κΈ°

1
2
3
4
SELECT 11 / 6;
SELECT 11 % 6;
SELECT 11.0 / 6;
SELECT CAST(11 AS numeric(3,1)) / 6;
  • 참고둜 ν•œ λ²ˆμ— λͺ«κ³Ό λ‚˜λ¨Έμ§€λ₯Ό μ œκ³΅ν•˜λŠ” 연산은 μ‘΄μž¬ν•˜μ§€ μ•ŠλŠ”λ‹€.
  • 두 숫자λ₯Ό λ‚˜λˆ„κ³  κ²°κ³Όκ°€ numeric νƒ€μž…μœΌλ‘œ λ°˜ν™˜λ˜λ„λ‘ ν•˜λ €λ©΄ ν”Όμ—°μ‚°μž 쀑 ν•˜λ‚˜λΌλ„ numeric νƒ€μž…μ„ 갖도둝 μΊμŠ€νŒ…ν•˜λ©΄ λœλ‹€.

μ§€μˆ˜, 제곱근, νŒ©ν† λ¦¬μ–Ό μ‚¬μš©ν•˜κΈ°

1
2
3
4
5
6
7
8
SELECT 3 ^ 4;

SELECT |/ 10;
SELECT sqrt(10);
SELECT ||/ 10;

SELECT factorial(4);
SELECT 4 !;
  • μ§€μˆ˜ μ—°μ‚°μž ^λ₯Ό μ΄μš©ν•˜λ©΄ μ£Όμ–΄μ§„ 기수λ₯Ό μ§€μˆ˜λ‘œ 올릴 수 μžˆλ‹€.
  • 숫자의 μ œκ³±κ·Όμ€ |/ μ—°μ‚°μž ν˜Ήμ€ sqrt()λ₯Ό μ‚¬μš©ν•œλ‹€.
  • ||/λŠ” 숫자의 μ„Έμ œκ³±κ·Όμ„ κ΅¬ν•˜κΈ° μœ„ν•œ 연산이닀.
  • 숫자의 νŒ©ν† λ¦¬μ–Όμ„ κ³„μ‚°ν•˜λ €λ©΄ factorial() ν˜Ήμ€ ! μ—°μ‚°μžλ₯Ό μ‚¬μš©ν•˜λ©΄ λœλ‹€.
  • SQL ν‘œμ€€μ˜ 일뢀가 μ•„λ‹ˆλ©° PostgreSQLμ—λ§Œ ν•΄λ‹Ήλœλ‹€.

μ—°μ‚°μ˜ μˆœμ„œ μœ μ˜ν•˜κΈ°

  • μ§€κΈˆκΉŒμ§€ μ„€λͺ…ν•œ PostgreSQL의 μ—°μ‚° μˆœμ„œλŠ” λ‹€μŒκ³Ό κ°™λ‹€.
  1. μ§€μˆ˜μ™€ κ·Ό
  2. κ³±ν•˜κΈ°, λ‚˜λˆ„κΈ°, λͺ¨λ“ˆλŸ¬
  3. λ”ν•˜κΈ°, λΉΌκΈ°
  • ν›„μˆœμœ„μ— μžˆλŠ” 연산을 λ¨Όμ € μ²˜λ¦¬ν•˜κ³  μ‹Άλ‹€λ©΄ κ΄„ν˜Έλ₯Ό ν™œμš©ν•΄μ•Ό ν•œλ‹€.
  • λ‚˜μ€‘μ— 뢄석을 μˆ˜μ •ν•  ν•„μš”κ°€ 없도둝 μ—°μ‚°μž μš°μ„  μˆœμœ„μ— μœ μ˜ν•΄μ•Ό ν•œλ‹€.

인ꡬ쑰사 ν…Œμ΄λΈ” 열을 μ΄μš©ν•΄ κ³„μ‚°ν•˜κΈ°

1
2
3
4
5
6
7
8
9
10
SELECT county_name AS county,
	   state_name AS state,
	   pop_est_2019 AS pop,
	   births_2019 AS births,
	   deaths_2019 AS deaths,
	   international_migr_2019 AS int_migr,
	   domestic_migr_2019 AS dom_migr,
	   residual_2019 AS residual
FROM us_counties_pop_est_2019
LIMIT 4;
countystatepopbirthsdeathsint_migrdom_migrresidual
Autauga CountyAlabama55,869624541-16270-1
Baldwin CountyAlabama223,2342,3042,326805,29724
Barbour CountyAlabama24,68625631213-141-2
Bibb CountyAlabama22,3942402521031-2

열끼리 λ”ν•˜κ³  λΉΌκΈ°

1
2
3
4
5
6
7
SELECT county_name AS county,
	   state_name AS state,
	   births_2019 AS births,
	   deaths_2019 AS deaths,
	   births_2019 - deaths_2019 AS natural_increase
FROM us_counties_pop_est_2019
ORDER BY state_name, county_name;
  • μœ„ μ½”λ“œλŠ” 각 μΉ΄μš΄ν‹°μ˜ μΆœμƒμž μˆ˜μ—μ„œ μ‚¬λ§μž 수λ₯Ό λΉΌ 인ꡬ μ‘°μ‚¬μ—μ„œ μžμ—° μ¦κ°€ν•˜λŠ” 수λ₯Ό κ΅¬ν•œλ‹€.
countystatebirthsdeathsnatural_increase
Autauga CountyAlabama62454183
Baldwin CountyAlabama2,3042,326-22
Barbour CountyAlabama256312-56
Bibb CountyAlabama240252-12
  • 이제 이λ₯Ό 기반으둜 데이터λ₯Ό ν…ŒμŠ€νŠΈν•˜κ³  열을 μ˜¬λ°”λ₯΄κ²Œ κ°€μ Έμ™”λŠ”μ§€ 확인해야 ν•œλ‹€.
  • 2019λ…„ 인ꡬ μΆ”μ •μΉ˜λŠ” 2018λ…„ μΆ”μ •μΉ˜μ™€ μΆœμƒμž 수, μ‚¬λ§μž 수, 이주 및 μž”μ—¬ μš”μΈμ— λŒ€ν•œ μ—΄μ˜ 합계와 κ°™μ•„μ•Ό ν•œλ‹€.
1
2
3
4
5
6
7
8
9
10
11
SELECT county_name AS county,
	state_name AS state,
	pop_est_2019 AS pop,
	pop_est_2018 + births_2019 - deaths_2019 +
		international_migr_2019 + domestic_migr_2019 +
		residual_2019 AS components_total,
	pop_est_2019 - (pop_est_2018 + births_2019 - deaths_2019 +
		international_migr_2019 + domestic_migr_2019 +
		residual_2019) AS difference
FROM us_counties_pop_est_2019
ORDER BY difference DESC;
  • pop 열은 2019λ…„μ˜ 인ꡬ μΆ”μ •μΉ˜μ΄κ³ , components_total 열은 2018λ…„ 인ꡬ μΆ”μ •μΉ˜μ— 인ꡬ λ³€ν™” μš”μ†Œλ₯Ό λ”ν•œ 값이닀.
  • differenceλŠ” pop - components_total와 κ°™λ‹€.
  • λ‹€μŒκ³Ό 같이 differenct 값이 0으둜 λͺ¨λ“  행에 좜λ ₯λ˜λ―€λ‘œ, κ°€μ Έμ˜¨ 데이터가 κΉ¨λ—ν•˜λ‹€κ³  ν•  수 μžˆλ‹€.
countystatepopcomponents_totaldifference
Autauga CountyAlabama55,86955,8690
Baldwin CountyAlabama223,234223,2340
Barbour CountyAlabama24,68624,6860
Bibb CountyAlabama22,39422,3940
  • μƒˆλ‘œμš΄ 데이터셋을 κ°€μ Έμ˜€λ©΄ μ΄λ ‡κ²Œ μž‘μ€ ν…ŒμŠ€νŠΈλ₯Ό μˆ˜ν–‰ν•˜λŠ” 것이 μ€‘μš”ν•˜λ‹€.
  • 뢄석을 μ‹œμž‘ν•˜κΈ° 전에 데이터λ₯Ό 잘 μ΄ν•΄ν•˜κ³  잠재적인 문제λ₯Ό λ°©μ§€ν•  수 μžˆλ‹€.

λ°μ΄ν„°μ˜ 전체 λ°±λΆ„μœ¨ κ΅¬ν•˜κΈ°

1
2
3
4
5
SELECT county_name AS county,
	state_name AS state,
	area_water::numeric / (area_land + area_water) * 100 AS pct_water
FROM us_counties_pop_est_2019
ORDER BY pct_water DESC;
  • μΉ΄μš΄ν‹°μ—μ„œ 물이 μ°¨μ§€ν•˜λŠ” 면적 λΉ„μœ¨ κ΅¬ν•˜λŠ” 쿼리문이닀.
  • 데이터λ₯Ό μ €μž₯λ˜μ–΄ μžˆλŠ” κ·ΈλŒ€λ‘œ integer νƒ€μž…μœΌλ‘œ μ‚¬μš©ν•˜λ©΄ μ›ν•˜λŠ” κ²°κ³Όλ₯Ό μ–»μ§€ λͺ»ν•œλ‹€.
  • μ •μˆ˜λ₯Ό μ •μˆ˜λ‘œ λ‚˜λˆ„κΈ° λ•Œλ¬Έμ— λͺ¨λ“  ν–‰μ˜ 결과둜 0이 λ‚˜μ˜€κ²Œ λœλ‹€.
  • λŒ€μ‹  μ •μˆ˜ 쀑 ν•˜λ‚˜λ₯Ό numeric νƒ€μž…μœΌλ‘œ λ³€ν™˜ν•˜λ©΄ 결과도 μ†Œμˆ˜λ‘œ λ‚˜μ˜¨λ‹€.
  • μœ„ λͺ…λ Ήλ¬Έμ—μ„œλŠ” μ½”λ“œλ₯Ό 짧게 ν•˜κΈ° μœ„ν•΄ :: μ—°μ‚°μžλ₯Ό μ‚¬μš©ν–ˆλ‹€.
  • κ²°κ³Όλ₯Ό ν‘œμ‹œν•˜κΈ° μœ„ν•΄ 100을 κ³±ν•˜λ©΄ λͺ¨λ‘κ°€ μ•Œκ³  μžˆλŠ” λ°±λΆ„μœ¨μ΄ ν‘œμ‹œλœλ‹€.
countystatepct_water
Keweenaw CountyMichigan90.9472374745
Leelanau CountyMichigan86.2885896812
Nantucket CountyMassachusetts84.7969249919
St. Bernard ParishLouisiana82.483711492

λ³€ν™”μœ¨ κ³„μ‚°ν•˜κΈ°

  • 데이터 λΆ„μ„μ˜ 또 λ‹€λ₯Έ 핡심 μ§€ν‘œλŠ” μ‹œκ°„μ— λ”°λ₯Έ λ³€ν™”λ₯Ό λ‚˜νƒ€λ‚΄λŠ” λ³€ν™”μœ¨μ΄λ‹€.
  • (new_value - prev_value) / prev_value와 같이 κ³„μ‚°ν•œλ‹€.
  • κ·Έ μ˜ˆμ‹œλŠ” λ‹€μŒκ³Ό κ°™λ‹€.
  1. μžλ™μ°¨ μ œμ‘°μ‚¬ 별 판맀 λŒ€μˆ˜ μ „λ…„ 동기 λŒ€λΉ„ λ³€ν™”
  2. λ§ˆμΌ€νŒ… νšŒμ‚¬κ°€ μš΄μ˜ν•˜λŠ” 메일링 리슀트 μ›”κ°„ κ΅¬λ…μž 수 λ³€ν™”
  3. μ „κ΅­ ν•™κ΅μ˜ μ—°κ°„ 등둝생 수 증감
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
CREATE TABLE percent_change (
	department text,
	spend_2019 numeric(10,2),
	spend_2022 numeric(10,2)
);

INSERT INTO percent_change
VALUES
	('Assessor', 178556, 179500),
	('Building', 250000, 289000),
	('Clerk', 451980, 650000),
	('Library', 87777, 90001),
	('Parks', 250000, 223000),
	('Water', 199000, 195000);

SELECT department,
	   spend_2019,
	   spend_2022,
	   round( (spend_2022 - spend_2019) /
					spend_2019 * 100, 1) AS pct_change
FROM percent_change;
  • κ°€μƒμ˜ μ§€λ°© μ •λΆ€ λΆ€μ„œμ—μ„œμ˜ μ§€μΆœκ³Ό κ΄€λ ¨λœ ν…ŒμŠ€νŠΈ 데이터에 λŒ€ν•œ λ³€ν™”μœ¨μ„ μ‘°νšŒν•˜λŠ” λͺ…령문이닀.
  • μœ„ μ½”λ“œμ—μ„œλŠ” μ†Œμˆ˜μ  μ•„λž˜ ν•œ 자리만 좜λ ₯ν•˜λ„λ‘ round() ν•¨μˆ˜λ₯Ό μ‚¬μš©ν–ˆλ‹€.
departmentspend_2019spend_2022pct_change
Assessor178,556179,5000.5
Building250,000289,00015.6
Clerk451,980650,00043.8
Library87,77790,0012.5
Parks250,000223,000-10.8
Water199,000195,000-2

평균 및 총합 집계 ν•¨μˆ˜ μ‚¬μš©ν•˜κΈ°

1
2
3
SELECT sum(pop_est_2019) AS county_sum,
	round(avg(pop_est_2019), 0) AS county_average
FROM us_counties_pop_est_2019;
  • SQLμ—μ„œλŠ” λ™μΌν•œ μ—΄ λ‚΄μ˜ 값듀을 λͺ¨μ•„ 계산할 μˆ˜λ„ μžˆλ‹€.
  • κ°€μž₯ 많이 μ‚¬μš©λ˜λŠ” ν•¨μˆ˜λŠ” avg() λ˜λŠ” sum()이닀.
  • κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™λ‹€.
county_sumcounty_average
328,239,523104,468

쀑앙값 μ°ΎκΈ°

  • 숫자 μ§‘ν•©μ˜ 쀑앙값은 평균 λ§ŒνΌμ΄λ‚˜ μ€‘μš”ν•œ μ§€ν‘œλ‹€.
  • 평균값과 μ€‘μ•™κ°’μ˜ μ •μ˜λŠ” λ‹€μŒκ³Ό κ°™λ‹€.
  1. 평균값: λͺ¨λ“  κ°’μ˜ κ°’μ˜ 개수둜 λ‚˜λˆˆ κ°’
  2. 쀑앙값: μ •λ ¬λœ κ°’ μ§‘ν•©μ˜ 쀑간 κ°’
  • 쀑앙값이 μ€‘μš”ν•œ μ΄μœ λŠ” 그것이 특이치의 영ν–₯을 κ°μ†Œμ‹œν‚€κΈ° λ•Œλ¬Έμ΄λ‹€.
  • κ°€λ Ή 6λͺ…이 μ²΄ν—˜ ν•™μŠ΅μ„ κ°„λ‹€κ³  κ°€μ •ν•˜μž.
  • (10 + 11 + 10 + 9 + 13 + 12) / 6 = 10.8
  • 연령이 쒁은 λ²”μœ„ 내에 λΆ„ν¬λ˜μ–΄ 있기 λ•Œλ¬Έμ— μœ„ 평균값은 그룹을 잘 λŒ€ν‘œν•œλ‹€.
  • λ§Œμ•½ λ‚˜μ΄ λ§Žμ€ λ³΄ν˜Έμžκ°€ μ°Έκ°€ν•œλ‹€λ©΄ 이야기가 달라진닀.
  • (10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 = 15.9``
  • 46μ΄λΌλŠ” νŠΉμ΄μΉ˜κ°€ 그룹을 μ™œκ³‘ν•˜μ—¬ μ΄μ œλŠ” 평균값은 μ‹ λ’°ν•  수 μ—†λŠ” μ§€ν‘œκ°€ λ˜μ—ˆλ‹€.
  • 이럴 λ•Œ 쀑앙값이 μš”κΈ΄ν•˜κ²Œ 쓰인닀.
  • μ •λ ¬λœ λͺ©λ‘μ—μ„œ 쀑간에 μžˆλŠ” 값은 11둜, 그룹의 μ—°λ Ή 뢄포λ₯Ό κ°μ•ˆν•˜λ©΄ 평균값 15.9λ³΄λ‹€λŠ” 쀑앙값 11이 κ·Έλ£Ή λ‚΄μ˜ 일반적인 연령을 λ”μš± 잘 보여쀀닀.
  • 쒋은 ν…ŒμŠ€νŠΈλŠ” κ°’ 그룹에 λŒ€ν•œ 평균과 쀑앙값을 κ³„μ‚°ν•˜λŠ” 것이닀.
  • 두 값이 κ°€κΉŒμš°λ©΄ 그룹이 μ •κ·œ 뢄포λ₯Ό λ”°λ₯΄λ‹ˆ 평균이 μœ μš©ν•˜κ³ , μ•„λ‹ˆλΌλ©΄ 쀑앙값이 더 λ‚˜μ€ ν‘œν˜„μ΄λ‹€.
  • 참고둜 λ°μ΄ν„°μ˜ κ°œμˆ˜κ°€ 짝수 κ°œμ—¬μ„œ 논리적인 쀑앙값이 두 개라면, κ·Έ 두 κ°’μ˜ 평균이 쀑앙값이 λœλ‹€.

λ°±λΆ„μœ„μˆ˜ ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ 쀑앙값 μ°ΎκΈ°

  • λŒ€λΆ€λΆ„μ˜ κ΄€κ³„ν˜• λ°μ΄ν„°λ² μ΄μŠ€μ™€ λ§ˆμ°¬κ°€μ§€λ‘œ PostgreSQLμ—μ„œλŠ” 쀑앙값을 λ°˜ν™˜ν•˜λŠ” median() ν•¨μˆ˜κ°€ λ‚΄μž₯λ˜μ–΄ μžˆμ§€ μ•Šλ‹€.
  • λŒ€μ‹  μš°λ¦¬λŠ” SQL λ°±λΆ„μœ„μˆ˜ ν•¨μˆ˜λ₯Ό μ΄μš©ν•˜μ—¬ 쀑앙값을 μ°Ύκ³  λΆ„μœ„μˆ˜ λ˜λŠ” μ ˆλ‹¨μ μ„ μ΄μš©ν•΄ 숫자 그룹을 λ™μΌν•œ 크기둜 λ‚˜λˆŒ 수 μžˆλ‹€.
  • ν†΅κ³„μ—μ„œ λ°±λΆ„μœ„μˆ˜λŠ” μ •λ ¬λœ 데이터 λ‚΄μ—μ„œ νŠΉμ • λΉ„μœ¨μ˜ 데이터가 λ°œκ²¬λ˜λŠ” 지점을 λ‚˜νƒ€λ‚Έλ‹€.
  • 쀑앙값은 50번째 λ°±λΆ„μœ„μˆ˜κ°€ λ™μΌν•˜λ‹€.
  • percentile_cont(n), percentile_disc(n)λΌλŠ” 두 λ²„μ „μ˜ λ°±λΆ„μœ„μˆ˜ ν•¨μˆ˜κ°€ μžˆλ‹€.
  • 두 ν•¨μˆ˜ λͺ¨λ‘ ANSI SQL의 일뢀이닀.
1
2
3
4
5
6
7
8
9
10
11
12
13
CREATE TABLE percentile_test (
	numbers integer
);

INSERT INTO percentile_test (numbers) VALUES
	(1), (2), (3), (4), (5), (6);

SELECT
	percentile_cont(.5)
	WITHIN GROUP (ORDER BY numbers),
	percentile_disc(.5)
	WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;
  • percentile_cont() ν•¨μˆ˜λŠ” λ°±λΆ„μœ„μˆ˜λ₯Ό 연속 κ°’μœΌλ‘œ κ³„μ‚°ν•˜κΈ° λ•Œλ¬Έμ— κ²°κ³Όκ°€ μ§‘ν•© λ‚΄μ˜ μˆ«μžλ“€λ‘œλ§Œ ν‘œμ‹œλ˜μ§€λŠ” μ•ŠλŠ”λ‹€.
  • 예λ₯Ό λ“€μ–΄ 짝수 개의 쀑앙값을 κ³„μ‚°ν•˜λ“―, 두 쀑간 숫자의 평균을 λ‚˜νƒ€λ‚Έλ‹€.
  • 반면 percentile_disc() ν•¨μˆ˜λŠ” μ§‘ν•©μ˜ 숫자 쀑 ν•˜λ‚˜λ‘œ 반올림 λœλ‹€.
  • κ²°κ³ΌλŠ” λ‹€μŒκ³Ό κ°™λ‹€.
percentile_contpercentile_disc
3.53
  • 상황에 따라 λ‹€λ₯΄κ² μ§€λ§Œ, λ³Έλ”” 쀑앙값 κ°œλ…μ— μΆ©μ‹€ν•œ percentile_cont() ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜λŠ” 것이 ꢌμž₯λœλ‹€.

인ꡬ 쑰사 λ°μ΄ν„°λ‘œ 쀑앙값 및 λ°±λΆ„μœ„μˆ˜ κ³„μ‚°ν•˜κΈ°

1
2
3
4
5
SELECT sum(pop_est_2019) AS county_sum,
	round(avg(pop_est_2019), 0) AS county_average,
	percentile_cont(.5)
	WITHIN GROUP (ORDER BY pop_est_2019) AS county_median
FROM us_counties_pop_est_2019;
county_sumcounty_averagecounty_median
328,239,523104,46825,726
  • 쀑앙값과 평균값이 멀리 λ–¨μ–΄μ Έ μžˆλŠ”λ°, μ΄λŠ” 평균값이 μ˜€λ„λ  수 μžˆμŒμ„ μ˜λ―Έν•œλ‹€.
  • λ‹¨μˆœνžˆ μ§‘κ³„μΉ˜λ₯Ό μ œκ³΅ν•  것이 μ•„λ‹ˆλΌ, μ˜λ―ΈμžˆλŠ” μ§‘κ³„μΉ˜λ₯Ό μ œκ³΅ν•˜λŠ” 것이 μ˜¬λ°”λ₯΄λ‹€.

λ°±λΆ„μœ„μˆ˜ ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜λ©° λ‹€λ₯Έ λΆ„μœ„μˆ˜ μ°ΎκΈ°

  • 데이터λ₯Ό λ™μΌν•œ 크기의 더 μž‘μ€ κ·Έλ£Ήλ“€λ‘œ λΆ„ν• ν•  μˆ˜λ„ μžˆλ‹€.
  • κ°€μž₯ 일반적인 것은 μ‚¬λΆ„μœ„μˆ˜, μ˜€λΆ„μœ„μˆ˜, μ‹­λΆ„μœ„μˆ˜μ΄λ‹€.
  • κ°œλ³„ 값을 찾으렀면 percentile_cont(.25)와 같이 λ°±λΆ„μœ„μˆ˜ ν•¨μˆ˜μ— μ—°κ²°ν•˜κΈ°λ§Œ ν•˜λ©΄ λœλ‹€.
  • κ·ΈλŸ¬λ‚˜ μ—¬λŸ¬ 개의 μ ˆλ‹¨μ μ„ μƒμ„±ν•˜λ €λŠ” κ²½μš°μ—λŠ” 배열을 톡해 λ‹€μŒκ³Ό 같이 μ‚¬μš©ν•  수 μžˆλ‹€.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- μ‚¬λΆ„μœ„μˆ˜
SELECT percentile_cont(ARRAY[.25,.5,.75])
	WITHIN GROUP (ORDER BY pop_est_2019) AS quartiles
FROM us_counties_pop_est_2019;

-- μ˜€λΆ„μœ„μˆ˜
SELECT percentile_cont(ARRAY[.2,.4,.6,.8])
	WITHIN GROUP (ORDER BY pop_est_2019) AS quintiles
FROM us_counties_pop_est_2019;

-- μ‹­λΆ„μœ„μˆ˜
SELECT percentile_cont(ARRAY[.1,.2,.3,.4,.5,.6,.7,.8,.9])
	WITHIN GROUP (ORDER BY pop_est_2019) AS deciles
FROM us_counties_pop_est_2019;
  • 배열은 ANSI SQL ν‘œμ€€μ΄λ©°, μ—¬κΈ°μ—μ„œ μ‚¬μš©ν•˜λŠ” 문법은 PostgreSQLμ—μ„œ 배열을 μ‚¬μš©ν•˜λŠ” μ—¬λŸ¬ 방법 쀑 ν•˜λ‚˜μ΄λ‹€.
1
2
3
4
5
SELECT unnest(
	percentile_cont(ARRAY[.25,.5,.75])
	WITHIN GROUP (ORDER BY pop_est_2019)
	) AS quartiles
FROM us_counties_pop_est_2019;
  • unnest() ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ 배열을 μ—¬λŸ¬ 개의 ν–‰μœΌλ‘œ 좜λ ₯ν•  μˆ˜λ„ μžˆλ‹€.

μ΅œλΉˆκ°’ μ°ΎκΈ°

1
2
SELECT mode() WITHIN GROUP (ORDER BY births_2019)
FROM us_counties_pop_est_2019;
  • PostgreSQLμ—μ„œλŠ” mode() ν•¨μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ μ—΄μ—μ„œ κ°€μž₯ 많이 λ“±μž₯ν•˜λŠ” 값인 μ΅œλΉˆκ°’μ„ ꡬ할 수 μžˆλ‹€.
  • WITHIN GROUP (ORDER BY ...)λŠ” 집계 ν•¨μˆ˜κ°€ 계산에 μ‚¬μš©ν•  κ°’μ˜ μˆœμ„œλ₯Ό μ§€μ •ν•˜λŠ” 것이닀.
  • κ°€λ Ή 데이터가 [1, 2, 2, 3, 3]일 λ•Œ 2와 3에 동λ₯ μ΄ μ‘΄μž¬ν•˜λŠ”λ°, WITHIN GROUP (ORDER BY column ASC)라면 2κ°€ λ‚˜μ˜¬ 것이고, WITHIN GROUP (ORDER BY column DESC)라면 3이 λ‚˜μ˜¬ 것이닀.
This post is licensed under CC BY 4.0 by the author.