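Using matplotlib's plotting functions

matplotlib is a widely used Python data visualization library for plotting the results of analyses done with numpy or pandas. Through the pandas plot() methods, which are built on matplotlib, data held in a DataFrame or Series can be turned into many kinds of plots.

Before drawing plots in an IPython Notebook, set the plotting backend with the %matplotlib magic command. Running %matplotlib nbagg makes the plots generated in the notebook interactive. Running %matplotlib inline, by contrast, renders each plot as a static image: once a cell has produced a plot, it can no longer be manipulated.

This lecture uses %matplotlib nbagg.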

%matplotlib nbagg
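
The code in these notes refers to np, pd, and plt without showing where they come from. A minimal setup cell, assuming the conventional aliases for the three libraries, might look like this:

```python
import numpy as np                # numerical arrays and random numbers
import pandas as pd               # Series and DataFrame
import matplotlib.pyplot as plt   # matplotlib's plotting interface
```

Run this cell once, before any of the plotting cells.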

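Line plot

A line plot connects consecutive data points with straight line segments. Use a line plot to show how a dependent variable Y changes as an independent variable X changes.

Create a Series s of random values together with an index and call s.plot(); this draws a line plot using the index of s for the horizontal axis and its values for the vertical axis. Calling plt.plot(s), where plt is the imported matplotlib.pyplot module, draws the same line.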

s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
s.plot()
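
That s.plot() and plt.plot(s) draw the same line can be checked by plotting both on separate axes and comparing the line data. This is only a sketch; the names ax_left, ax_right, same_x, and same_y are introduced here for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))

fig, (ax_left, ax_right) = plt.subplots(1, 2)
s.plot(ax=ax_left)     # pandas plotting method, drawn on the left axes
ax_right.plot(s)       # matplotlib called directly on the Series

# Compare the x and y data of the two resulting lines.
line_pandas = ax_left.get_lines()[0]
line_mpl = ax_right.get_lines()[0]
same_x = np.allclose(line_pandas.get_xdata(), line_mpl.get_xdata())
same_y = np.allclose(line_pandas.get_ydata(), line_mpl.get_ydata())
```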

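The resulting plot can be manipulated interactively: you can pan the visible region, zoom in on a particular area, and export the current view as an image file. Clicking the blue button at the top right of the plot ends the interaction, after which the plot can no longer be modified.

Create a DataFrame df of random values together with an index and columns and call df.plot(); this draws one line per column, using the index of df and the values of each column. You can confirm that as many lines appear as df has columns. Calling plt.plot(df) draws the same lines, although without the legend that df.plot() adds automatically.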

df = pd.DataFrame(np.random.randn(10, 4).cumsum(axis=0),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))
df.plot()

To draw a line plot for just one column, extract that column as a Series and plot it as follows.

df["B"].plot()

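Bar plot

A bar plot represents values as bars. Both line plots and bar plots show how a dependent variable Y changes with an independent variable X; a line plot suits an X that takes continuous numeric values, while a bar plot is useful when X takes only a finite number of values.

Create a Series s2 of random values together with an index and call s2.plot(kind="bar"); this draws a vertical bar plot using the index and values of s2.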

s2 = pd.Series(np.random.rand(16), index=list("abcdefghijklmnop"))
s2.plot(kind="bar")

To draw the bars horizontally instead, call s2.plot(kind="barh").

s2.plot(kind="barh")

Create a DataFrame df2 of random values together with an index and columns and call df2.plot(kind="bar"); this draws a grouped bar plot using the index of df2 and the values of each column. For each index label, the values of the columns appear as a group of side-by-side bars.

df2 = pd.DataFrame(np.random.rand(6, 4), 
                   index=["one", "two", "three", "four", "five", "six"],
                   columns=pd.Index(["A", "B", "C", "D"], name="Genus"))
df2.plot(kind="bar")

Passing stacked=True when drawing a bar plot stacks the column values for each index label into a single bar. This is useful for comparing the relative share of each column within an index label.

df2.plot(kind="barh", stacked=True)
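
A stacked bar plot shows absolute values, so the bars have different total lengths. To compare relative shares directly, each row can first be divided by its row sum. The sketch below assumes a DataFrame shaped like df2; the name shares is introduced here for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df2 = pd.DataFrame(np.random.rand(6, 4),
                   index=["one", "two", "three", "four", "five", "six"],
                   columns=pd.Index(["A", "B", "C", "D"], name="Genus"))

# Divide every row by its row sum so that each stacked bar has total length 1.
shares = df2.div(df2.sum(axis=1), axis=0)
ax = shares.plot(kind="barh", stacked=True)
```

Every bar now spans the same length, so the segment widths can be read as proportions.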

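Histogram

A histogram divides the range of values a single variable X can take into several intervals and shows the number of values falling into each interval as a bar. Drawing a histogram from a Series does not require an explicit index; the values alone are enough.

Create a Series s3 of random values and call s3.hist(); this draws a histogram of the values of s3.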

s3 = pd.Series(np.random.normal(0, 1, size=200))
s3.hist()

The number of intervals used for counting defaults to 10. Each interval is called a 'bin'. You can also set the number of bins yourself when drawing a histogram, as follows.

s3.hist(bins=50)

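Passing density=True draws a normalized histogram: instead of showing the raw count in each bin, the bar heights are scaled so that the total area of the bars is 1. (Older versions of matplotlib called this argument normed; it was removed in matplotlib 3.1.)

s3.hist(bins=100, density=True)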
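
A normalized histogram can be checked numerically with np.histogram() and its density=True option: the bar heights multiplied by the bin widths should sum to 1. This sketch works on the numbers only, without drawing a plot; the variable names are illustrative.

```python
import numpy as np

values = np.random.normal(0, 1, size=200)

# heights[i] is the normalized height of bin i; edges holds the bin boundaries.
heights, edges = np.histogram(values, bins=100, density=True)
widths = np.diff(edges)

# Total area of all bars (height x width); equals 1 up to rounding error.
total_area = float(np.sum(heights * widths))
```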

Scatter plot

Line plots and bar plots aim to show how a dependent variable Y changes with an independent variable X. A scatter plot is instead typically used to examine the relationship between two different variables X1 and X2. It plots each observation as a point on a 2D plane, with the values of X1 and X2 as its two coordinates.

Create two arrays of random values, then join them column-wise with the np.concatenate() function.

x1 = np.random.normal(1, 1, size=(100, 1))
x2 = np.random.normal(-2, 4, size=(100, 1))
X = np.concatenate((x1, x2), axis=1)

Building a new DataFrame df3 from the resulting array X gives two columns, 'x1' and 'x2'. Calling plt.scatter(df3["x1"], df3["x2"]) draws a scatter plot of the values in the two columns.

df3 = pd.DataFrame(X, columns=["x1", "x2"])
plt.scatter(df3["x1"], df3["x2"])

In the resulting scatter plot, each data point is placed using its 'x1' value on the horizontal axis and its 'x2' value on the vertical axis.
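
The same scatter plot can also be drawn through the pandas plotting interface by naming the two columns. The sketch below rebuilds df3 so that it runs on its own; points is a name introduced here to read the plotted coordinates back.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x1 = np.random.normal(1, 1, size=(100, 1))
x2 = np.random.normal(-2, 4, size=(100, 1))
df3 = pd.DataFrame(np.concatenate((x1, x2), axis=1), columns=["x1", "x2"])

# kind="scatter" needs the column names for the horizontal and vertical axes.
ax = df3.plot(kind="scatter", x="x1", y="x2")

# Each row of points is one (x1, x2) pair drawn on the axes.
points = ax.collections[0].get_offsets()
```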

Customizing plot appearance

Figure, subplots, and axes

matplotlib manages plots in a drawing unit called a 'figure', which can hold one or more plots. Each plotting area inside a figure is called a 'subplot'.

To create a new figure yourself, use the plt.figure() function. To add a subplot to a figure named fig, call fig.add_subplot() and store its return value in a new variable.

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)

fig.add_subplot() takes three arguments. The first two specify how many rows and columns the subplots in the figure are arranged in. The last is the number of the position, within that layout, where this subplot is placed; positions are numbered from 1, left to right and then top to bottom.

The return value of fig.add_subplot() is stored in the variable ax1, which represents the empty coordinate plane drawn in that subplot. matplotlib calls this coordinate plane 'axes'. Once axes have been created in a subplot of the figure, you can draw plots on them.

In the same way, you can call fig.add_subplot() repeatedly to create new axes at each subplot position and get them ready for plotting.

ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

Now let's draw actual plots on the axes of these subplots. Calling plt.plot() draws on the current axes of the active figure, which is the most recently created one, here ax3.

plt.plot(np.random.randn(50).cumsum())
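
Which axes plt.plot() targets can be inspected with plt.gca() and changed with plt.sca(). The following sketch rebuilds the three subplots so that it runs on its own; current_before is a name introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

# The current axes is the one created most recently.
current_before = plt.gca()
plt.plot(np.random.randn(50).cumsum())   # drawn on ax3

# Make ax1 the current axes, then draw with plt.plot() again.
plt.sca(ax1)
plt.plot(np.random.randn(50).cumsum())   # drawn on ax1
```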

In contrast, calling a plotting method directly on a specific axes variable, such as ax1.hist(), draws the plot on that axes.

ax1.hist(np.random.randn(100), bins=20)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))

The plt.subplots() function offers a more convenient way to create a figure and its subplots in one step.

fig, axes = plt.subplots(2, 3)

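Running the code above, for example, arranges six subplots in a 2x3 grid inside the figure fig and creates axes in each of them. The returned axes variable holds a 2x3 array with the same layout as the subplots; each element is the axes at the corresponding position, so you can use it to draw a plot wherever you want.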
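
As a small illustration of indexing into the array returned by plt.subplots(), the sketch below draws on two of the six axes by row and column position. The choice of positions and plot types is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3)

# Row 0, column 1: a random-walk line plot.
axes[0, 1].plot(np.random.randn(50).cumsum())

# Row 1, column 2: a histogram with 20 bins.
axes[1, 2].hist(np.random.randn(100), bins=20)

grid_shape = axes.shape   # (rows, columns) of the axes array
```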

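Color, markers, and line style

For line plots, you can specify the line color, the marker symbol, and the line style.

When drawing a line plot with plt.plot(), pass the color, marker, and linestyle arguments to set the line color, the symbol marking each point, and the line style, respectively.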

plt.plot(np.random.randn(30), color="g", marker='o', linestyle="--")

The values accepted by each of these arguments are defined by matplotlib. For example, color="g" gives green, marker='o' gives circle markers, and linestyle="--" gives a dashed line.

The main color, marker, and linestyle values available in matplotlib are summarized in the appendix of these lecture notes.

As a shorthand, you can also pass the color, marker, and line style codes of a line plot as a single combined string.

plt.plot(np.random.randn(30), "k.-")
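
The combined string is read as color, marker, and line style codes, so "k.-" should give the same styling as passing the three keyword arguments separately. A short sketch that compares the two forms (line_short and line_long are illustrative names):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
data = np.random.randn(30)

# Shorthand: black ("k"), point markers ("."), solid line ("-").
(line_short,) = ax.plot(data, "k.-")

# The same styling written out with keyword arguments.
(line_long,) = ax.plot(data, color="k", marker=".", linestyle="-")

short_style = (line_short.get_color(), line_short.get_marker(), line_short.get_linestyle())
long_style = (line_long.get_color(), line_long.get_marker(), line_long.get_linestyle())
```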

Bar plots, histograms, and scatter plots also accept options such as a color and an alpha (opacity) value. Run the following code and observe the result.

fig, axes = plt.subplots(2, 1)
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop'))
data.plot(kind="bar", ax=axes[0], color='k', alpha=0.7)
data.plot(kind="barh", ax=axes[1], color='g', alpha=0.3)

Ticks, labels, and legends

You can modify the ticks, labels, and legend of a plot. First create a figure, add a single subplot axes to it, and draw a line plot of the cumulative sum of random values on it.

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum())

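matplotlib calls the marks along the horizontal and vertical axes of a plot 'ticks'. Specifically, ticks on the horizontal axis are 'xticks' and ticks on the vertical axis are 'yticks'. To change the tick positions on the horizontal axis, use the ax.set_xticks() function.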

ticks = ax.set_xticks([0, 250, 500, 750, 1000])

With the ax.set_xticklabels() function, you can replace the numeric tick labels on the horizontal axis with string labels.

labels = ax.set_xticklabels(["one", "two", "three", "four", "five"],
                            rotation=30, fontsize="small")

To change the ticks on the vertical axis, use ax.set_yticks() and the related functions in exactly the same way.
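
For the vertical axis, a sketch along the same lines could look as follows. The tick positions and the labels "low", "zero", and "high" are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(np.random.randn(1000).cumsum())

# Fix the tick positions on the vertical axis, then relabel them with strings.
ax.set_yticks([-50, 0, 50])
ax.set_yticklabels(["low", "zero", "high"], fontsize="small")

# Read the ticks back to confirm what was set.
ytick_positions = [float(t) for t in ax.get_yticks()]
ytick_texts = [label.get_text() for label in ax.get_yticklabels()]
```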

To give the axes a title, use the ax.set_title() function. To name the horizontal and vertical axes, use ax.set_xlabel() and ax.set_ylabel(), respectively.

ax.set_title("Random walk plot")
ax.set_xlabel("Stages")
ax.set_ylabel("Values")

When a single axes contains many plots, a legend is needed to tell them apart. First, create a single subplot axes in a new figure.

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

Next, add three random walk plots to the axes ax. Pass a label argument in each ax.plot() call; its value is the name shown for that plot when the legend is displayed later.

ax.plot(np.random.randn(1000).cumsum(), 'k', label="one")
ax.plot(np.random.randn(1000).cumsum(), "b--", label="two")
ax.plot(np.random.randn(1000).cumsum(), "r.", label="three")

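Calling ax.legend() displays the legend on the axes. Passing loc="best" places the legend automatically at the position judged best for the current axes, that is, where it overlaps the plotted data the least.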

ax.legend(loc="best")
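
How the label values end up in the legend can be seen by reading the legend entries back from the object that ax.legend() returns. This sketch repeats the three plots so that it runs on its own; legend_labels is an illustrative name.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(np.random.randn(1000).cumsum(), "k", label="one")
ax.plot(np.random.randn(1000).cumsum(), "b--", label="two")
ax.plot(np.random.randn(1000).cumsum(), "r.", label="three")

# ax.legend() returns the Legend object; its texts are the label values.
legend = ax.legend(loc="best")
legend_labels = [text.get_text() for text in legend.get_texts()]
```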

To change the range of values shown on the horizontal and vertical axes of the current axes, use the ax.set_xlim() and ax.set_ylim() functions.

ax.set_xlim([100, 900])
ax.set_ylim([-100, 100])
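
Setting the limits changes only the visible window; the plotted data itself is left untouched. A brief sketch that reads the limits back with ax.get_xlim() and ax.get_ylim() (the variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot(np.random.randn(1000).cumsum())

ax.set_xlim([100, 900])
ax.set_ylim([-100, 100])

x_range = tuple(float(v) for v in ax.get_xlim())   # visible horizontal range
y_range = tuple(float(v) for v in ax.get_ylim())   # visible vertical range
n_points = len(line.get_ydata())                   # all points are still stored
```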

Appendix: main color, marker, and linestyle values available in matplotlib

Main color values available in matplotlib

Value	Color
"b"	blue
"g"	green
"r"	red
"c"	cyan
"m"	magenta
"y"	yellow
"k"	black
"w"	white

Main marker values available in matplotlib

Value	Marker
"."	point
","	pixel
"o"	circle
"v"	triangle_down
"^"	triangle_up
"<"	triangle_left
">"	triangle_right
"8"	octagon
"s"	square
"p"	pentagon
"*"	star
"h"	hexagon
"+"	plus
"x"	x
"D"	diamond

Main linestyle values available in matplotlib

Value	Line style
"-"	solid line
"--"	dashed line
"-."	dash-dotted line
":"	dotted line
"None"	draw nothing

A taste of data visualization with matplotlib

Analyzing the Game of Thrones dataset

[Download the Game of Thrones dataset]

Summary of the main columns in the Game of Thrones dataset

battles.csv

name: String variable. The name of the battle.
year: Numeric variable. The year of the battle.
battle_number: Numeric variable. A unique ID number for the battle.
attacker_king: Categorical variable. The attacker's king. A slash indicates that the king changes over the course of the war. For example, "Joffrey/Tommen Baratheon" is coded as such because one king follows the other on the Iron Throne.
defender_king: Categorical variable. The defender's king.
attacker_1: String variable. Major house attacking.
attacker_2: String variable. Major house attacking.
attacker_3: String variable. Major house attacking.
attacker_4: String variable. Major house attacking.
defender_1: String variable. Major house defending.
defender_2: String variable. Major house defending.
defender_3: String variable. Major house defending.
defender_4: String variable. Major house defending.
attacker_outcome: Categorical variable. The outcome from the perspective of the attacker. Categories: win, loss, draw.
battle_type: Categorical variable. A classification of the battle's primary type. Categories: pitched_battle: Armies meet in a location and fight (this is also the baseline category); ambush: A battle where stealth or subterfuge was the primary means of attack; siege: A prolonged attack on a fortified position; razing: An attack against an undefended position.
major_death: Binary variable. If there was a death of a major figure during the battle.
major_capture: Binary variable. If there was a capture of a major figure during the battle.
attacker_size: Numeric variable. The size of the attacker's force. No distinction is made between the types of soldiers such as cavalry and footmen.
defender_size: Numeric variable. The size of the defender's force. No distinction is made between the types of soldiers such as cavalry and footmen.
attacker_commander: String variable. Major commanders of the attackers. Commanders' names are included without honorific titles, and commanders are separated by commas.
defender_commander: String variable. Major commanders of the defenders. Commanders' names are included without honorific titles, and commanders are separated by commas.
summer: Binary variable. Was it summer?
location: String variable. The location of the battle.
region: Categorical variable. The region where the battle takes place. Categories: Beyond the Wall, The North, The Iron Islands, The Riverlands, The Vale of Arryn, The Westerlands, The Crownlands, The Reach, The Stormlands, Dorne
note: String variable. Coding notes regarding individual observations.

character-deaths.csv

Name: character name
Allegiances: character house
Death Year: year character died
Book of Death: book character died in
Death Chapter: chapter character died in
Book Intro Chapter: chapter character was introduced in
Gender: 1 is male, 0 is female
Nobility: 1 is noble, 0 is a commoner
GoT: Appeared in first book
CoK: Appeared in second book
SoS: Appeared in third book
FfC: Appeared in fourth book
DwD: Appeared in fifth book

* Reference: https://www.kaggle.com/mylesoneill/game-of-thrones

Visualizing the number of character deaths by book number - line plot

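# Assumes character-deaths.csv is in the current working directory.
deaths = pd.read_csv("character-deaths.csv")
# Count the deaths recorded for each book, ordered by book number.
book_nums_to_death_count = deaths["Book of Death"].value_counts().sort_index()
ax1 = book_nums_to_death_count.plot(color="k", marker="o", linestyle="--")
ax1.set_xticks(np.arange(1, 6))
ax1.set_xlim([0, 6])
ax1.set_ylim([0, 120])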
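
The value_counts().sort_index() step is easier to follow on a tiny example. The values below are made up for illustration and are not taken from the dataset; the missing entries stand for characters with no recorded book of death.

```python
import pandas as pd

# Hypothetical "Book of Death" values; None marks a missing entry.
toy = pd.Series([1, 3, 3, None, 2, 3, 1, None], name="Book of Death")

# value_counts() ignores missing values and orders the result by count.
counts = toy.value_counts()

# sort_index() reorders the counts by book number instead.
counts_by_book = counts.sort_index()

book_numbers = counts_by_book.index.tolist()   # [1.0, 2.0, 3.0]
death_counts = counts_by_book.tolist()         # [2, 1, 3]
```

Sorting by index is what makes the line plot run from the first book to the last rather than from the most to the fewest deaths.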

Visualizing the difference in force size between attackers and defenders in large battles - stacked bar plot

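# Assumes battles.csv is in the current working directory.
battles = pd.read_csv("battles.csv")
battles = battles.set_index(["name"])
# Keep only battles whose combined force size exceeds 10,000.
large_battles_mask = battles["attacker_size"] + battles["defender_size"] > 10000
# .copy() lets new columns be added later without a chained-assignment warning.
large_battles = battles.loc[large_battles_mask, ["attacker_size", "defender_size"]].copy()
ax2 = large_battles.plot(kind="barh", stacked=True, fontsize=8)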

# Convert each side's force size into its share of the combined force.
total_size = large_battles["attacker_size"] + large_battles["defender_size"]
large_battles["attacker_pcts"] = large_battles["attacker_size"] / total_size
large_battles["defender_pcts"] = large_battles["defender_size"] / total_size
ax3 = large_battles[["attacker_pcts", "defender_pcts"]].plot(kind="barh", stacked=True, fontsize=8)

Visualizing how often each house was involved across all battles - histogram

col_names = battles.columns[4:12]
house_names = battles[col_names].fillna("None").values
house_names = np.unique(house_names)
house_names = house_names[house_names != "None"]
houses_to_battle_counts = pd.Series(0, index=house_names)

for col in col_names:
    houses_to_battle_counts = \
        houses_to_battle_counts.add(battles[col].value_counts(), fill_value=0)
ax4 = houses_to_battle_counts.hist(bins=10)
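
The accumulation loop can be traced on a small made-up table. The sketch below uses three hypothetical battles and only three of the eight house columns, and compares the loop with a shorter variant based on stack(), which is an alternative offered here rather than the approach used in these notes.

```python
import pandas as pd

# Hypothetical data: three battles, with None where no house is listed.
toy_battles = pd.DataFrame({
    "attacker_1": ["Lannister", "Stark", "Lannister"],
    "attacker_2": [None, "Tully", None],
    "defender_1": ["Tully", "Lannister", "Stark"],
})

# Same accumulation as above: add up the per-column counts for each house.
counts = pd.Series(0, index=["Lannister", "Stark", "Tully"])
for col in toy_battles.columns:
    counts = counts.add(toy_battles[col].value_counts(), fill_value=0)

# Alternative: stack all columns into one Series and count once.
# value_counts() ignores missing values.
counts_stacked = toy_battles.stack().value_counts()
```

Both give 3 appearances for Lannister and 2 each for Stark and Tully; the loop version holds them as floats because of fill_value.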
